[00:00:33] <icinga-wm>	 RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:27] <icinga-wm>	 RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:04] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:47] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:47] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:28] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435
[00:38:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435 (owner: 10TrainBranchBot)
[00:39:27] <icinga-wm>	 PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:31] <icinga-wm>	 PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:57:49] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/981435 (owner: 10TrainBranchBot)
[01:08:00] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:15:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:09] <wikibugs>	 (03PS1) 10Andrew Bogott: Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/981702
[01:51:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/981702 (owner: 10Andrew Bogott)
[02:01:19] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:24:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:24:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:25:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:26:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:39:09] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:35] <wikibugs>	 (03PS1) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284)
[02:42:51] <wikibugs>	 (03PS1) 10RLazarus: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284)
[03:05:29] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:06:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:09:09] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:12:02] <wikibugs>	 (03PS2) 10Stang: zhwiki: Remove abusefilter-view-private from rollbacker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398)
[03:33:23] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: update version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/981706 (https://phabricator.wikimedia.org/T326818)
[03:34:30] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:35:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update version in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/981706 (https://phabricator.wikimedia.org/T326818) (owner: 10Andrew Bogott)
[03:46:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:47:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:56:53] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:01:23] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:25:03] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:45] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:28:15] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:34:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:37:04] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:51:01] <wikibugs>	 (03PS1) 10KartikMistry: Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690)
[05:34:06] <wikibugs>	 (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787)
[05:34:34] <wikibugs>	 (03PS1) 10Marostegui: pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981711
[05:34:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787
[05:34:50] <stashbot>	 T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787
[05:35:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787
[05:35:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981711 (owner: 10Marostegui)
[05:35:31] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui)
[05:36:23] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981710 (https://phabricator.wikimedia.org/T351787) (owner: 10Marostegui)
[05:37:14] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]]
[05:37:18] <wikibugs>	 (03PS1) 10Marostegui: pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981712
[05:37:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/981712 (owner: 10Marostegui)
[05:46:43] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:46:47] <stashbot>	 T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787
[05:47:26] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[05:54:09] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] (duration: 16m 54s)
[05:54:13] <stashbot>	 T351787: Upgrade pc1 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351787
[05:55:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1011.eqiad.wmnet with OS bookworm
[06:03:50] <kart_>	 marostegui: OK to deploy MinT?
[06:05:30] <marostegui>	 kart_: go for it!
[06:06:40] <kart_>	 Thanks!
[06:07:02] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:07:06] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:07:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage
[06:07:25] <kart_>	 ah. I forgot to merge the patch ;)
[06:07:44] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690) (owner: 10KartikMistry)
[06:08:22] <wikibugs>	 (03PS1) 10Marostegui: Revert "pc1014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981727
[06:08:50] <wikibugs>	 (03Merged) 10jenkins-bot: Update MinT to 2023-12-08-151348-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/981709 (https://phabricator.wikimedia.org/T352690) (owner: 10KartikMistry)
[06:09:08] <wikibugs>	 (03PS1) 10Marostegui: Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981728
[06:10:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage
[06:13:13] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:13:21] <wikibugs>	 (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729
[06:13:29] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui)
[06:16:34] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:19:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:24:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981727 (owner: 10Marostegui)
[06:26:47] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:26:55] <marostegui>	 kart_: Can I deploy?
[06:27:00] <marostegui>	 Once you are done
[06:27:39] <kart_>	 marostegui: Yes. 
[06:27:56] <kart_>	 marostegui: MinT deployment is a little slow...
[06:28:21] <marostegui>	 kart_: no problem, let me know when I can
[06:28:40] <kart_>	 Sure
[06:28:52] <marostegui>	 thank you
[06:29:02] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:29:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1011.eqiad.wmnet with OS bookworm
[06:29:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/981728 (owner: 10Marostegui)
[06:30:11] <wikibugs>	 (03CR) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui)
[06:30:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui) Thank you! The host looks all green in Icinga!
[06:32:39] <wikibugs>	 (03PS1) 10MilkyDefer: Enable action blocks for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120)
[06:32:41] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer)
[06:34:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Is this still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup)
[06:34:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:34:36] <kart_>	 marostegui: done.
[06:34:40] <marostegui>	 kart_: thanks!
[06:34:42] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui)
[06:35:00] <_joe_>	 !log update sirenbot to 0.3.7
[06:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:26] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981729 (owner: 10Marostegui)
[06:35:43] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]]
[06:37:01] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:37:18] <wikibugs>	 (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981714 (https://phabricator.wikimedia.org/T353120) (owner: 10MilkyDefer)
[06:37:23] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:44:06] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:981729|Revert "ProductionServices.php: Promote pc1014 to pc1"]] (duration: 08m 22s)
[06:44:39] <marostegui>	 kart_: I am done with all my deployments
[06:45:27] <kart_>	 cool. Me too :)
[06:59:36] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864)
[07:05:30] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:07:15] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:33] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[07:12:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/981716 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[07:12:44] <marostegui>	 !log Failover m3-master from dbproxy1020 to dbproxy1026 org
[07:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:50] <marostegui>	 !log Failover m3-master from dbproxy1020 to dbproxy1026 T351864
[07:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:54] <stashbot>	 T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864
[07:24:11] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade
[07:24:24] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade
[07:31:38] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade
[07:31:41] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2185.codfw.wmnet with reason: reboot for upgrade
[07:34:31] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:39:30] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Keymanagement: SSH keys are in some cases not synced to LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: 10Slyngshede)
[07:39:32] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Keymanagement: SSH keys are in some cases not synced to LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: 10Slyngshede)
[07:41:33] <wikibugs>	 (03PS1) 10Slyngshede: C:idm:deployment restart Bitu on configuration changes. [puppet] - 10https://gerrit.wikimedia.org/r/981942
[07:43:20] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/857/con" [puppet] - 10https://gerrit.wikimedia.org/r/981942 (owner: 10Slyngshede)
[07:49:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:53:41] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: reboot for upgrade
[07:53:55] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: reboot for upgrade
[07:54:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T0800).
[08:00:05] <jouncebot>	 xSavitar and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:26] <kostajh>	 hi
[08:00:29] <xSavitar>	 o/
[08:01:34] <xSavitar>	 kostajh, do you want to deploy first?
[08:01:39] <kostajh>	 sure
[08:01:49] <xSavitar>	 Okay, ping me when you're done and I'll take it from there
[08:02:06] <kostajh>	 xSavitar: I can sync your patch but was wondering about the comment you left in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976252/23#message-943e233d7e8473f64d21d5f2948873072b55d999. Is anyone calling isTest()?
[08:02:59] <xSavitar>	 kostajh, I don't see any public consumers for now, per code search. So I'm going to test this on a debug host (internally) to make sure it's doing the right thing. So we're good.
[08:03:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[08:03:18] <xSavitar>	 If you want, you can sync it and I'll test too.
[08:03:31] <xSavitar>	 That's after your own patch is done.
[08:05:56] <wikibugs>	 (03Merged) 10jenkins-bot: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[08:06:19] <logmsgbot>	 !log kharlan@deploy2002 Started scap: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]]
[08:07:56] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:09:05] <wikibugs>	 (03PS5) 10Effie Mouzeli: [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney)
[08:09:12] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[08:10:45] <wikibugs>	 (03CR) 10JMeybohm: admin_ng: Add namespace and ClusterRole for Job sidecar controller (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus)
[08:11:38] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus)
[08:14:49] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney)
[08:15:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: reboot for upgrade
[08:15:35] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: reboot for upgrade
[08:16:14] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:16:15] <logmsgbot>	 !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:979969|MediaModeration: Set MediaModerationDeveloperMode to false]] (duration: 09m 55s)
[08:19:45] <kostajh>	 xSavitar: ok, on to your patch
[08:19:52] <xSavitar>	 Ack
[08:20:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[08:20:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981424 (https://phabricator.wikimedia.org/T304604) (owner: 10Kosta Harlan)
[08:21:02] <wikibugs>	 (03Merged) 10jenkins-bot: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[08:21:06] <wikibugs>	 (03Merged) 10jenkins-bot: IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981424 (https://phabricator.wikimedia.org/T304604) (owner: 10Kosta Harlan)
[08:21:20] <logmsgbot>	 !log kharlan@deploy2002 Started scap: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]]
[08:21:25] <stashbot>	 T347366: Follow-up on wmf-config "ClusterConfig::isTest" method - https://phabricator.wikimedia.org/T347366
[08:21:26] <stashbot>	 T304604: Set config for path to MaxMind files on production - https://phabricator.wikimedia.org/T304604
[08:22:30] <wikibugs>	 (03PS8) 10Brouberol: An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849)
[08:22:42] <logmsgbot>	 !log kharlan@deploy2002 kharlan and d3r1ck01: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:23:44] <xSavitar>	 kostajh, looks like I can test now?
[08:24:09] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:25:21] <kostajh>	 xSavitar: yes, please go ahead
[08:25:31] <xSavitar>	 Okay, testing now...
[08:27:50] <wikibugs>	 (03PS1) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639)
[08:28:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[08:29:09] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:29:22] <wikibugs>	 (03PS2) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639)
[08:31:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:31:16] <wikibugs>	 (03PS1) 10Effie Mouzeli: admin: Add ehughes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/981946 (https://phabricator.wikimedia.org/T351387)
[08:31:39] <xSavitar>	 kostajh, I've tested on a debug host and on a k8s REPL. The latter works fine, meaning the patch is doing what it's expected to do.
[08:32:35] <xSavitar>	 But the former case didn't work and I know why. The hostname is just "deploy2002", so the code doesn't see "debug" in the name (as in the hostnames before). Looks like they've been renamed?
[08:32:53] <xSavitar>	 I'll talk with Krinkle about this and see if we can improve the patch a little bit. But yes, it works. :)
[08:33:20] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] admin: Add ehughes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/981946 (https://phabricator.wikimedia.org/T351387) (owner: 10Effie Mouzeli)
[08:33:36] <xSavitar>	 I remember seeing hostnames like: mwdebug1001 or something like that :)
[08:34:03] <xSavitar>	 kostajh, so yeah, you can sync this :), I'll leave a comment on the task in Phab
[08:35:50] <kostajh>	 xSavitar: OK
[08:36:11] <logmsgbot>	 !log kharlan@deploy2002 kharlan and d3r1ck01: Continuing with sync
[08:37:05] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:40:12] <dcausse>	 !log restarted blazegraph on wdqs1006 (BlazegraphFreeAllocatorsDecreasingRapidly)
[08:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:22] <logmsgbot>	 !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:976252|ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (T347366)]], [[gerrit:981424|IPInfo: Add comment clarifying $wgIPInfoGeoIP2EnterprisePath (T304604)]] (duration: 22m 02s)
[08:43:28] <stashbot>	 T347366: Follow-up on wmf-config "ClusterConfig::isTest" method - https://phabricator.wikimedia.org/T347366
[08:43:28] <stashbot>	 T304604: Set config for path to MaxMind files on production - https://phabricator.wikimedia.org/T304604
[08:43:29] <kostajh>	 !log UTC morning deploys done
[08:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:49] <xSavitar>	 kostajh, thank you very much for deploying my patch. I appreciate it.
[08:45:36] <wikibugs>	 (03PS1) 10Effie Mouzeli: admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684)
[08:46:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) (owner: 10Effie Mouzeli)
[08:48:20] <wikibugs>	 (03PS2) 10Effie Mouzeli: admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684)
[08:50:31] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] admin: add mcastro-wmf to ldap_only _users [puppet] - 10https://gerrit.wikimedia.org/r/981947 (https://phabricator.wikimedia.org/T352684) (owner: 10Effie Mouzeli)
[08:54:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (10jijiki) @Mcastro done, please reopen if something is not right
[08:55:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (10jijiki) 05Open→03Resolved a:03jijiki
[08:55:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10jijiki) 05Open→03Resolved done:)
[08:59:21] <wikibugs>	 (03PS18) 10Stevemunene: C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede)
[09:04:40] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] kubernetes: Remove cergen certs from kubernetes secrets [labs/private] - 10https://gerrit.wikimedia.org/r/980891 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:10:58] <wikibugs>	 (03PS1) 10Jelto: phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517)
[09:11:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) (owner: 10Jelto)
[09:12:55] <wikibugs>	 (03PS2) 10Jelto: phabricator: add dedicated blackbox check for collab team and severity task [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517)
[09:23:32] <wikibugs>	 (03PS3) 10Jelto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto)
[09:24:25] <wikibugs>	 (03PS4) 10Jelto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto)
[09:27:24] <wikibugs>	 (03CR) 10Jelto: "rebased to latest weekly rebuild" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto)
[09:27:51] <wikibugs>	 (03PS14) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[09:29:00] <wikibugs>	 (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[09:30:37] <wikibugs>	 (03CR) 10Jelto: [V: 03+2 C: 03+2] wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (https://phabricator.wikimedia.org/T352003) (owner: 10Giuseppe Lavagetto)
[09:33:28] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Disable rp_filter for ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069)
[09:37:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "lgtm, I also checked that the service names were the same in the "if $install_via_git" code path." [puppet] - 10https://gerrit.wikimedia.org/r/981942 (owner: 10Slyngshede)
[09:37:25] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/863/con" [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[09:38:03] <wikibugs>	 (03PS3) 10D3r1ck01: ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981954 (https://phabricator.wikimedia.org/T347366)
[09:38:22] <wikibugs>	 (03PS1) 10Elukey: service: update recommendation-api's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981956 (https://phabricator.wikimedia.org/T349118)
[09:39:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[09:40:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro)
[09:40:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] service: update recommendation-api's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981956 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[09:41:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/981407 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[09:41:14] <wikibugs>	 (03PS3) 10Brouberol: [yarn] Add the option to configure the spark history server address [puppet] - 10https://gerrit.wikimedia.org/r/981948 (https://phabricator.wikimedia.org/T352863)
[09:43:07] <wikibugs>	 (03PS3) 10Brouberol: Configure the Spark History server host for the an-test yarn [puppet] - 10https://gerrit.wikimedia.org/r/981949 (https://phabricator.wikimedia.org/T352863)
[09:43:23] <wikibugs>	 (03PS3) 10Brouberol: Configure the Spark History server host for the analytics yarn [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863)
[09:44:05] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/867/con" [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol)
[09:44:17] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync
[09:44:34] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync
[09:48:35] <wikibugs>	 (03CR) 10Hashar: "recheck after CI config https://gerrit.wikimedia.org/r/c/integration/config/+/981464" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede)
[09:49:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede)
[09:50:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[09:50:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[09:54:19] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069)
[09:54:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 1547
[09:54:59] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 1547
[09:55:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron)
[09:55:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync
[09:55:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync
[09:55:47] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/868/con" [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[09:56:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38753
[09:57:20] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38753
[10:02:23] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069)
[10:02:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro)
[10:03:12] <jayme>	 !log removed cergen certs of all k8s services from private puppet in commit d36a97aa23e21824f95d22264d06e2c3bf3c6ac3 - T300033
[10:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:23] <stashbot>	 T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033
[10:03:55] <jayme>	 😬
[10:04:15] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cloud: add missing codfw1dev:openstack_control_nodes [puppet] - 10https://gerrit.wikimedia.org/r/981448 (https://phabricator.wikimedia.org/T353048) (owner: 10David Caro)
[10:04:16] <jelto>	 🤞
[10:04:31] <wikibugs>	 (03Merged) 10jenkins-bot: openstack_apis_response: add value to the description [alerts] - 10https://gerrit.wikimedia.org/r/981450 (owner: 10David Caro)
[10:04:51] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:06:14] <wikibugs>	 (03CR) 10Btullis: "This looks good. Can we deploy manually to the test cluster first?" [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede)
[10:06:21] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] C:bigtop::hadoop switch to new topology script. [puppet] - 10https://gerrit.wikimedia.org/r/954911 (owner: 10Slyngshede)
[10:07:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nicely done! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: 10Majavah)
[10:07:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] An an option to configure the event log storage location for all spark jobs [puppet] - 10https://gerrit.wikimedia.org/r/980859 (https://phabricator.wikimedia.org/T352849) (owner: 10Brouberol)
[10:11:06] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:11:51] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:12:54] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:13:04] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:13:16] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:13:23] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter for ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/981955 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:17:55] <wikibugs>	 (03PS1) 10Elukey: Revert "service: update recommendation-api's docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981732
[10:20:54] <wikibugs>	 (03CR) 10JMeybohm: "If this can wait another day, you could pull in the latest ingress module as well: I95662e864cd4e10cca9c5357db42deffd06ba9e9" [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[10:22:06] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] networktests: use tool network-tests instead of personal one [puppet] - 10https://gerrit.wikimedia.org/r/967932 (owner: 10David Caro)
[10:23:40] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] disable_tool: use the gitlab repository [puppet] - 10https://gerrit.wikimedia.org/r/963260 (https://phabricator.wikimedia.org/T327057) (owner: 10David Caro)
[10:24:18] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "It is fine installing `wheel` with `pip` at least to be consistent with the other images :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[10:24:54] <claime>	 jouncebot: nowandnext
[10:24:54] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 35 minute(s)
[10:24:54] <jouncebot>	 In 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1100)
[10:28:16] <wikibugs>	 (03PS1) 10Jelto: wmf-debci: fix templating in Dockerfile RUN command [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003)
[10:31:00] <wikibugs>	 (03CR) 10Jelto: "unfortunately there is a little typo in the template, this should fix the issue (tested locally)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto)
[10:32:19] <wikibugs>	 (03CR) 10Vgutierrez: "you need to pool at least 4 nodes before merging this. You could adjust the depool_threshold too" [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:33:08] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10hnowlan)
[10:33:43] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[10:35:13] <wikibugs>	 (03PS1) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042
[10:35:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "service: update recommendation-api's docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981732 (owner: 10Elukey)
[10:36:06] <wikibugs>	 (03CR) 10Hashar: "The build fails cause `pip` in Debian Bookworm has been to error out when someone tries to install a package to `/usr/local/` with a more " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[10:37:23] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:37:23] <claime>	 !log Repooling dse-k8s-worker nodes - sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=yes - T352639
[10:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:27] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[10:37:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (owner: 10Ayounsi)
[10:37:31] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/982038 (https://phabricator.wikimedia.org/T351069)
[10:38:38] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync
[10:38:50] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync
[10:38:55] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:38:59] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[10:39:14] <wikibugs>	 (03PS2) 10Hashar: Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[10:39:41] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:42:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[10:42:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync
[10:42:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync
[10:43:06] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2) [puppet] - 10https://gerrit.wikimedia.org/r/981944 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:45:52] <claime>	 !log Disabling puppet on O:lvs::balancer - T352639
[10:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:56] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[10:46:16] <claime>	 !log Running puppet on O:lvs::balancer - T352639
[10:46:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:01] <wikibugs>	 (03PS1) 10Brouberol: Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733
[10:48:11] <wikibugs>	 (03PS2) 10Brouberol: Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733
[10:50:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639)
[10:51:20] <wikibugs>	 (03CR) 10Brouberol: "Not to be deployed until we have successfully deployed the k8s-ingress-dse LVS service in lvs_state" [dns] - 10https://gerrit.wikimedia.org/r/981733 (owner: 10Brouberol)
[10:54:27] <wikibugs>	 (03PS6) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[10:54:48] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639)
[10:54:52] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[10:55:48] <dcausse>	 !log (properly) restarting blazegraph on wdqs1006 (BlazegraphFreeAllocatorsDecreasingRapidly)
[10:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1100)
[11:01:56] <wikibugs>	 (03CR) 10Brouberol: "This PR can now be deployed (when +2ed)" [dns] - 10https://gerrit.wikimedia.org/r/981733 (owner: 10Brouberol)
[11:02:34] <wikibugs>	 (03PS3) 10Brouberol: Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 (https://phabricator.wikimedia.org/T352639)
[11:03:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:04:25] <wikibugs>	 (03CR) 10Klausman: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[11:04:27] <wikibugs>	 (03PS1) 10Brouberol: Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639)
[11:05:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:05:22] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Switch state of k8s-ingress-dse LVS service to production [puppet] - 10https://gerrit.wikimedia.org/r/982045 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:05:30] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:06:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:06:34] <wikibugs>	 (03CR) 10Elukey: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[11:10:25] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 128, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:12:10] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service"" [dns] - 10https://gerrit.wikimedia.org/r/981733 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:12:51] <brouberol>	 !log Add discovery records for the k8s-ingress-dse LVS service - T352639
[11:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:55] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[11:14:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:15:16] <wikibugs>	 (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12)
[11:15:59] <wikibugs>	 (03CR) 10Reedy: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12)
[11:16:22] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on text|secondary LVS in esams [puppet] - 10https://gerrit.wikimedia.org/r/982040 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:16:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[11:16:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[11:16:57] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto)
[11:18:29] <wikibugs>	 (03PS4) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912
[11:18:38] <claime>	 !log sudo confctl --object-type discovery select 'name=eqiad,dnsdisc=k8s-ingress-dse' set/pooled=true - T352639
[11:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:41] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[11:19:23] <wikibugs>	 (03CR) 10Reedy: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12)
[11:20:03] <vgutierrez>	 !log rolling restart of pybal on lvs3010 and lvs3008 effectively enabling IPIP encapsulation on ncredir@esams - T351069
[11:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:08] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[11:22:08] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10MatthewVernon) @fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93/4% full now: ` mvernon@thanos-fe1001:~$ sudo swift-...
[11:24:06] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10MatthewVernon) Quite significant growth in thanos disk usage over the last 6 months: https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=1686482606897&to...
[11:30:31] <wikibugs>	 (03PS2) 10Dreamy Jazz: Enable read new on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829)
[11:30:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[11:31:53] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266)
[11:32:00] <wikibugs>	 (03PS2) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266)
[11:34:13] <wikibugs>	 10SRE, 10Observability-Metrics, 10Goal, 10Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10elukey) @colewhite hi again! I added some context to https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/982047, now I have a better idea about...
[11:34:31] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:36:24] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Add lists1003 grants for mailman dbs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup)
[11:36:29] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Add lists1003 grants for mailman dbs [puppet] - 10https://gerrit.wikimedia.org/r/910598 (https://phabricator.wikimedia.org/T331706) (owner: 10Ladsgroup)
[11:38:07] <wikibugs>	 (03CR) 10Hnowlan: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[11:45:51] <wikibugs>	 (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[11:48:20] <wikibugs>	 (03CR) 10Klausman: python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[11:50:48] <wikibugs>	 (03CR) 10Hashar: contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[11:51:56] <wikibugs>	 10SRE, 10serviceops: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10Clement_Goubert)
[11:52:23] <wikibugs>	 (03PS10) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[11:52:58] <wikibugs>	 10SRE, 10serviceops: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10Clement_Goubert) p:05Triage→03Medium
[11:52:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:57:27] <wikibugs>	 (03CR) 10Jelto: [V: 03+2 C: 03+2] wmf-debci: fix templating in Dockerfile RUN command [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982041 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto)
[11:57:29] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] python-webapp: update mesh and base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[12:00:10] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes10[59-62]: add to eqiad.k8s [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135)
[12:02:33] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm)
[12:02:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:03:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Growth: Enable Welcome survey user research for ar/en/es" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981734 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm)
[12:03:57] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome survey user research for ar/en/es" (T351266)]]
[12:04:02] <stashbot>	 T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266
[12:05:24] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome survey user research for ar/en/es" (T351266)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:05:28] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[12:08:37] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:08:44] <wikibugs>	 (03PS7) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[12:11:25] <brouberol>	 !log Adding spark-history(-test).svc.eqiad.wmnet CNAMEs pointing to k8s-ingress-dse.svc.eqiad.wmnet. - T352639
[12:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:40] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[12:12:17] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:981734|Revert "Growth: Enable Welcome survey user research for ar/en/es" (T351266)]] (duration: 08m 20s)
[12:12:21] <stashbot>	 T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266
[12:15:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "In prevision of deploying I1c0bfa369b886c648bf1f27afd6ee581daed0625" [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[12:24:34] <wikibugs>	 (03CR) 10Clément Goubert: kubernetes10[59-62]: add to eqiad.k8s (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[12:25:34] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/981439
[12:26:10] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Disable rp_filter on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069)
[12:27:58] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/870/con" [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[12:28:54] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069)
[12:30:11] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/871/con" [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[12:32:12] <wikibugs>	 (03PS1) 10Clément Goubert: wikikube: put kubernetes10[59-62] in production [puppet] - 10https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135)
[12:32:14] <wikibugs>	 (03PS1) 10Clément Goubert: wikikube: add kubernetes10[59-62] to LVS [puppet] - 10https://gerrit.wikimedia.org/r/982072 (https://phabricator.wikimedia.org/T353135)
[12:37:05] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:39] <wikibugs>	 (03PS2) 10Clément Goubert: kubernetes10[59-62]: add to devices.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135)
[12:47:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] restbase: add missing keys & certs, remove obsolete [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[12:48:07] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[12:51:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Nice!  Removing those huge dicts is very pleasing to the eye :)" [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[12:54:05] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[12:54:17] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:31] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[12:55:24] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Provide python3-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[12:56:29] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:21] <claime>	 !log Rebuilding production-images for python3-build-bookworm - T352733
[12:57:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:25] <stashbot>	 T352733: Provide python3-build-bookworm docker image - https://phabricator.wikimedia.org/T352733
[13:01:43] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:27] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron)
[13:04:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[13:05:12] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[13:06:58] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: Override Cumin batch sleep+size from command line [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/719470 (owner: 10Filippo Giunchedi)
[13:11:21] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I did a first pass. There are still a lot of references to the server part and is missing a bunch of refactoring needed for the split. See" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede)
[13:11:56] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (10fgiunchedi) >>! In T353091#9395917, @MatthewVernon wrote: > @fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93/4% ful...
[13:12:48] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[13:13:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[13:14:09] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: decommission db1138 [puppet] - 10https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458)
[13:14:58] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM - One nit in-line but all code looks good to me :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[13:17:37] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: decomission pre downtime
[13:17:54] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: decomission pre downtime
[13:18:04] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm now, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney)
[13:18:33] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] python-webapp: update mesh and base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 (owner: 10Elukey)
[13:20:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: decommission db1138 [puppet] - 10https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb)
[13:20:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1138.eqiad.wmnet
[13:20:37] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: decommission db1138 [puppet] - 10https://gerrit.wikimedia.org/r/981440 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb)
[13:22:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommission db1138', diff saved to https://phabricator.wikimedia.org/P54328 and previous config saved to /var/cache/conftool/dbconfig/20231211-132250-arnaudb.json
[13:25:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[13:25:38] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (10ABran-WMF) 05In progress→03Open
[13:26:57] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:26:58] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db1138.eqiad.wmnet
[13:27:04] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by arnaudb@cumin1001 for hosts: `db1138.eqiad.wmnet` - db1138.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[13:28:42] <wikibugs>	 (03PS1) 10Marostegui: report_users: Remove 10.64.48.43 [software] - 10https://gerrit.wikimedia.org/r/982084
[13:29:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] wikikube: put kubernetes10[59-62] in production [puppet] - 10https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[13:29:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] wikikube: add kubernetes10[59-62] to LVS [puppet] - 10https://gerrit.wikimedia.org/r/982072 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[13:29:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] report_users: Remove 10.64.48.43 [software] - 10https://gerrit.wikimedia.org/r/982084 (owner: 10Marostegui)
[13:30:20] <wikibugs>	 (03Merged) 10jenkins-bot: report_users: Remove 10.64.48.43 [software] - 10https://gerrit.wikimedia.org/r/982084 (owner: 10Marostegui)
[13:39:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Move to standard rsyslog-rotate shared script [puppet] - 10https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710)
[13:41:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The kafkatee logrotate is causing errors on centrallog where we upgraded rsyslog:" [puppet] - 10https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[13:41:21] <wikibugs>	 (03PS1) 10LSobanski: Switch alerts deployment source to GitLab [puppet] - 10https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626)
[13:42:40] <wikibugs>	 (03CR) 10LSobanski: [C: 04-2] "Prerequisites are not met yet so blocking for now." [puppet] - 10https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: 10LSobanski)
[13:45:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:45:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:48:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:49:10] <wikibugs>	 (03PS7) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649)
[13:49:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:50:58] <wikibugs>	 (03CR) 10Ayounsi: Expose Netbox's BGP servers to Homer (035 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[13:53:22] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Thanks, I think I follow how this all works!" [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[13:54:06] <wikibugs>	 (03PS9) 10Ladsgroup: Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[13:56:53] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[13:58:43] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:58:50] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:59:15] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[14:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1400)
[14:00:06] <jouncebot>	 Dreamy_Jazz and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:09] <wikibugs>	 (03PS4) 10Anzx: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981)
[14:00:20] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:28] <TheresNoTime>	 I can deploy
[14:00:33] <Dreamy_Jazz>	 \o
[14:00:35] <anzx>	 o/
[14:01:01] <TheresNoTime>	 wait one
[14:02:07] <TheresNoTime>	 Dreamy_Jazz: starting with yours
[14:02:18] <Dreamy_Jazz>	 I will be unable to test my patch as I don't have CU rights on the wikis having the change enabled (testwiki already has the change enabled). I have informed checkusers about the change on checkuser-l.
[14:02:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[14:02:20] <wikibugs>	 (03PS1) 10Clément Goubert: prometheus-php-fpm-exporter: fix build script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982089
[14:02:22] <wikibugs>	 (03PS1) 10Clément Goubert: Fix some Build-Depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090
[14:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: Enable read new on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[14:03:48] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]]
[14:04:00] <stashbot>	 T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829
[14:04:41] <TheresNoTime>	 Dreamy_Jazz: did the testwiki deploy go okay? 
[14:04:44] <Dreamy_Jazz>	 Yes
[14:05:02] <logmsgbot>	 !log samtar@deploy2002 samtar and dreamyjazz: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:05:05] <logmsgbot>	 !log samtar@deploy2002 samtar and dreamyjazz: Continuing with sync
[14:05:36] <TheresNoTime>	 Then we'll continue and watch for issues :-)
[14:05:59] <Dreamy_Jazz>	 Thanks. I intend to monitor logstash.
[14:06:22] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:06:23] <wikibugs>	 (03PS5) 10Samtar: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx)
[14:10:29] <Kizule>	 Hi, I forgot again that the backport window is now. Do we have some time to give https://phabricator.wikimedia.org/T350431 another try? :)
[14:11:35] <TheresNoTime>	 Kizule: quite probably, just got anzx's patch to do next
[14:11:46] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:979986|Enable read new on group0 wikis (T341829)]] (duration: 07m 57s)
[14:11:51] <Kizule>	 At least on srwikisource and similar smaller projects first. :)
[14:11:51] <TheresNoTime>	 Dreamy_Jazz: deployed
[14:11:51] <stashbot>	 T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829
[14:11:56] <Dreamy_Jazz>	 Thanks!
[14:12:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx)
[14:12:29] <Kizule>	 721-725 and then Serbian Wikipedia if everything works out fine. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721
[14:12:37] <Kizule>	 I'll add the task to the Deployments page.
[14:13:32] <Kizule>	 Done
[14:14:03] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on text|secondary LVS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069)
[14:14:15] <wikibugs>	 (03Merged) 10jenkins-bot: hewikivoyage: update vector 2022 wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981726 (https://phabricator.wikimedia.org/T351981) (owner: 10Anzx)
[14:14:28] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]]
[14:14:32] <stashbot>	 T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[14:15:44] <logmsgbot>	 !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:15:46] <anzx>	 TheresNoTime: checking
[14:15:54] <TheresNoTime>	 ack
[14:16:50] <anzx>	 TheresNoTime: looks good 
[14:16:59] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1138.eqiad.wmnet - arnaudb@cumin1001"
[14:17:03] <logmsgbot>	 !log samtar@deploy2002 samtar and anzx: Continuing with sync
[14:17:06] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[14:18:19] <wikibugs>	 10SRE-swift-storage: Q2 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon)
[14:18:54] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1138.eqiad.wmnet - arnaudb@cumin1001"
[14:18:54] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:19:35] <wikibugs>	 10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon)
[14:20:05] <Kizule>	 TheresNoTime: mwscript namespaceDupes.php srwikibooks and so on. First without --fix, so I can check the output, and after that, if everything looks good, with --fix. :)
[14:20:20] <TheresNoTime>	 Kizule: ack, will do :)
[14:20:37] <Kizule>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/s3.dblist#721 to 725, and then Serbian Wikipedia (srwiki).
[14:20:39] <Kizule>	 Thanks!
[14:21:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10MatthewVernon)
[14:23:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[14:23:42] <wikibugs>	 (03Abandoned) 10D3r1ck01: ClusterConfig: Followup on I955168f072315e0064c69a66483e61dfc23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981954 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[14:25:03] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:981726|hewikivoyage: update vector 2022 wordmark and tagline (T351981)]] (duration: 10m 35s)
[14:25:05] <TheresNoTime>	 anzx: live :)
[14:25:10] <stashbot>	 T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[14:25:25] <TheresNoTime>	 Kizule: starting with `mwscript namespaceDupes.php srwikibooks`
[14:25:43] <Kizule>	 Okay :)
[14:25:44] <TheresNoTime>	 `Unsafe to run at this time. See: T350443`
[14:25:45] <stashbot>	 T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443
[14:26:29] <TheresNoTime>	 (investigating)
[14:26:46] <Kizule>	 Duh, it's not supposed to be there anymore.
[14:27:12] <Kizule>	 It's not in master anymore at least https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/namespaceDupes.php
[14:27:12] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458)
[14:27:39] <TheresNoTime>	 it's not been backported, https://phabricator.wikimedia.org/T350443#9379293
[14:27:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb)
[14:29:11] <TheresNoTime>	 Kizule: easiest way of dealing with this would be to wait until later in the week
[14:30:52] <Kizule>	 TheresNoTime: Ok, let's give it a try in the UTC late backport window on Thursday, after the MW train.
[14:30:58] <anzx>	 TheresNoTime: it still displays the old logo; maybe run `echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikivoyage-wordmark-he.svg' | mwscript purgeList.php`
[14:31:05] <TheresNoTime>	 anzx: ack
[14:31:45] <TheresNoTime>	 anzx: done
[14:32:38] <anzx>	 TheresNoTime: I think it should be done for the tagline also
[14:33:08] <anzx>	 Never mind, now it appears correct, thanks
[14:33:25] <Kizule>	 Alright, I did a reschedule. :)
[14:33:32] <TheresNoTime>	 Kizule: okay, sorry! :)
[14:33:44] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:33:46] <Kizule>	 No problem :)
[14:36:33] <wikibugs>	 (03PS1) 10Milimetric: aqs: update mw history snapshot probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097
[14:36:50] <wikibugs>	 (03PS1) 10Ottomata: changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247)
[14:37:03] <TheresNoTime>	 !log close UTC afternoon backport window
[14:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:36] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Move to standard rsyslog-rotate shared script [puppet] - 10https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[14:37:41] <wikibugs>	 (03PS1) 10Milimetric: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099
[14:39:10] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] aqs: update mw history snapshot probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097 (owner: 10Milimetric)
[14:39:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] aqs: update mw history snapshot probably last time [puppet] - 10https://gerrit.wikimedia.org/r/982097 (owner: 10Milimetric)
[14:40:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric)
[14:42:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata)
[14:43:07] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop - bump image version to discard canary events [deployment-charts] - 10https://gerrit.wikimedia.org/r/982098 (https://phabricator.wikimedia.org/T351247) (owner: 10Ottomata)
[14:44:08] <wikibugs>	 (03CR) 10Milimetric: [C: 03+2] edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric)
[14:45:07] <wikibugs>	 (03Merged) 10jenkins-bot: edit*-analytics: update mediawiki_history snapshot version [deployment-charts] - 10https://gerrit.wikimedia.org/r/982099 (owner: 10Milimetric)
[14:45:11] <ottomata>	 !log deploying changeprop to pick up https://phabricator.wikimedia.org/T351247
[14:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:57] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[14:46:39] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[14:47:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye
[14:47:44] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2006.codfw.wmnet with OS bullseye
[14:48:19] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] "will be handy!" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[14:48:41] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: remove db1141 db1142 db1143 [puppet] - 10https://gerrit.wikimedia.org/r/981441 (https://phabricator.wikimedia.org/T350458) (owner: 10Arnaudb)
[14:48:48] <logmsgbot>	 !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[14:48:55] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[14:49:06] <logmsgbot>	 !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[14:49:35] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[14:50:36] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[14:50:41] <logmsgbot>	 !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[14:51:06] <wikibugs>	 (03PS1) 10JMeybohm: Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933)
[14:51:10] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[14:51:17] <logmsgbot>	 !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[14:51:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1141.eqiad.wmnet
[14:52:07] <logmsgbot>	 !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[14:52:09] <wikibugs>	 (03PS1) 10JMeybohm: Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735
[14:52:21] <logmsgbot>	 !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[14:52:39] <wikibugs>	 (03PS2) 10JMeybohm: Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933)
[14:53:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'decommission db1141 42 and 43', diff saved to https://phabricator.wikimedia.org/P54330 and previous config saved to /var/cache/conftool/dbconfig/20231211-145300-arnaudb.json
[14:53:04] <logmsgbot>	 !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[14:53:34] <logmsgbot>	 !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[14:54:10] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:20] <wikibugs>	 (03CR) 10Ayounsi: "thx good catch on the proper servers locations." [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[14:56:41] <logmsgbot>	 !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[14:56:53] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[14:57:14] <logmsgbot>	 !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[14:57:23] <logmsgbot>	 !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[14:57:41] <logmsgbot>	 !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[15:01:16] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1141.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:03:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1141.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:03:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:03:26] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1141.eqiad.wmnet
[15:04:32] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM when CI is happy" [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (owner: 10Ayounsi)
[15:04:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage
[15:05:09] <wikibugs>	 (03PS3) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798)
[15:05:30] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:06:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: add sink notifications capability [puppet] - 10https://gerrit.wikimedia.org/r/982103 (https://phabricator.wikimedia.org/T353060)
[15:06:37] <wikibugs>	 (03PS4) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798)
[15:08:08] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage
[15:08:33] <wikibugs>	 (03PS3) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968)
[15:08:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: swift: write to local files and ban before centrallog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[15:08:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[15:09:44] <wikibugs>	 (03CR) 10MVernon: [C: 04-1] "One thing that looks a bit strange to me here, but perhaps I misunderstand..." [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[15:09:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - 10https://gerrit.wikimedia.org/r/981635 (owner: 10Majavah)
[15:10:08] <wikibugs>	 (03PS1) 10Dreamy Jazz: CheckUser: Enable read new for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982105 (https://phabricator.wikimedia.org/T341829)
[15:10:48] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1141.eqiad.wmnet - https://phabricator.wikimedia.org/T353152 (10ABran-WMF) a:05ABran-WMF→03None
[15:10:59] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1138.eqiad.wmnet - https://phabricator.wikimedia.org/T353148 (10ABran-WMF) a:05ABran-WMF→03None
[15:11:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:11:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:12:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1142.eqiad.wmnet
[15:13:09] <wikibugs>	 (03PS1) 10Brouberol: Fix: make sure to generate a TLS certificate for the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639)
[15:13:49] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:14:14] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:14:57] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:17:39] <wikibugs>	 (03PS2) 10Brouberol: admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639)
[15:17:42] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:17:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "The new annotation looks good, but not 100% clear why we need it, buuut it is informative so I trust your judgement!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:17:50] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:17:57] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982063 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:18:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[15:19:28] <wikibugs>	 (03CR) 10Elukey: "Thanks! Do we need a version bump since those images were built? Or it is just to allow rebuilds? I don't have a strong opinion, just rais" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert)
[15:20:32] <wikibugs>	 (03PS3) 10Brouberol: admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639)
[15:20:54] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1142.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:21:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[15:21:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:21:48] <wikibugs>	 (03CR) 10JMeybohm: Bump cert-manager to 1.10.1-2 (bullseye) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:21:59] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1142.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:21:59] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:21:59] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1142.eqiad.wmnet
[15:22:58] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1142.eqiad.wmnet - https://phabricator.wikimedia.org/T353154 (10ABran-WMF)
[15:23:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:23:10] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] admin_ng: fix gateway TLS setting for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/982106 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[15:23:36] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546)
[15:23:57] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:24:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:24:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:24:42] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:25:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:25:12] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1143.eqiad.wmnet
[15:25:55] <brouberol>	 !log provisioning TLS certificates for the spark-history and spark-history-test namespaces in dse-k8s-eqiad - T352639
[15:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:59] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:26:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED
[15:27:15] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: backport 598bfa3aabe9cf2c1d09f58d4a0745462e80b1bc to 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/982111 (https://phabricator.wikimedia.org/T326818)
[15:27:23] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: fix double-logging of proxy-server access logs [puppet] - 10https://gerrit.wikimedia.org/r/982112 (https://phabricator.wikimedia.org/T352968)
[15:27:55] <wikibugs>	 (03Merged) 10jenkins-bot: Bump cert-manager to 1.10.1-2 (bullseye) [deployment-charts] - 10https://gerrit.wikimedia.org/r/982100 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:28:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: backport 598bfa3aabe9cf2c1d09f58d4a0745462e80b1bc to 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/982111 (https://phabricator.wikimedia.org/T326818) (owner: 10Andrew Bogott)
[15:28:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Thanks. I don't think we technically need a version bump" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/982090 (owner: 10Clément Goubert)
[15:30:31] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Diffscan: host off-infra - https://phabricator.wikimedia.org/T265595 (10joanna_borun) p:05Triage→03Low
[15:30:31] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[15:30:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED
[15:30:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: fix double-logging of proxy-server access logs [puppet] - 10https://gerrit.wikimedia.org/r/982112 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[15:31:14] <wikibugs>	 (03CR) 10Elukey: "Looks good! I left a comment since I got lost in one bit of the change, be patient :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:32:09] <claime>	 jouncebot: nowandnext
[15:32:09] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[15:32:09] <jouncebot>	 In 0 hour(s) and 57 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1630)
[15:32:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1143.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:32:53] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wikikube: put kubernetes10[59-62] in production [puppet] - 10https://gerrit.wikimedia.org/r/982071 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[15:33:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye
[15:33:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[15:33:33] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1143.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[15:33:33] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:33:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1143.eqiad.wmnet
[15:34:31] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:34:48] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1143.eqiad.wmnet - https://phabricator.wikimedia.org/T353156 (10ABran-WMF) a:05ABran-WMF→03None
[15:36:55] <wikibugs>	 (03CR) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:38:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ingress.istio: Remove trust for every SAN but the default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:38:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:39:48] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:39:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2006.codfw.wmnet with OS bullseye
[15:39:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2006.codfw.wmnet with OS bullseye completed: - sessionstore...
[15:40:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] kubernetes10[59-62]: add to devices.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[15:40:29] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:40:33] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:40:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:41:04] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:41:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[15:41:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] kubernetes10[59-62]: add to devices.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[15:41:08] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Introduce Spicerack.kafka module, along with the method to transfer offset state between consumer groups and clusters - https://phabricator.wikimedia.org/T291681 (10joanna_borun) 05Open→03Resolved p:05Triage→03Medium
[15:41:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2005.codfw.wmnet with OS bullseye
[15:41:31] <wikibugs>	 (03Merged) 10jenkins-bot: Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:41:50] <wikibugs>	 (03Merged) 10jenkins-bot: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:42:04] <wikibugs>	 (03PS1) 10Brouberol: Register ingress CNAME record for the echoserver-dse-k8s-eqiad service [dns] - 10https://gerrit.wikimedia.org/r/982116
[15:42:08] <wikibugs>	 (03Merged) 10jenkins-bot: function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:42:19] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1026 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:47] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes10[59-62]: add to devices.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/982051 (https://phabricator.wikimedia.org/T353135) (owner: 10Clément Goubert)
[15:42:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:43:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (10joanna_borun) a:03jhathaway
[15:44:43] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:48] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:45:54] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982070 (https://phabricator.wikimedia.org/T351069)
[15:47:45] <wikibugs>	 (03PS15) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[15:48:43] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:49:13] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:50:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:51:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage
[15:52:00] <wikibugs>	 (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[15:52:45] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "cert-manager: bump version in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981735 (https://phabricator.wikimedia.org/T351933) (owner: 10JMeybohm)
[15:53:11] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:53:28] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:53:47] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:53:55] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi)
[15:54:17] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:54:52] <wikibugs>	 10SRE-swift-storage, 10observability, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Stop sending swift access logs to centrallog for non state-changing requests - https://phabricator.wikimedia.org/T352968 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done! W...
[15:54:54] <wikibugs>	 (03PS1) 10EoghanGaffney: [apt-staging] Deploy gitlab-package-puller script [puppet] - 10https://gerrit.wikimedia.org/r/982119 (https://phabricator.wikimedia.org/T347004)
[15:55:01] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:55:09] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:55:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd2002.codfw.wmnet with reason: host reimage
[15:55:42] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:55:50] <claime>	 !log homer lsw1-*eqiad* commit "Put kubernetes10[59-62] in production - T353135"
[15:55:51] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:54] <stashbot>	 T353135: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135
[15:56:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage
[15:57:07] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:57:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Move to standard rsyslog-rotate shared script [puppet] - 10https://gerrit.wikimedia.org/r/982085 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[15:57:47] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:58:35] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:58:46] <icinga-wm>	 PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:28] <claime>	 That's me ^ I think we need to fix the doc so we do the homer commit after the reimage
[15:59:41] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:00:22] <icinga-wm>	 PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:00:57] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:01:38] <icinga-wm>	 PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:01:52] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:01:59] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1059.eqiad.wmnet with OS bullseye
[16:02:09] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye
[16:02:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage
[16:02:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1060.eqiad.wmnet with OS bullseye
[16:02:47] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye
[16:03:10] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1061.eqiad.wmnet with OS bullseye
[16:03:22] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye
[16:03:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1062.eqiad.wmnet with OS bullseye
[16:03:52] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye
[16:04:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata)
[16:05:03] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on text|secondary LVS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/982096 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:05:16] <claime>	 XioNoX: About the BGP alerts caused by me messing up the reimage/homer order: should I roll back the homer changes, or are we OK with them alerting until the cookbook runs its course?
[16:05:17] <wikibugs>	 (03Merged) 10jenkins-bot: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata)
[16:05:35] <XioNoX>	 claime: it's fine, no worries
[16:05:44] <claime>	 Awesome. I'll change the doc, thanks
[16:05:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Postfix MTA Profile - https://phabricator.wikimedia.org/T325398 (10jhathaway) p:05Triage→03Low
[16:05:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-inbound-infra - https://phabricator.wikimedia.org/T325401 (10jhathaway) p:05Triage→03Low
[16:06:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-outbound-infra - https://phabricator.wikimedia.org/T325402 (10jhathaway) p:05Triage→03Low
[16:06:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-outbound-wiki - https://phabricator.wikimedia.org/T325407 (10jhathaway) p:05Triage→03Low
[16:06:20] <wikibugs>	 (03CR) 10Eevans: restbase: add missing keys & certs, remove obsolete (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[16:06:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-inbound-wiki - https://phabricator.wikimedia.org/T325406 (10jhathaway) p:05Triage→03Low
[16:06:27] <wikibugs>	 (03PS4) 10Eevans: restbase: add missing keys & certs, remove obsolete [labs/private] - 10https://gerrit.wikimedia.org/r/981601 (https://phabricator.wikimedia.org/T352468)
[16:06:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Replace Null client configs - https://phabricator.wikimedia.org/T325408 (10jhathaway) p:05Triage→03Low
[16:06:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: Remove Exim based MTAs - https://phabricator.wikimedia.org/T325409 (10jhathaway) p:05Triage→03Low
[16:06:52] <wikibugs>	 (03PS1) 10Ottomata: Revert accidental portals submodule change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798)
[16:07:51] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Revert accidental portals submodule change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata)
[16:09:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [dns] - 10https://gerrit.wikimedia.org/r/982116 (owner: 10Brouberol)
[16:09:16] <wikibugs>	 (03Merged) 10jenkins-bot: Revert accidental portals submodule change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982122 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata)
[16:10:02] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Register ingress CNAME record for the echoserver-dse-k8s-eqiad service [dns] - 10https://gerrit.wikimedia.org/r/982116 (owner: 10Brouberol)
[16:10:34] <ottomata>	 !log enabling canary events for all mediawiki state change event streams - T266798
[16:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:42] <stashbot>	 T266798: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798
[16:13:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:13:21] <vgutierrez>	 !log rolling restart of pybal on lvs1020 and lvs1017 effectively enabling IPIP encapsulation on ncredir@eqiad - T351069
[16:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:30] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[16:15:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:16:02] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) 05Open→03Resolved
[16:16:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez)
[16:16:31] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10Volans) @MatthewVernon is there still anything pending from I/F on this task or can be resolved in light of the follow up work done i...
[16:16:54] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage
[16:17:26] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage
[16:18:05] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage
[16:18:51] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage
[16:18:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:19:06] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Config: [[gerrit:968344|Enable canary events for all MediaWiki event streams (T266798)]] (duration: 08m 25s)
[16:19:10] <stashbot>	 T266798: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798
[16:19:11] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Unify ncredir IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069)
[16:19:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage
[16:20:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:20:39] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Volans) a:03cmooney Assigning to Cathal as per meeting discussion.
[16:21:08] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:21:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:21:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2005.codfw.wmnet with OS bullseye
[16:21:16] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2005.codfw.wmnet with OS bullseye completed: - sessionstore...
[16:21:46] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/873/console" [puppet] - 10https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:22:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[16:22:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage
[16:23:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host sessionstore2004.codfw.wmnet with OS bullseye
[16:23:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:25:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage
[16:26:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:26:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd2002.codfw.wmnet with OS bullseye
[16:26:38] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye completed: - cephosd2002 (...
[16:27:26] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage
[16:28:55] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] hiera: Unify ncredir IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:29:14] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Unify ncredir IPIP encapsulation settings [puppet] - 10https://gerrit.wikimedia.org/r/982124 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:29:34] <icinga-wm>	 RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:30:05] <jouncebot>	 jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1630).
[16:32:34] <icinga-wm>	 RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:36] <icinga-wm>	 PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:35:29] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:35:31] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on kubernetes1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.26. Check system logs on 10.64.131.26 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T353165 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:35:36] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] netops: prometheus::hosts: also probe ipv6 if available [puppet] - 10https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: 10Majavah)
[16:35:36] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on kubernetes1060 - https://phabricator.wikimedia.org/T353165 (10ops-monitoring-bot)
[16:35:42] <icinga-wm>	 RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:36:27] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982110 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:37:05] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:30] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:37:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:38:52] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add db1247 to instances [puppet] - 10https://gerrit.wikimedia.org/r/981443 (https://phabricator.wikimedia.org/T344036)
[16:39:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[16:40:14] <icinga-wm>	 RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:40:58] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:41:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - 10https://gerrit.wikimedia.org/r/981635 (owner: 10Majavah)
[16:41:50] <icinga-wm>	 RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:42:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[16:43:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:43:41] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1059.eqiad.wmnet with OS bullseye
[16:43:51] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye completed: - kubernetes1059 (**WARN**)   - Down...
[16:47:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:47:31] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1062.eqiad.wmnet with OS bullseye
[16:47:41] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye completed: - kubernetes1062 (**WARN**)   - Down...
[16:49:34] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1061.eqiad.wmnet with OS bullseye
[16:49:43] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye completed: - kubernetes1061 (**WARN**)   - Down...
[16:50:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1060.eqiad.wmnet with OS bullseye
[16:50:14] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes10[59-62] - https://phabricator.wikimedia.org/T353135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye completed: - kubernetes1060 (**WARN**)   - Down...
[16:52:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:52:20] <wikibugs>	 (03PS1) 10Jhancock.wm: Add testhost2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703)
[16:52:22] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm)
[16:53:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:55:35] <wikibugs>	 (03CR) 10Jhancock.wm: [C: 03+2] Add testhost2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/982132 (https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm)
[16:56:25] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:982110| Bumping portals to master (T128546)]] (duration: 10m 12s)
[16:56:29] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:56:40] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:57:23] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.02% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:57:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:58:14] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Volans) Interesting... I guess we could try to do the same test with redfish API instead and see if that works all the time and consider convertin...
[17:00:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:00:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2004.codfw.wmnet with OS bullseye
[17:00:52] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host sessionstore2004.codfw.wmnet with OS bullseye completed: - sessionstore...
[17:03:47] <wikibugs>	 (03PS2) 10Ayounsi: Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823)
[17:04:40] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:982110| Bumping portals to master (T128546)]] (duration: 08m 15s)
[17:04:45] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[17:05:29] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @Andrew   Dell is requesting smartctl output showing what drives errors are coming from if you can se...
[17:05:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add retry logic to Netbox API [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi)
[17:13:10] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:16:34] <wikibugs>	 (03PS1) 10Jhancock.wm: Add testhost2001 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/982137 (https://phabricator.wikimedia.org/T352703)
[17:18:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:19:28] <wikibugs>	 (03CR) 10Jhancock.wm: [C: 03+2] Add testhost2001 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/982137 (https://phabricator.wikimedia.org/T352703) (owner: 10Jhancock.wm)
[17:20:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[17:21:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) 05Open→03Resolved @BTullis this is completed!
[17:29:37] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] changeprop: refactor templating for Kafka producer/consumer settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[17:33:10] <wikibugs>	 (03CR) 10Ottomata: "see recent ideas about kafka broker round robin DNS in https://phabricator.wikimedia.org/T213561#9391755" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[17:40:57] <wikibugs>	 (03PS1) 10Bking: wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355)
[17:42:41] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[17:43:38] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[17:43:39] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[17:43:51] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[17:45:41] <logmsgbot>	 !log jayme@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[17:45:42] <logmsgbot>	 !log jayme@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:47:05] <logmsgbot>	 !log jayme@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:47:06] <logmsgbot>	 !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:48:19] <logmsgbot>	 !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:49:35] <wikibugs>	 (03PS2) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284)
[17:49:37] <wikibugs>	 (03PS2) 10RLazarus: admin_ng: Switch on enableJobSidecarController for mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/981704 (https://phabricator.wikimedia.org/T348284)
[17:52:11] <wikibugs>	 (03CR) 10RLazarus: admin_ng: Add namespace and ClusterRole for Job sidecar controller (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981703 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus)
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1800)
[18:00:04] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T1800).
[18:15:39] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12)
[18:29:37] <wikibugs>	 (03PS13) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722)
[18:29:45] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145
[18:30:10] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145 (owner: 10Clément Goubert)
[18:31:14] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: raise replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/982145 (owner: 10Clément Goubert)
[18:31:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:32:02] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:32:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:32:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:32:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:32:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:34:57] <claime>	 !log Raised replicas for mw-web
[18:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:05] <wikibugs>	 (03CR) 10Volans: Add retry logic to Netbox API (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/982042 (https://phabricator.wikimedia.org/T329823) (owner: 10Ayounsi)
[18:42:51] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: eqiad1: permit memcached access via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/977600 (owner: 10Majavah)
[18:51:55] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2031 [puppet] - 10https://gerrit.wikimedia.org/r/981605 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[18:52:22] <wikibugs>	 (03CR) 10Herron: [V: 03+1] grafana: add dashboard datasource usage (graphite) exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron)
[18:57:38] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:59:10] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:05:31] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:18:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: add db1247 to instances [puppet] - 10https://gerrit.wikimedia.org/r/981443 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[19:20:07] <wikibugs>	 (03PS1) 10Jforrester: api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237)
[19:24:13] <wikibugs>	 (03PS2) 10Pols12: Make wiktionary and mw.org provide og:site_name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203)
[19:26:02] <wikibugs>	 (03CR) 10Pols12: Make wiktionary and mw.org provide og:site_name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: 10Pols12)
[19:29:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10Sandeeps)
[19:34:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Releng for sandeeps - https://phabricator.wikimedia.org/T353186 (10thcipriani) Approved from me!
[19:34:31] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:36:52] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.227:9042 on restbase2031 is CRITICAL: connect to address 10.192.32.227 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:39:20] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.32.227:7000 on restbase2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:41:48] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2031 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:42:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "per team chat today" [puppet] - 10https://gerrit.wikimedia.org/r/981591 (https://phabricator.wikimedia.org/T347355) (owner: 10Dzahn)
[19:44:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, just needs to be watched because experience is there are a lot of ways it can fail unexpectedly in the first attempt" [puppet] - 10https://gerrit.wikimedia.org/r/981951 (https://phabricator.wikimedia.org/T343517) (owner: 10Jelto)
[19:44:14] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.228:9042 on restbase2031 is CRITICAL: connect to address 10.192.32.228 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:46:07] <wikibugs>	 (03CR) 10Dzahn: "re: puppet compiler, I think you'd have to run this on the wdqs backend rather than alert1001, because exported resources are used" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:46:40] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.32.228:7000 on restbase2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:47:34] <wikibugs>	 (03PS2) 10Bking: wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355)
[19:49:10] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2031 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:50:10] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:59:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "check_https_url_for_string!query.wikidata.org!/bigdata/ldf?subject=wd%3AQ42&predicate=wdt%3AP31&object=!wd:Q42  wdt:P31  wd:Q5 ."," [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[20:08:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (10KFrancis) Hello all, I am confirming the NDA is now complete.  Please proceed with the access request.  Thank you!
[20:16:03] <wikibugs>	 (03PS1) 10Dzahn: Switch planet to bookworm VM backends [dns] - 10https://gerrit.wikimedia.org/r/982156 (https://phabricator.wikimedia.org/T348392)
[20:17:08] <wikibugs>	 (03PS1) 10Dzahn: site: remove buster VMs from planet regex [puppet] - 10https://gerrit.wikimedia.org/r/982157 (https://phabricator.wikimedia.org/T348392)
[20:37:06] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:39:46] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:41:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:51:41] <wikibugs>	 (03PS1) 10Jdrewniak: [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930)
[20:52:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:54:04] <wikibugs>	 (03PS1) 10Jdrewniak: [Vector] Deploy the Zebra CSS refactor under feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008)
[20:58:14] <wikibugs>	 (03PS1) 10Ottomata: varnishkafka::instance - Add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/982163 (https://phabricator.wikimedia.org/T238230)
[21:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2100). Please do the needful.
[21:00:07] <jouncebot>	 jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:40] * jan_drewniak Looks like I'm the only one with a backport today, so I can do my own deploy.
[21:02:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak)
[21:07:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:21:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak)
[21:26:21] <wikibugs>	 (03CR) 10Jdrewniak: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[21:27:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak)
[21:28:46] <wikibugs>	 (03CR) 10Jdrewniak: "recheck" [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak)
[21:35:34] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: Try icinga-based check instead of blackbox [puppet] - 10https://gerrit.wikimedia.org/r/982138 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[21:44:57] <wikibugs>	 (03Merged) 10jenkins-bot: [Zebra] Fix scrolling behavior in dropdowns [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981740 (https://phabricator.wikimedia.org/T352930) (owner: 10Jdrewniak)
[21:53:58] <Amir1>	 jouncebot: nowandnext
[21:53:59] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2100)
[21:53:59] <jouncebot>	 In 0 hour(s) and 6 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2200)
[21:54:18] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester)
[21:56:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester)
[21:59:22] <jan_drewniak>	 Amir1: hey, the backport was delayed due to a failed test, I rechecked and the test passed (like 45min after the original +2), so I still have to sync this after you're done https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/981740
[21:59:38] <Amir1>	 oh sure thing
[21:59:43] <Amir1>	 I thought it was over
[21:59:44] <Amir1>	 my bad
[22:00:02] <jan_drewniak>	 np
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231211T2200). nyaa~
[22:01:56] <icinga-wm>	 PROBLEM - WDQS Linked Data Fragments Endpoint on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string wd:Q42 wdt:P31 wd:Q5 . not found on https://query.wikidata.org:443/bigdata/ldf?subject=wd%3AQ42&predicate=wdt%3AP31&object= - 8890 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:03:22] <wikibugs>	 (03CR) 10Ottomata: "No op https://puppet-compiler.wmflabs.org/output/982163/874/cp1102.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/982163 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[22:09:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355
[22:09:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355
[22:09:59] <stashbot>	 T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355
[22:11:30] <wikibugs>	 (03Merged) 10jenkins-bot: api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/981737 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester)
[22:12:09] <Amir1>	 jan_drewniak: I think it'll deploy both at the same time
[22:12:21] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]]
[22:12:35] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[22:13:59] <logmsgbot>	 !log ladsgroup@deploy2002 jforrester and ladsgroup: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:14:13] <Amir1>	 jan_drewniak: it's live in mwdebug
[22:15:40] <logmsgbot>	 !log ladsgroup@deploy2002 jforrester and ladsgroup: Continuing with sync
[22:17:03] <wikibugs>	 (03PS1) 10Bking: wdqs: Change LDF monitoring URI [puppet] - 10https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355)
[22:19:36] <jan_drewniak>	 Amir1: thanks, I'll check it now
[22:20:44] <jan_drewniak>	 Amir1: my patch looks good to deploy, are you going to do the sync? 
[22:20:53] <Amir1>	 yup
[22:21:12] <jan_drewniak>	 ok thanks, I have one more config patch to deploy after this if that's ok 
[22:21:38] <jan_drewniak>	 This one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/982162/
[22:23:03] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:981737|api: Add support for pagelinks migration in ApiQueryBacklinks::runSecondQuery (T351237)]] (duration: 10m 42s)
[22:23:08] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[22:23:14] <Amir1>	 I'm done
[22:25:02] <jan_drewniak>	 Amir1: thanks! 
[22:25:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008) (owner: 10Jdrewniak)
[22:26:41] <wikibugs>	 (03Merged) 10jenkins-bot: [Vector] Deploy the Zebra CSS refactor under feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/982162 (https://phabricator.wikimedia.org/T353008) (owner: 10Jdrewniak)
[22:26:55] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]]
[22:27:00] <stashbot>	 T353008: Deploy Zebra everywhere - https://phabricator.wikimedia.org/T353008
[22:28:26] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:32:00] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[22:39:10] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:982162|[Vector] Deploy the Zebra CSS refactor under feature flag (T353008)]] (duration: 12m 14s)
[22:39:14] <stashbot>	 T353008: Deploy Zebra everywhere - https://phabricator.wikimedia.org/T353008
[22:42:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:43:45] <wikibugs>	 (03PS2) 10Hashar: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109)
[22:44:42] <icinga-wm>	 PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:46:34] <icinga-wm>	 PROBLEM - SSH on kubemaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:47:32] <wikibugs>	 (03CR) 10Hashar: "PS2 adds a `[DISMISS]` button next to the link.  On click that:" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: 10Hashar)
[22:47:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (60) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:50:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2031:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2031 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:50:40] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[22:52:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:53:22] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:53:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:54:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:54:22] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:55:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (17) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:55:31] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (17) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:58:08] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:58:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:58:54] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:59:24] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:00:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (24) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:01:27] <wikibugs>	 (03CR) 10Dwisehaupt: "I just realized I draft commented this on Thursday but never sent it. I also learned about the puppet request window so I'm happy to add i" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[23:04:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:05:18] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (34) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:05:24] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:05:31] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:05:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:05:42] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:06:16] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:06:34] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:06:54] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:07:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:10:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (40) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:14:33] <wikibugs>	 (03CR) 10Dzahn: "ARG3 is a string to be found in the content." [puppet] - 10https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:15:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (40) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:15:42] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: OpenSent - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:16:03] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "changing ARG3 won't change that it's a 500. You can skip it entirely and still:" [puppet] - 10https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:19:18] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "the $ARG2 (-u) is what makes it turn from 200 into 500:" [puppet] - 10https://gerrit.wikimedia.org/r/982172 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:19:36] <icinga-wm>	 RECOVERY - SSH on kubemaster2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:20:12] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:20:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (42) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:20:20] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:22] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:20:42] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:24:12] <icinga-wm>	 PROBLEM - SSH on kubemaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:25:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (42) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:26:08] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:26] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:26:44] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:27:40] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:30:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (37) rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:31:00] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:31:18] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:32:48] <icinga-wm>	 RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:33:34] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:32] <jinxer-wm>	 (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:35:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (35) rsyslog on kubernetes2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:37:26] <icinga-wm>	 PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:40:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (38) rsyslog on kubernetes2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:42:08] <icinga-wm>	 RECOVERY - SSH on kubemaster2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:42:58] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:45:54] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv
[23:45:54] <icinga-wm>	 e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IP
[23:45:54] <icinga-wm>	 ve - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:47:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:47:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: (67) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:48:26] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:48:40] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:50:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:50:31] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (43) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:50:50] <icinga-wm>	 RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:51:32] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (43) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:52:46] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51008 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:53:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:53:28] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv
[23:53:28] <icinga-wm>	 e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IP
[23:53:28] <icinga-wm>	 ve - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitor
[23:53:28] <icinga-wm>	 P_status
[23:55:18] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (39) rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:55:40] <jinxer-wm>	 (KubernetesAPINotScrapable) resolved: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[23:55:42] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv
[23:55:42] <icinga-wm>	 e - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IP
[23:55:42] <icinga-wm>	 ve - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/I
[23:55:42] <icinga-wm>	 ive - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/
[23:55:42] <icinga-wm>	 tive - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602
[23:55:43] <icinga-wm>	 ctive - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6460
[23:55:43] <icinga-wm>	 Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS646
[23:55:44] <icinga-wm>	  Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64
[23:55:44] <icinga-wm>	 : Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS6
[23:55:45] <icinga-wm>	 6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:56:54] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:57:34] <jinxer-wm>	 (KubernetesCalicoDown) firing: (53) kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:58:26] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:58:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2056.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2059.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2047.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2033.codfw.
[23:58:44] <icinga-wm>	 ubernetes2008.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2044.codfw.wmnet are marked down but pooled: linkrecommendation-external_4006: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes
[23:58:44] <icinga-wm>	 fw.wmnet, kubernetes2049.codfw.wmnet, kubernetes2043.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2027.codfw.wmnet are marked down but pooled: push-not https://wikitech.wikimedia.org/wiki/PyBal
[23:59:44] <jinxer-wm>	 (ProbeDown) firing: Service miscweb2003:30443 has failed probes (http_transparency_archive_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:59:51] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate