[00:03:42] <zabe>	 hmm, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#4149 looks a bit wrong
[00:04:24] <wikibugs>	 (PS1) Zabe: Fix array declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/732840
[00:06:51] <wikibugs>	 (PS2) Zabe: Fix array declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058)
[00:27:35] <wikibugs>	 (PS1) Dzahn: simplelap: fix a typo introduced in a previous change [puppet] - https://gerrit.wikimedia.org/r/732842
[00:29:22] <wikibugs>	 (CR) Dzahn: simplelap: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714155 (owner: RhinosF1)
[00:29:56] <wikibugs>	 (CR) Dzahn: [C: +2] simplelap: fix a typo introduced in a previous change [puppet] - https://gerrit.wikimedia.org/r/732842 (owner: Dzahn)
[00:30:05] <wikibugs>	 (PS1) Ahmon Dancy: docker: Mostly documentation updates [puppet] - https://gerrit.wikimedia.org/r/732844
[00:31:24] <wikibugs>	 (PS2) Dzahn: simplelap: declare httpd::mpm explicitly and use prefork MPM [puppet] - https://gerrit.wikimedia.org/r/731184
[00:32:08] <wikibugs>	 (CR) Dzahn: [C: +2] simplelap: declare httpd::mpm explicitly and use prefork MPM [puppet] - https://gerrit.wikimedia.org/r/731184 (owner: Dzahn)
[00:35:22] <wikibugs>	 (CR) Dzahn: "tested in cloud VPS" [puppet] - https://gerrit.wikimedia.org/r/731184 (owner: Dzahn)
[02:56:31] <Juan_90264>	 Two tgr's?
[03:21:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:23:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:28] <wikibugs>	 (CR) Ahmon Dancy: "Note: I have not tested this yet.  I will tomorrow unless someone beats me to it. :-)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 (owner: Ahmon Dancy)
[04:05:37] <icinga-wm>	 PROBLEM - SSH on puppetmaster1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:20] <wikibugs>	 (PS1) Marostegui: Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732807
[04:38:03] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732807 (owner: Marostegui)
[04:38:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17575 and previous config saved to /var/cache/conftool/dbconfig/20211022-043845-root.json
[04:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:03] <marostegui_>	 !log Deploy schema change on s8 codfw - T291719
[04:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:09] <stashbot>	 T291719: Remove abuse_filter_log.afl_filter column and adjust schema consequently from Wikimedia production - https://phabricator.wikimedia.org/T291719
[04:53:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17576 and previous config saved to /var/cache/conftool/dbconfig/20211022-045349-root.json
[04:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:51] <icinga-wm>	 RECOVERY - SSH on puppetmaster1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17577 and previous config saved to /var/cache/conftool/dbconfig/20211022-050852-root.json
[05:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17578 and previous config saved to /var/cache/conftool/dbconfig/20211022-052356-root.json
[05:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17579 and previous config saved to /var/cache/conftool/dbconfig/20211022-053900-root.json
[05:39:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17580 and previous config saved to /var/cache/conftool/dbconfig/20211022-055403-root.json
[05:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:05] <wikibugs>	 (PS4) Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787)
[06:57:24] <wikibugs>	 (PS5) Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211022T0700)
[07:07:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add remaining ownership annotations for ML services [puppet] - https://gerrit.wikimedia.org/r/732268 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:08:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add ownership annotations for Data Engineering services [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:09:56] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[07:11:09] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/732414 (owner: Dzahn)
[07:15:47] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:21] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) (owner: Legoktm)
[07:21:01] <wikibugs>	 SRE, Observability-Logging, Traffic, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema)
[07:21:39] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[07:42:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] profile: add enable_relay flag to statsd exporter profile [puppet] - https://gerrit.wikimedia.org/r/732827 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite)
[07:42:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[07:44:22] <wikibugs>	 (CR) MMandere: [C: +2] prometheus: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[07:45:36] <wikibugs>	 (PS1) Muehlenhoff: Set ganeti2025/2026 to insetup [puppet] - https://gerrit.wikimedia.org/r/732911 (https://phabricator.wikimedia.org/T282603)
[07:46:41] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[07:46:47] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[07:47:27] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[07:49:45] <wikibugs>	 (PS1) Ema: Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879)
[08:00:23] <ema>	 !log deployment-cache-text06: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879
[08:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:31] <stashbot>	 T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough  - https://phabricator.wikimedia.org/T293879
[08:01:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Set ganeti2025/2026 to insetup [puppet] - https://gerrit.wikimedia.org/r/732911 (https://phabricator.wikimedia.org/T282603) (owner: Muehlenhoff)
[08:09:42] <wikibugs>	 (CR) Jbond: [C: -1] "please hold off merging this, I want to discuss it with moritz to see if/how we want to track the uid/gid" [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[08:14:32] <wikibugs>	 (CR) Jbond: cumin: add an alias for new pki roles and add to misc-others (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:14:55] <wikibugs>	 (PS3) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:14:59] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:18:19] <wikibugs>	 (PS4) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:18:34] <godog>	 mmhh I'll take a look at the grafana sync failure
[08:19:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:20:02] <godog>	 hah, an rsync race while transferring the sqlite journal
[08:20:05] <wikibugs>	 (PS5) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:20:37] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:21:28] <godog>	 I'll file a task for now, we'll likely need rsync options in quickdatacopy I think to be able to tweak file selection
[08:23:48] <ema>	 !log cp3062: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879
[08:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:55] <stashbot>	 T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough  - https://phabricator.wikimedia.org/T293879
[08:24:07] <wikibugs>	 SRE, Observability-Metrics: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (fgiunchedi)
[08:24:10] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:47] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:10] <wikibugs>	 (CR) Muehlenhoff: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[08:28:44] <wikibugs>	 (CR) Ema: [V: +2 C: +2] Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:31:44] <wikibugs>	 (PS1) MMandere: exim: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787)
[08:34:16] <wikibugs>	 (PS1) Majavah: Use most specific prefix for dns record site assignment [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082)
[08:36:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS buster
[08:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:09] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS buster
[08:36:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[08:44:06] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:42] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:51:51] <wikibugs>	 (CR) Btullis: [C: +2] Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:55:11] <wikibugs>	 (PS1) Btullis: Remove the HDFS corrupt blocks check from Icinga [puppet] - https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399)
[08:55:23] <wikibugs>	 (Merged) jenkins-bot: Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:56:24] <btullis>	 joal: good news, those pending compactions have started to drop on cassandra aqs1012-b. Not wedged after all.
[08:58:21] <wikibugs>	 (PS1) Muehlenhoff: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/732924
[08:59:19] <wikibugs>	 (PS1) Ema: varnishttfb.mtail: use native histogram type [puppet] - https://gerrit.wikimedia.org/r/732925 (https://phabricator.wikimedia.org/T293879)
[09:00:54] <btullis>	 Wrong channel, sorry.
[09:01:32] <Lucas_WMDE>	 good news still sounds good ^^
[09:01:59] <kormat>	 :)
[09:04:20] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS buster
[09:04:55] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS buster completed: - ganeti2025 (**PASS**)   - Downtimed on Ici...
[09:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:02] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:11:12] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/ResubmitChanges.php wikidatawiki --minimum-age $((60*60*12)) # T294029
[09:11:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:18] <stashbot>	 T294029: Run ResubmitChanges.php to resubmit stuck changes from 2021-10-21 14:26 UTC - https://phabricator.wikimedia.org/T294029
[09:18:52] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM but best wait for volans to comment." [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082) (owner: Majavah)
[09:20:39] <wikibugs>	 (CR) Muehlenhoff: Add ownership annotations for WMCS services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:25:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, very nice!" [puppet] - https://gerrit.wikimedia.org/r/732925 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[09:25:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ayounsi) Thanks!  > I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no? That's one...
[09:27:00] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/732924 (owner: Muehlenhoff)
[09:32:14] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: Legoktm)
[09:35:59] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:01:17] <wikibugs>	 (PS1) Jbond: cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930
[10:03:36] <wikibugs>	 (CR) Jbond: [C: -1] Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[10:05:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS buster
[10:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:15] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster
[10:16:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] httpbb: change headers test for /static/current [puppet] - https://gerrit.wikimedia.org/r/732280 (https://phabricator.wikimedia.org/T285298) (owner: Giuseppe Lavagetto)
[10:27:24] <wikibugs>	 (PS1) Elukey: kserve: add network policies [deployment-charts] - https://gerrit.wikimedia.org/r/732939
[10:33:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS buster
[10:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:09] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster completed: - ganeti2026 (**PASS**)   - Downtimed on Ici...
[10:33:37] <wikibugs>	 (PS2) Jbond: cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930
[10:33:52] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930 (owner: Jbond)
[10:36:11] <wikibugs>	 (PS2) Majavah: debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288
[10:37:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) a:aborrero→ayounsi >>! In T289882#7450242, @ayounsi wrote: > Which means increasing our attack surface as well as...
[10:40:22] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:40:47] <wikibugs>	 (PS1) Jbond: changelog: fix distro [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732941
[10:46:53] <jbond>	 !log upload cas_6.4.2-1+wmf10u1
[10:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:52] <wikibugs>	 (CR) Btullis: "Thanks both." [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[10:48:55] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828)
[10:53:16] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:48] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:00:28] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:15] <wikibugs>	 (CR) Jbond: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[11:08:41] <wikibugs>	 (PS1) Jbond: Revert "P:idp::standalon: remove unused profile" [puppet] - https://gerrit.wikimedia.org/r/732812
[11:08:51] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] Revert "P:idp::standalon: remove unused profile" [puppet] - https://gerrit.wikimedia.org/r/732812 (owner: Jbond)
[11:12:22] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:16:38] <wikibugs>	 (PS1) Jbond: P:idp::standalon: switch to P:base::production [puppet] - https://gerrit.wikimedia.org/r/732945
[11:17:13] <wikibugs>	 (CR) Jbond: [C: +2] P:idp::standalon: switch to P:base::production [puppet] - https://gerrit.wikimedia.org/r/732945 (owner: Jbond)
[11:19:52] <wikibugs>	 (CR) Michael Große: [C: +1] Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:29:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (EChetty) @Dzahn  Hey Dan! Thanks for setting this up.  However it seems that some of the sites are not authenticating me correctly when I try to access them:  For eg. When I...
[11:34:06] <wikibugs>	 SRE, Cassandra, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability): Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966 (Aklapper)
[11:34:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:34:18] <wikibugs>	 (CR) Majavah: [C: +2] debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:35:36] <wikibugs>	 (Merged) jenkins-bot: debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:36:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm::kubectl: install kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/732747 (owner: Majavah)
[11:39:15] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604)
[11:39:17] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604)
[11:39:19] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604)
[11:40:16] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:40:32] <wikibugs>	 (CR) MMandere: [C: +2] exim: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[11:43:38] <wikibugs>	 (PS1) Btullis: Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641)
[11:44:28] <wikibugs>	 (PS1) Majavah: kubectl::kubeadm: make kubectl-sudo executable [puppet] - https://gerrit.wikimedia.org/r/732953
[11:49:16] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "I’m not sure if I want to deploy this on Monday or wait longer, but putting it up for review already." [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:49:21] <wikibugs>	 (CR) Btullis: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[11:50:15] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "This feels like a riskier change than others, because it touches a MediaWiki core setting – I didn’t find any other uses of this lock mana" [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:53:24] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:55:08] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:27] <wikibugs>	 Puppet, Infrastructure-Foundations, observability, Patch-For-Review, User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (fgiunchedi) @jbond IIRC for this we went the logstash way, anything else to be done and/or missing ?
[11:58:15] <wikibugs>	 Puppet, Infrastructure-Foundations, observability, Patch-For-Review, User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (jbond) Open→Resolved a:jbond That's correct, although it's still a work in progress, however this one i thin...
[12:00:37] <wikibugs>	 (PS1) MMandere: ntp: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787)
[12:01:02] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:18] <wikibugs>	 (CR) Btullis: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[12:02:25] <wikibugs>	 (PS2) Btullis: Remove all remaining references to alluxio [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641)
[12:03:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:04:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] kubectl::kubeadm: make kubectl-sudo executable [puppet] - https://gerrit.wikimedia.org/r/732953 (owner: Majavah)
[12:05:25] <wikibugs>	 SRE, observability, User-jbond: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (fgiunchedi) Open→Resolved a:fgiunchedi Closing as we're in a good place nowadays  ` root@prometheus1004:~# apache2ctl graceful root@prometheus1004:~#  `
[12:15:05] <wikibugs>	 (CR) MMandere: [C: +2] ntp: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:15:09] <wikibugs>	 (PS1) MVernon: codfw-prod: final weight to ms-be20[62-65] [software/swift-ring] - https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458)
[12:23:27] <wikibugs>	 (PS3) Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: Eric Gardner)
[12:25:28] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:05] <wikibugs>	 (03PS1) 10MMandere: grafana: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/732959 (https://phabricator.wikimedia.org/T282787)
[12:29:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458) (owner: 10MVernon)
[12:31:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] add centrallog2002 to codfw anycast_neighbors and syslog fw allows [homer/public] - 10https://gerrit.wikimedia.org/r/731828 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron)
[12:31:44] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] upgrade-varnish: support frontend instance only [cookbooks] - 10https://gerrit.wikimedia.org/r/731935 (owner: 10Ema)
[12:32:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi)
[12:32:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086)
[12:33:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add drmrs network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/732351 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi)
[12:33:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: 10Ayounsi)
[12:34:33] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086)
[12:35:31] <wikibugs>	 (03PS3) 10Jbond: O:puppetboard::ng: add new role [puppet] - 10https://gerrit.wikimedia.org/r/732368
[12:35:34] <wikibugs>	 (03CR) 10Muehlenhoff: Remove all remaining references to alluxio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[12:36:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:puppetboard::ng: add new role [puppet] - 10https://gerrit.wikimedia.org/r/732368 (owner: 10Jbond)
[12:36:44] <wikibugs>	 10SRE, 10SRE Observability, 10observability, 10Graphite, 10Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving since graphite failover nowadays is much better and documented at https://wikitech.wiki...
[12:36:53] <wikibugs>	 10SRE, 10SRE Observability, 10observability, 10Documentation, 10Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done https://wikitech.wikimedia.org/wiki/Graphite#Operations_manual
[12:37:02] <wikibugs>	 10SRE, 10WMDE-Analytics-Engineering, 10Graphite, 10Patch-For-Review, 10Tracking-Neverending: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (10fgiunchedi)
[12:38:56] <wikibugs>	 10SRE, 10Observability-Metrics, 10observability: grafana access control - https://phabricator.wikimedia.org/T108546 (10fgiunchedi) 05Open→03Declined Resolving this as we're moving away from Graphite
[12:39:46] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: final weight to ms-be20[62-65] [software/swift-ring] - 10https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458) (owner: 10MVernon)
[12:40:02] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:03] <wikibugs>	 10SRE, 10Observability-Metrics, 10observability: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This happened as part of {T247963} where we recreated whisper files on the reimaged hosts
[12:40:52] <wikibugs>	 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (10hashar)
[12:41:04] <wikibugs>	 (03PS1) 10Jbond: P:puppetboard::ng: use P:base:::production [puppet] - 10https://gerrit.wikimedia.org/r/732961
[12:42:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:puppetboard::ng: use P:base:::production [puppet] - 10https://gerrit.wikimedia.org/r/732961 (owner: 10Jbond)
[12:44:41] <wikibugs>	 10SRE, 10Observability-Alerting: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (10fgiunchedi) 05Open→03Invalid Tentatively resolving since we're moving away from icinga-based timeseries alerts and onto Alertmanager. For the latter the lack of a dashb...
[12:44:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - 10https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086) (owner: 10Arturo Borrero Gonzalez)
[12:46:14] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:04] <wikibugs>	 10SRE: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122 (10fgiunchedi) -observability for backlog cleanup, unclear whether we want/need this
[12:49:42] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[12:51:11] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Remove dispatchChanges.php-related Wikibase settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[12:51:25] <wikibugs>	 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Given that nowadays all Grafana alerts show up at https://alerts.wikimedia.org and...
[12:51:29] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[12:51:34] <wikibugs>	 (03PS1) 10Jbond: O:puppetboard::ng: Add config for cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/732963
[12:53:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:puppetboard::ng: Add config for cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/732963 (owner: 10Jbond)
[12:53:32] <wikibugs>	 10SRE, 10Observability-Logging, 10Traffic, 10Patch-For-Review, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) >>! In T293879#7450109, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations),...
[12:54:36] <wikibugs>	 (03PS1) 10Jbond: O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964
[12:54:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964 (owner: 10Jbond)
[12:54:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964 (owner: 10Jbond)
[12:56:01] <wikibugs>	 10SRE, 10Icinga, 10SRE Observability, 10observability: icinga really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Implemented at https://gerrit.wikimedia.org/r/c/operations/alerts/+/710248
[12:57:28] <wikibugs>	 (03PS1) 10Jbond: O:puppetboard::ng: fix typo cfssl vs ssl [puppet] - 10https://gerrit.wikimedia.org/r/732965
[12:58:00] <wikibugs>	 10SRE, 10SRE Observability, 10observability, 10Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (10fgiunchedi) 05Open→03Invalid No longer the case, graphite hosts (Bullseye) come up fine after a reboot nowadays
[12:58:08] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Reedy) Might be worth double checking the wikitech-l and mediawiki-l footers too...
[13:00:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:puppetboard::ng: fix typo cfssl vs ssl [puppet] - 10https://gerrit.wikimedia.org/r/732965 (owner: 10Jbond)
[13:01:58] <wikibugs>	 10SRE, 10Contributors-Team, 10observability, 10Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10hashar) I have filled that one as part of an incident followup task but #release-engineering-team is...
[13:02:14] <wikibugs>	 (03PS1) 10Urbanecm: Deploy Growth mentor dashboard to phase II wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732967 (https://phabricator.wikimedia.org/T278920)
[13:05:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/732959 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere)
[13:07:31] <wikibugs>	 (03PS1) 10Hashar: zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642)
[13:12:04] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Prefer mx1001 over mx2001 for weights in MX records [dns] - 10https://gerrit.wikimedia.org/r/732924 (owner: 10Muehlenhoff)
[13:16:52] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604)
[13:16:54] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604)
[13:16:56] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604)
[13:16:58] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604)
[13:17:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[13:18:27] <wikibugs>	 (03PS1) 10Ema: Use ats-tls metrics for edge traffic drop alert [alerts] - 10https://gerrit.wikimedia.org/r/732970 (https://phabricator.wikimedia.org/T293879)
[13:25:33] <wikibugs>	 10SRE, 10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10fgiunchedi)
[13:27:52] <wikibugs>	 (03PS1) 10Zabe: Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971
[13:30:28] <jbond>	 !log upload python3-pypuppetdb_2.4.0-1_all.deb to bullseye
[13:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:52] <wikibugs>	 10SRE, 10Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10fgiunchedi)
[13:39:40] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - 10https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[13:39:51] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - 10https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[13:41:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've gone ahead and updated https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters adding a lot of information about the cluster plus se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris)
[13:42:10] <ema>	 !log deployment-cache-upload06: restart varnish-frontend, package got upgraded to 6.0.8 T294116
[13:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:17] <stashbot>	 T294116: Varnish reload failing on deployment-cache-upload06 - https://phabricator.wikimedia.org/T294116
[13:47:15] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Varnish reload failing on deployment-cache-upload06 - https://phabricator.wikimedia.org/T294116 (10ema) 05Open→03Resolved a:03ema I upgraded varnish to 6.0.8 everywhere (see T292290) and forgot about restarting the service on deployment-cache-upload06. I...
[13:48:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[13:49:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff)
[13:50:19] <wikibugs>	 (03PS2) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939
[13:51:30] <wikibugs>	 (03PS1) 10Hashar: zuul: gracefully shutdown [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040)
[13:54:03] <icinga-wm>	 PROBLEM - puppetboard on puppetboard1002 is CRITICAL: connect to address 10.64.48.59 and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:58:31] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[13:59:14] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE))
[14:00:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Use ats-tls metrics for edge traffic drop alert [alerts] - 10https://gerrit.wikimedia.org/r/732970 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema)
[14:00:30] <wikibugs>	 (03CR) 10Btullis: Remove all remaining references to alluxio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[14:12:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[14:12:37] <wikibugs>	 (03CR) 10Michael Große: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große)
[14:12:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[14:16:51] <icinga-wm>	 PROBLEM - puppetboard on puppetboard2002 is CRITICAL: connect to address 10.192.32.30 and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[14:21:47] <wikibugs>	 (03PS3) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834)
[14:24:50] <wikibugs>	 (03PS4) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031)
[14:25:44] <wikibugs>	 (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große)
[14:32:33] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große)
[14:40:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) Try visiting https://idp.wikimedia.org/logout and then logging back in?
[14:45:01] <wikibugs>	 (03PS4) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834)
[14:54:38] <wikibugs>	 (03CR) 10Ahmon Dancy: "No changes reported by PCC" [puppet] - 10https://gerrit.wikimedia.org/r/732844 (owner: 10Ahmon Dancy)
[14:57:00] <wikibugs>	 (03PS1) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767)
[15:06:35] <jbond>	 !log upload puppetboard_3.1.0-1_all.deb to bullseye-wikimedia
[15:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:41] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:33] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:33] <wikibugs>	 (03PS1) 10Majavah: puppetmaster::gitsync: Replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/732991 (https://phabricator.wikimedia.org/T273673)
[15:25:17] <wikibugs>	 (03PS1) 10Btullis: Add three more HDFS related checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399)
[15:31:35] <wikibugs>	 (03PS5) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834)
[15:32:25] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Aklapper) * https://lists.wikimedia.org/postorius/lists/mediawiki-l.lists.wikimedia.org/templates is empty. * https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/templ...
[15:40:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:05] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:17] <wikibugs>	 (03PS1) 10Hashar: zuul: double git-daemon max connections 48 -> 96 [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661)
[15:50:24] <wikibugs>	 (03CR) 10Hashar: "We have bumped the limit two years ago ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/508408 ). While looking at the log today we " [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar)
[15:53:38] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276)
[15:54:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31853/console" [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond)
[15:54:59] <wikibugs>	 (03PS2) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276)
[15:58:09] <wikibugs>	 (03PS3) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276)
[15:58:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31855/console" [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond)
[15:59:35] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond)
[16:01:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10EChetty) @CDanis   Hey Chris, tried that since it seemed to work for Luke - but no dice :(  Also tried flushing my cookies/cache, changing browser and the good ol' turning o...
[16:10:49] <wikibugs>	 (03PS1) 10Cwhite: logstash: bugfix logstash logEvent json encoding [puppet] - 10https://gerrit.wikimedia.org/r/733007
[16:25:04] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH)
[16:25:12] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH)
[16:25:17] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:36] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH)
[16:26:12] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) a:03Papaul
[16:31:25] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:29] <wikibugs>	 (03PS9) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618)
[16:32:48] <wikibugs>	 (03PS8) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618)
[16:32:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[16:33:05] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH)
[16:33:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[16:33:15] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH)
[16:33:36] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH)
[16:33:52] <wikibugs>	 (03PS9) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618)
[16:34:02] <wikibugs>	 (03PS10) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618)
[16:34:04] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH)
[16:34:21] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH) a:03Jclark-ctr
[16:37:24] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Add frpm1002, frauth1002, pay-lvs1003, pay-lvs1004 [dns] - 10https://gerrit.wikimedia.org/r/732834 (https://phabricator.wikimedia.org/T289812) (owner: 10Dwisehaupt)
[16:40:23] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:48:17] <wikibugs>	 (03PS5) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031)
[16:48:19] <wikibugs>	 (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große)
[16:49:01] <wikibugs>	 (03PS6) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031)
[16:49:03] <wikibugs>	 (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große)
[16:53:31] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:02:09] <wikibugs>	 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10Majavah) See also: {T294034}
[17:02:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[56] - https://phabricator.wikimedia.org/T293909 (10RobH)
[17:04:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10RobH)
[17:04:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: double git-daemon max connections 48 -> 96 [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar)
[17:06:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (10Majavah) 05Open→03Resolved This was done at some point.
[17:09:37] <wikibugs>	 (03CR) 10Herron: [C: 03+1] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[17:10:01] <wikibugs>	 (03CR) 10Hashar: "Moritz may you review the systemd magic?  Pretty sure you have more experience than me on that regard ;)  No urgency, the task has been ar" [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040) (owner: 10Hashar)
[17:10:42] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn)
[17:11:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) Ah, I see the problem, you weren't added to the `wmf` LDAP group.  I've added you -- try https://idp.wikimedia.org/logout and then try again please?
[17:11:45] <wikibugs>	 (03CR) 10Hashar: "Danke Schon!" [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar)
[17:12:07] <wikibugs>	 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10RobH)
[17:12:26] <wikibugs>	 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10RobH) a:03Papaul
[17:14:00] <wikibugs>	 (03CR) 10Hashar: "That should not add any spam to our list, the cron job never errored out and the other one is for Zuul smtp reporter which is not used." [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar)
[17:19:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10EChetty) @CDanis    Amazing -> That plus a cookie flush seemed to do the trick :)  Thank you!
[17:19:42] <wikibugs>	 (03PS1) 10Majavah: scap: Use service name for logstash-beta [puppet] - 10https://gerrit.wikimedia.org/r/733023
[17:25:07] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:35] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: bugfix logstash logEvent json encoding [puppet] - 10https://gerrit.wikimedia.org/r/733007 (owner: 10Cwhite)
[17:26:20] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/732827 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[17:28:40] <wikibugs>	 (03CR) 10Herron: [C: 03+1] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[17:29:32] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[17:31:19] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:54] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hiera: add minimal logstash-beta-next hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[17:36:17] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[17:44:50] <wikibugs>	 (03PS1) 10AOkoth: gitlab: add data for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025
[17:55:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Dwisehaupt) 05Resolved→03Open @Cmjohnson I believe the network config was swapped for all the hosts. When attempting to build them I see that the pay-l...
[17:56:31] <wikibugs>	 (03PS2) 10AOkoth: gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025
[17:59:40] <wikibugs>	 (03PS3) 10AOkoth: gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025
[18:00:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025 (owner: 10AOkoth)
[18:15:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (10Dzahn)
[18:24:45] <wikibugs>	 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) It is possible to get the...
[18:25:13] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:36] <wikibugs>	 (03CR) 10Dzahn: zuul: double git-daemon max connections 48 -> 96 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar)
[18:31:25] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] cumin: add an alias for new pki roles and add to misc-others [puppet] - 10https://gerrit.wikimedia.org/r/732425 (owner: 10Dzahn)
[18:32:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn)
[18:34:08] <wikibugs>	 (03CR) 10Dzahn: "added to "misc-ops" but too late to edit the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/732425 (owner: 10Dzahn)
[18:40:27] <wikibugs>	 (03PS3) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414
[18:43:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap: Use service name for logstash-beta [puppet] - 10https://gerrit.wikimedia.org/r/733023 (owner: 10Majavah)
[18:45:49] <wikibugs>	 (03PS3) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006)
[18:45:51] <wikibugs>	 (03CR) 10Ebernhardson: query_service: Add new oauth related configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson)
[18:46:09] <wikibugs>	 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Krinkle)
[18:46:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson)
[18:51:05] <wikibugs>	 (03PS4) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414
[18:51:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn)
[18:55:15] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:56:00] <wikibugs>	 (03PS2) 10Dzahn: simplelamp2: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731183
[18:56:09] <wikibugs>	 (03PS1) 10Ahmon Dancy: thumbor: Remove conditionalization for stretch [puppet] - 10https://gerrit.wikimedia.org/r/733033
[19:07:43] <wikibugs>	 (03PS4) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006)
[19:08:23] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:11:46] <mutante>	 ignoring that based on the word "test" in it
[19:14:49] <wikibugs>	 (03PS1) 10Accraze: ml-services: add enwiki-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141)
[19:17:00] <urbanecm>	 !log Start server-side upload of 1 video file (T294134)
[19:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:07] <stashbot>	 T294134: Please upload a 556 MB video file to Wikimedia Commons - https://phabricator.wikimedia.org/T294134
[19:24:21] <wikibugs>	 (03CR) 10Legoktm: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris)
[19:38:19] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:39:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:39:57] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:46:25] <wikibugs>	 (03PS3) 10Urbanecm: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347)
[19:46:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm)
[19:46:42] <wikibugs>	 (03PS4) 10Urbanecm: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347)
[19:51:29] <mutante>	 re: router alerts - those already have comments about existing Telia trouble tickets
[19:51:54] <mutante>	 and Telia just mailed a couple hours ago that they saw a flap and are keeping an eye on it ..roughly
[19:55:55] <icinga-wm>	 ACKNOWLEDGEMENT - puppetboard on puppetboard1002 is CRITICAL: connect to address 10.64.48.59 and port 8001: Connection refused daniel_zahn reimaged per SAL - no ticket though https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[19:55:55] <icinga-wm>	 ACKNOWLEDGEMENT - puppetboard on puppetboard2002 is CRITICAL: connect to address 10.192.32.30 and port 8001: Connection refused daniel_zahn reimaged per SAL - no ticket though https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[19:56:20] <Spookreeeno>	 mutante: https://github.com/wikimedia/puppet/commit/2d1741b2d12935b89c9800b3c5ece38df8e0b223#diff-b2ce9b71fdce7711edb9ccfeb1d69e9974a469bf5d5f7687e65598aa49e9ba8b
[19:57:29] <mutante>	 Spookreeeno: ACK, thanks. it doesn't have a ticket though
[19:57:38] <Spookreeeno>	 Nope
[19:58:41] <Spookreeeno>	 John is
[19:58:52] <Spookreeeno>	 Probably gone for weekend now
[20:02:54] <jclark-ctr>	 if you are talking about me i am on site now if anything is needed
[20:03:18] <mutante>	 jclark-ctr: the other John:) thank you very much, we are good
[20:06:25] <wikibugs>	 10SRE, 10ops-eqiad: eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Jclark-ctr) Cable has been run shows link.   netbox has not been updated yet #2009 15m. pp219588361  <-> to cr1-eqiad:xe-3/0/6.
[20:09:36] <AntiComposite>	 reports of timeouts from a few users on Discord
[20:10:13] <dontpanic>	 oh, I'm not the only one
[20:10:13] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:11:09] <dontpanic>	 phab isn't loading so can't create a task
[20:11:18] <urbanecm>	 dontpanic: feel free to PM, I'll relay
[20:12:15] <urbanecm>	 mutante: are SREs on the issue? Or should I use klaxon for the first time? :D
[20:13:31] <mutante>	 urbanecm: things are working for me but it's suspicious that we see that comment on the eqiad patch right before?
[20:13:42] <mutante>	 no, we have not been paged
[20:13:57] <urbanecm>	 as i said, there are user reports. And NEL reports in logstash also went up significantly.
[20:14:17] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 64 probes of 711 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:14:27] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:14:31] <Spookreeeno>	 mutante: ^
[20:14:53] <majavah>	 can't reproduce
[20:15:04] <mutante>	 XioNoX: users report issues right after a cable was patched in Eqiad but things work for me
[20:15:17] <urbanecm>	 https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 is what Im looking on btw
[20:15:20] <mutante>	 jclark-ctr: are you working with someone on that cable thing?
[20:15:32] <Spookreeeno>	 Seem fine in UK
[20:15:38] <Spookreeeno>	 Tried enwiki + meta
[20:15:54] <icinga-wm>	 PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[20:15:59] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 247 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:16:07] <legoktm>	 that's new
[20:16:09] <AntiComposite>	 the RIPE map looks a lot like the last Telia problem
[20:16:09] <AmandaNP>	 Can't load any wiki in Canada
[20:16:11] <rzl>	 looking
[20:16:13] <Spookreeeno>	 urbanecm: we paged
[20:16:14] <majavah>	 the logstash dashboard shows US and BR as most affected
[20:16:26] <Spookreeeno>	 legoktm: about a week old I think. Chris did it
[20:16:27] <wikibugs>	 (03PS1) 10BBlack: Revert "ntp: Add drmrs DC site" [puppet] - 10https://gerrit.wikimedia.org/r/733040 (https://phabricator.wikimedia.org/T282787)
[20:16:29] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service,netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:47] <urbanecm>	 Spookreeeno: I don't understand the "we paged" comment
[20:16:52] <mutante>	 Telia had issues again
[20:16:57] <Spookreeeno>	 urbanecm: a page just went off
[20:17:15] <Spookreeeno>	 Not sure why I said we as not me obviously
[20:17:37] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "ntp: Add drmrs DC site" [puppet] - 10https://gerrit.wikimedia.org/r/733040 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack)
[20:17:38] <legoktm>	 what DC are people getting timeouts from?
[20:17:45] <legoktm>	 what kind of timeouts?
[20:17:46] <mutante>	 Telia just mailed us again
[20:17:50] <greg-g>	 legoktm: eqiad
[20:17:56] <mutante>	 "Suspected Cable fault in St Louis and your circuits are affected "
[20:18:06] <greg-g>	 me, personally (see the private channel for my basic info)
[20:18:17] <bblack>	 it takes a few minutes for everything to re-converge, even when the link goes down cleanly in an obvious way
[20:18:30] <jclark-ctr>	 I am at eqiad about to leave just want to check if anything is needed at eqiad.
[20:18:44] <rzl>	 bblack: is your drmrs revert in response to the alert, or is that unrelated?
[20:18:51] <bblack>	 rzl: unrelated
[20:18:53] <rzl>	 thanks
[20:19:12] <bblack>	 [puppet's broken on some core dns/ntp servers from the change I'm reverting]
[20:19:18] <Spookreeeno>	 Gerrit issues from EU/UK
[20:19:20] <legoktm>	 we can temporarily depool eqiad I suppose
[20:19:36] <urbanecm>	 ftr, got https://phabricator.wikimedia.org/P17584 from one of the affected users
[20:19:48] <urbanecm>	 (dontpanic, to be precise)
[20:19:58] <bblack>	 depooling eqiad would only make sense if Telia's still erroneously advertising our prefixes with the link to them dead
[20:20:01] <greg-g>	 I'd paste to phab but I can't connect :)
[20:20:03] <majavah>	 I'm having trouble understanding the logstash dashboard.. what is "NELs by server IP"? is it "where clients are failing to connect" or "where we received the reports"?
[20:20:04] <bblack>	 otherwise we just have to wait for converge
[20:20:08] <lucaswerkmeister>	 I can provide broken and working traceroutes from Germany if you want
[20:20:11] <lucaswerkmeister>	 (but not on Phab either ^^)
[20:20:29] <urbanecm>	 greg-g: I'm happy to act as a relay if needed :D
[20:20:34] <greg-g>	 :)
[20:20:41] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 295 probes of 705 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:20:57] <mutante>	 it's Telia wave between eqiad and codfw but that started a while before we started getting the NEL reports
[20:21:00] <AntiComposite>	 majavah, NEL is https://wikitech.wikimedia.org/wiki/Network_Error_Logging
[20:21:20] <majavah>	 AntiComposite: I know, the dashboard is just confusing
[20:21:31] <bblack>	 oh I thought we were talking about transit fail?
[20:21:52] <legoktm>	 I think multiple things are having issues?
[20:22:05] <mutante>	 bblack: "Suspected Cable fault in St Louis and your circuits are affected"
[20:22:07] <legoktm>	 well, there's users seeing timeouts to eqiad
[20:22:25] <legoktm>	 and that
[20:22:52] <Spookreeeno>	 I can't even get DNS for toolforge tools
[20:23:03] <Spookreeeno>	 SAL is not resolved for me
[20:23:39] <wikibugs>	 (03PS1) 10BBlack: Depool eqiad temporarily [dns] - 10https://gerrit.wikimedia.org/r/733043
[20:24:11] <urbanecm>	 https://phabricator.wikimedia.org/P17585 is from greg-g
[20:24:25] <urbanecm>	 it...appears to reach xe-0-1-4.cr2-eqord.wikimedia.org ?
[20:24:26] <XioNoX>	 I didn't get page but saw the irc tag, getting my laptop
[20:24:50] <greg-g>	 I can't verify what urbanecm says but I did send him my info ;)
[20:25:18] <XioNoX>	 is there a TLDR?
[20:25:23] <AntiComposite>	 (klaxon doesn't list any recent pages, iirc it usually does for pages from alerting)
[20:25:26] <bblack>	 XioNoX: Telia outage
[20:25:31] <bblack>	 (fiber cut)
[20:25:51] <greg-g>	 mtr is running and I'll keep an eye on it for recovery/changes
[20:26:12] <AntiComposite>	 impact appears roughly the same as the Telia outage earlier this month
[20:26:38] <XioNoX>	 telia interface in eqiad is up, should I take BGP down (still catching up)
[20:27:34] <mutante>	 XioNoX: Telia reported a new cut and    IC-313592 and    IC-314534  eqiad -codfw
[20:27:36] <urbanecm>	 https://phabricator.wikimedia.org/P17586 is from lucaswerkmeister, ftr.
[20:28:47] <bblack>	 the transport one (IC-314534) the interface appears to be down, so that's good at that level
[20:29:15] <XioNoX>	 mutante: ok, so that's ulsfo-eqord and eqord-eqiad
[20:29:34] <XioNoX>	 sorry eqord-codfw, no eqiad
[20:30:32] <mutante>	 XioNoX: about 50 minutes ago we had Icinga alerts about cr3-ulsfo, cr2-eqord and cr2-codfw
[20:30:35] <bblack>	 yeah the eqord-eqiad one seems like it's down on both ends, so it must be saturation elsewhere causing issues?
[20:30:52] <mutante>	 the exact alerts that already had comments on Icinga for ongoing Telia issue
[20:31:04] <bblack>	 also: I have a patch to dns-depool eqiad, but I'm not clear yet if that will improve the situation or just move problems around
[20:31:09] <mutante>	 the NEL and user report did not start until a while after that
[20:31:14] <bblack>	 any informed opinion?
[20:31:43] <XioNoX>	 no saturation on our side at least
[20:31:51] <XioNoX>	 but it's confusing where the issue is exactly
[20:32:05] <legoktm>	 IIRC weren't we already coming close to capacity on eqiad<-->codfw? I think we don't want to go over that
[20:32:26] <cdanis>	 saturation on transport links shouldn't be causing these flavors of NELs nor the RIPE Atlas alert
[20:32:28] <bblack>	 yeah but that shouldn't affect users reaching our edge
[20:32:35] <bblack>	 something else is going on
[20:32:54] <XioNoX>	 legoktm: we're fine on the codfw-eqiad especially if it's an emergency
[20:33:07] <bblack>	 the reports are about eqiad edge reachability, basically
[20:33:22] <XioNoX>	 (I mean in the case we have to depool, the codfw-eqiad links can handle it)
[20:33:26] <bblack>	 do we have a Telia transit there which is not on their list, but is affected-but-not-actually-down?
[20:33:33] <bblack>	 s/there/eqiad/
[20:34:12] <XioNoX>	 fyi I reach eqiad through telia without any loss
[20:34:23] * legoktm nods
[20:34:35] <cdanis>	 XioNoX: I am suspicious of a return path issue
[20:35:01] <XioNoX>	 could be yeah, https://phabricator.wikimedia.org/P17586  might be a HE issue?
[20:35:03] <cdanis>	 wth? ssh to bast1003 works fine
[20:35:07] <XioNoX>	 (or return of course)
[20:35:10] <cdanis>	 XioNoX: are we still splitting VRRPs
[20:35:26] <XioNoX>	 cdanis: yep
[20:35:33] <mutante>	 19:38 - 1 interface down on cr3-ulsfo, 19:39 - 2 interfaces down on cr2-eqord, 1 interface down on cr2-codfw.   19:50  Telia sends mail about new fiber cut    20:09 user reports on IRC
[20:35:37] <bblack>	 here we go
[20:35:40] <bblack>	 https://librenms.wikimedia.org/device/device=2/tab=port/port=11600/
[20:35:43] <wikibugs>	 (03PS2) 10Ahmon Dancy: thumbor: Remove conditionalization for stretch [puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148)
[20:35:54] <bblack>	 ^ equinix peering in eqiad, there's a dropoff in traffic, probably from telia fiber cut impacting other peers there?
[20:35:59] <cdanis>	 XioNoX: so to test return path issues I should try a mtr from a bunch of cache text hosts
[20:36:02] <bblack>	 maybe turn off the peering port for now?
[20:37:00] <bblack>	 XioNoX: sane theory above re: peering?
[20:37:04] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Another change which should only have effects in beta cluster (for a new host)." [puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy)
[20:37:08] <XioNoX>	 mutante: thx, I'd say the transport links are out of the possible problem so far, things rerouted internally
[20:37:17] <XioNoX>	 bblack: checking
[20:38:34] <XioNoX>	 bblack: does it match a drop of inbound traffic or increase of outbound somewhere else?
[20:39:05] <bblack>	 it's harder to see the smaller inbound side drop on the same interface (but I think it's there), but the outbound drop there is pretty dramatic.
[20:39:45] <bblack>	 I think we've got some peers over that exchange which we're still advertising in one or both directions with, but are affected by telia somehow and the peering traffic is borked.
[20:40:15] <mutante>	 ripe atlas probes: 62/711 failed to codfw v4  70/629 failed to codfw v6, 295/705 failed to eqiad v4, 249/622 failed to eqiad v6
[20:40:58] <XioNoX>	 bblack: could be, disabling peering in eqiad
[20:41:01] <cdanis>	 it's a return path issue for sure
[20:41:20] <XioNoX>	 !log disable sessions to equinix eqiad IXP
[20:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:59] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 74, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:42:00] <greg-g>	 it's back
[20:42:02] <cdanis>	 does equinix peering happen via cr1-eqiad ?
[20:42:09] <XioNoX>	 cdanis: cr2
[20:42:10] <greg-g>	 my mtr is happy now
[20:42:17] <cdanis>	 that was the one that was working for me
[20:42:20] <cdanis>	 I think the issue is deeper
[20:42:51] <XioNoX>	 greg-g: can you share your previous MTR?
[20:42:55] <greg-g>	 (phab, mw.org etc all loading successfully for me now, just to be explicit)
[20:42:59] <paladox>	 gerrit works for me now
[20:43:06] <legoktm>	 XioNoX: https://phabricator.wikimedia.org/P17585 I believe
[20:43:16] <lucaswerkmeister>	 my mtr also seems happy now fwiw
[20:43:17] <mutante>	 thanks paladox, users report things work now
[20:43:17] <Spookreeeno>	 Gerrit + Sal.toolforge fine for me now
[20:43:23] <XioNoX>	 I should have asked for a return MTR before taking the sessions down :)
[20:43:26] <cdanis>	 ok, librenms + gerrit also working for me
[20:43:32] <cdanis>	 XioNoX: I was trying to get one as you made the change :)
[20:43:38] <cdanis>	 my home IP was affected
[20:43:42] <dontpanic>	 it's back up here
[20:43:46] <greg-g>	 XioNoX: previous traceroute: https://phabricator.wikimedia.org/P17585
[20:43:49] <XioNoX>	 cdanis: blame bblack :)
[20:44:03] <cdanis>	 greg-g: yeah that's not a return path traceroute though
[20:44:06] <bblack>	 :P
[20:44:09] <cdanis>	 the internet is asymmetric
[20:44:17] <greg-g>	 yeah yeah, that's all I had :/
[20:44:24] <cdanis>	 ofc
[20:44:27] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 55 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:44:30] <cdanis>	 not something you can easily get without router or shell access anyway
[20:45:05] <greg-g>	 uh, it's back (the symptoms)
[20:45:17] <cdanis>	 greg-g: PM me your home IP
[20:45:17] <greg-g>	 can't connect to the eqiad lb again
[20:45:20] <XioNoX>	 greg-g: can you share your IP?
[20:45:48] <icinga-wm>	 RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[20:45:51] <paladox>	 gerrit's down for me :/
[20:45:53] <bblack>	 depooling eqiad in dns might alleviate user issues, but it might also rob us of evidence
[20:46:04] <XioNoX>	 bblack: go for it
[20:46:06] <bblack>	 ok
[20:46:13] <cdanis>	 XioNoX: you did commit confirmed but didn't confirm
[20:46:15] <XioNoX>	 bblack: we can do specific mtr if needed
[20:46:20] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Depool eqiad temporarily [dns] - 10https://gerrit.wikimedia.org/r/733043 (owner: 10BBlack)
[20:46:27] <cdanis>	 it auto rolled back
[20:46:31] <cdanis>	 per console message on cr2-eqiad
[20:47:06] <bblack>	 heh, the exchange fix rolled back?
[20:47:18] <dontpanic>	 issues are back here too
[20:47:24] <bblack>	 either way, the dns change is already pushing, and takes ~10 minutes to come into full effect for all
[20:47:28] <dontpanic>	 previous tracert https://phabricator.wikimedia.org/P17584
[20:47:28] <XioNoX>	 cdanis: er, yeah commiting for real
[20:47:30] <bblack>	 (TTL randomness)
[20:47:33] <Spookreeeno>	 Yeah gerrit gone here
[20:47:40] <mutante>	 is that the only reason it's back, because config change was rolled back?
[20:47:42] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH)
[20:47:47] <legoktm>	 bblack: so we're still going ahead with the eqiad depool?
[20:48:15] <paladox>	 gerrit works for me now
[20:48:19] <bblack>	 we've waffled on it too long anyways, my vote is stick with it for now and verify manually that we understand problems
[20:48:25] <bblack>	 (the dns depool)
[20:48:38] <cdanis>	 it will fix most of the impact, hopefully
[20:48:44] <cdanis>	 tools like gerrit and icinga will still be affected
[20:48:47] <legoktm>	 !log bblack has temporarily depooled eqiad https://gerrit.wikimedia.org/r/733043
[20:48:49] <greg-g>	 (mtr/etc are happy again)
[20:48:50] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH)
[20:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:52] <majavah>	 +1, having the sites up for most people seems preferred
[20:48:59] <Spookreeeno>	 Yep back for me now
[20:49:16] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH) a:03Jclark-ctr
[20:49:38] <robh>	 bah, imma stop using phab now cuz i forgot it spams into here during outage stuff.
[20:49:42] <bblack>	 it takes several minutes for most to see a real impact from the dns-level depool, so any immediate recoveries are probably from re-committing the exchange fix
[20:50:00] <cdanis>	 ^
[20:50:11] <cdanis>	 plus, tools like gerrit and icinga will still be affected regardless of eqiad pooledness status
[20:50:12] <bblack>	 but still, we don't have a firm grip on the issue, the exchange hack could just be moving problems around
[20:50:29] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 21 probes of 705 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:50:37] <cdanis>	 let's wait and see, between NELs and RIPE Atlas we can see if it is working
[20:50:56] <mutante>	 are we sure this has nothing to do with it?  "eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (Jclark-ctr) Cable has been run shows link."
[20:51:05] <mutante>	 just cause that was just minutes before reports started
[20:51:20] <cdanis>	 https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&from=now-6h&to=now
[20:51:32] <cdanis>	 according to RIPE Atlas, the issue started at 20:09
[20:51:40] <cdanis>	 bug post is at 20:06
[20:51:42] <cdanis>	 👀
[20:51:51] <bblack>	 oh?
[20:51:53] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 42 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:51:53] <XioNoX>	 could be, looking
[20:52:11] <bblack>	 heh
[20:52:17] <bblack>	 that would explain a lot of things! :)
[20:52:20] <cdanis>	 hmm!
[20:52:22] <XioNoX>	 yeah, that's most likely it
[20:52:23] <cdanis>	 the uh
[20:52:28] <cdanis>	 the NEL reports start at 20:06 exactly
[20:52:32] <cdanis>	 sooooo
[20:52:48] <cdanis>	 and a few minutes of delay in RIPE Atlas's probe result processing is typical
[20:52:51] <bblack>	 the traffic dropoff on the exchange is from a new second link that isn't fully provisioned stealing half the traffic and breaking it
[20:52:58] <XioNoX>	 return traffic from cr1 doesn't try to reach cr2 anymore, and peers don't accept the packets
[20:53:04] <XioNoX>	 it's a 2nd IP on the same IXP
[20:53:09] <mutante>	 I saw that and then wondered if that is scheduled maintenance
[20:53:27] <cdanis>	 XioNoX: that tracks -- cr1-eqiad sees my home IP as a black hole
[20:53:58] <cdanis>	 and greg-g's
[20:54:18] <XioNoX>	 I disabled the interface on cr1, going to re-enable the active one on cr2
[20:54:31] <legoktm>	 !log <XioNoX> I disabled the interface on cr1, going to re-enable the active one on cr2
[20:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:36] <mutante>	 I already pinged jclark but I guess he is onsite but afk
[20:54:45] <mutante>	 do we need to call him to unplug the new cable?
[20:54:52] <XioNoX>	 legoktm: thx :)
[20:55:00] <bblack>	 mutante: he said earlier he was about to leave
[20:55:02] <XioNoX>	 mutante: no, I disabled it
[20:55:06] <mutante>	 ok and ok
[20:55:28] <XioNoX>	 let me know if anyone is having any issue anymore
[20:56:08] <wikibugs>	 (03PS1) 10BBlack: Revert "Depool eqiad temporarily" [dns] - 10https://gerrit.wikimedia.org/r/733049
[20:56:11] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 711 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:56:20] <cdanis>	 https://i.imgur.com/DSxHzlg.png :) :) :)
[20:56:31] <lucaswerkmeister>	 works for me again
[20:56:54] <Spookreeeno>	 No issues here XioNoX
[20:57:03] <legoktm>	 OK for me to consider this incident resolved?
[20:57:13] <XioNoX>	 legoktm: yes
[20:57:15] <bblack>	 yeah, I need to revert the dns depool, but I think that's safe now
[20:57:18] <XioNoX>	 that was clearly the case
[20:57:21] <XioNoX>	 bblack: yep
[20:57:34] <XioNoX>	 I disabled the interface in Netbox as well
[20:57:35] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "Depool eqiad temporarily" [dns] - 10https://gerrit.wikimedia.org/r/733049 (owner: 10BBlack)
[20:57:39] <greg-g>	 thanks all
[20:57:51] <bblack>	 !log re-pooling eqiad in DNS
[20:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:08] <legoktm>	 OK, can someone else who is more networking savvy take on figuring out action items?
[20:58:39] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 46.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:59:07] <legoktm>	 ^ expected
[20:59:07] <bblack>	 it's probably going to be something more meta about being more aware/communicative/loud that we're plugging in new router ports and other such changes in general.
[20:59:41] <bblack>	 I think if we had all realized that change and its timestamp, we could've figured this out much faster.
[21:00:16] <bblack>	 there was a ticket update in this channel which should've clued us in, but I failed to notice it (mutante eventually brought it up, though!)
[21:00:41] <XioNoX>	 yeah, and the Telia stuff put us on the wrong path
[21:00:59] <mutante>	 yea, I marked the Telia related stuff as "unrelated" in the doc but did not remove it
[21:01:00] <greg-g>	  13:15:04 	<mutante>	XioNoX: users report issues right after a cable was patched in Eqiad but things work for me
[21:01:25] <mutante>	  put a 20:06 line in there when it actually started
[21:02:37] <XioNoX>	 it's 11pm here so I'm going to log off if everything is stable again, I'll follow up next week with action items, at first sight it's a process issue 
[21:03:07] <mutante>	 have a good Friday night
[21:03:41] <mutante>	 agreed, process issue
[21:04:31] <XioNoX>	 thanks everyone!
[21:05:25] <Spookreeeno>	 Have a good weekend all!
[21:07:50] <rzl>	 ahaha the VO alert just popped
[21:08:02] <XioNoX>	 fyi, I only got the victorops page now
[21:08:15] <rzl>	 for SREs responding because victorops just fired, the problem is resolved already, you can ignore it <3
[21:08:35] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 78.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:08:39] <mutante>	 heh
[21:08:48] <AntiComposite>	 yup, it shows in klaxon now too :)
[21:08:59] <AntiComposite>	 add that to the list 
[21:09:11] <herron>	 ok thx rzl
[21:09:20] <cdanis>	 XioNoX: yeah
[21:09:22] <cdanis>	 because
[21:09:29] <cdanis>	 our outbound connectivity from half of eqiad
[21:09:31] <cdanis>	 was broken
[21:09:33] <cdanis>	 😔
[21:10:09] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:47] <cdanis>	 legoktm: I've added a few meta-AIs (things to investigate more deeply later) but now I have to run off to care for the baby
[21:13:57] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: (Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH)
[21:14:05] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH)
[21:14:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] simplelamp2: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn)
[21:14:41] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH)
[21:15:09] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH) a:03Papaul
[21:16:07] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:19] <wikibugs>	 (03CR) 10Dzahn: "tested on existing user skins.reading-web-staging.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn)
[21:21:31] <wikibugs>	 (03CR) 10Dzahn: "done in simplelamp2 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/731183" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn)
[21:21:52] <wikibugs>	 (03CR) 10Dzahn: "@Jaime btw, for your info, this should be the fix for https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 as joe said there" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn)
[21:23:30] <wikibugs>	 10SRE, 10ops-eqiad: eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Dzahn) The interface has been disabled because this started a partial outage, which has been resolved now.
[21:25:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10wiki_willy) Entries updated on the Accounting Spreadsheet to eliminate related Netbox errors
[21:27:52] <mutante>	 also Telia located the damage and is now working on it
[21:29:55] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 24.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:31:46] <wikibugs>	 (03CR) 10Dzahn: ".. doesnt really fix it though" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn)
[21:31:57] <wikibugs>	 10SRE, 10Performance Issue: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10jhsoby) I haven't noticed this happening lately. @Tholme, how about you? If you haven't noticed it for a while either, I think we can close this.
[21:36:38] <wikibugs>	 10ops-codfw, 10DC-Ops: codfw Related Netbox Errors - https://phabricator.wikimedia.org/T294158 (10wiki_willy)
[21:37:49] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:41:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[21:44:56] <wikibugs>	 (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[21:46:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10wiki_willy)
[21:49:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10wiki_willy)
[21:50:04] <wikibugs>	 (03PS1) 10Dzahn: simplelamp2: add a notify->exec to restart apache before changing MPM [puppet] - 10https://gerrit.wikimedia.org/r/733081
[21:51:10] <wikibugs>	 (03PS10) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618)
[21:52:04] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10wiki_willy)
[21:52:48] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[22:04:21] <wikibugs>	 (03PS1) 10Dzahn: rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080)
[22:06:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn)
[22:07:20] <wikibugs>	 (03PS2) 10Dzahn: rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080)
[22:09:17] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (10Dzahn) @fgiunchedi Yea, I think we can add a parameter to just pass through to rsync's --exclude parameter and then use that to ignore the file. Patch upload...
[22:09:52] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (10Dzahn) p:05Triage→03Medium
[22:10:36] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Dzahn) 05Open→03In progress
[22:11:07] <wikibugs>	 10SRE, 10MediaWiki-extensions-TranslationNotifications, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.5; 2021-10-19): Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found - mediawiki_job_translationnotifications - https://phabricator.wikimedia.org/T293702 (10Dzahn) 05Open→03In progress
[22:14:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/31862/skins.reading-web-staging.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/733081 (owner: 10Dzahn)
[22:18:08] <legoktm>	 cdanis: thanks!
[22:24:13] <wikibugs>	 (03CR) 10Jforrester: "Not to deploy until wmf.6 is everywhere and won't regress." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester)
[22:24:17] <wikibugs>	 (03PS5) 10Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932)
[22:25:29] <wikibugs>	 (03PS1) 10Dzahn: simplelamp2: ensure httpd::mpm comes before httpd, revert previous change [puppet] - 10https://gerrit.wikimedia.org/r/733086
[22:26:45] <wikibugs>	 (03CR) 10Dzahn: "not simple enough to make a simple class" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn)
[22:27:35] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31863/skins.reading-web-staging.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn)
[22:29:30] <wikibugs>	 (03CR) 10Dzahn: "@Jaime with this it works now :)" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn)
[22:30:53] <wikibugs>	 (03CR) 10Dzahn: "this should have fixed issue back from https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 now" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn)
[22:44:17] <wikibugs>	 (03PS1) 10Dzahn: simplelap: ensure httpd::mpm before mpm, set purge_manual_config => false [puppet] - 10https://gerrit.wikimedia.org/r/733087
[22:48:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "just like https://gerrit.wikimedia.org/r/c/operations/puppet/+/733086 and unused" [puppet] - 10https://gerrit.wikimedia.org/r/733087 (owner: 10Dzahn)
[22:55:08] <wikibugs>	 (03PS1) 10Jforrester: [BETA CLUSTER] Enable WikibaseLexeme Scribunto access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159)
[22:55:13] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:27] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:51] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] "Beta-Cluster only config change; let's do this today rather than have the train blow up next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159) (owner: 10Jforrester)
[23:10:33] <wikibugs>	 (03Merged) 10jenkins-bot: [BETA CLUSTER] Enable WikibaseLexeme Scribunto access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159) (owner: 10Jforrester)
[23:13:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:17] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Reedy) >list:member:digest:footer  ` _______________________________________________ $display_name mailing list -- $listname To unsubscribe send an email to ${short_listname}-leave@${domain}...
[23:17:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1285.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:18:56] <wikibugs>	 (03PS1) 10Dzahn: wikistats::httpd: support buster with PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/733091
[23:19:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "cloud VPS, not analytics" [puppet] - 10https://gerrit.wikimedia.org/r/733091 (owner: 10Dzahn)
[23:25:39] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:27:43] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:28:14] <mutante>	 hrmm ok
[23:28:30] <mutante>	 already had that wikitech page open and wondering
[23:37:16] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar)
[23:41:36] <wikibugs>	 (03PS1) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092
[23:42:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn)
[23:43:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "back in mailman2 you would have to add such a sender to a list of allowed_nonsubscriber or so to receive the mails on the list. keep in mi" [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar)
[23:43:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar)
[23:55:05] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:55:19] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state