[00:03:42] hmm, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#4149 looks a bit wrong
[00:04:24] (PS1) Zabe: Fix array declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/732840
[00:06:51] (PS2) Zabe: Fix array declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/732840 (https://phabricator.wikimedia.org/T197058)
[00:27:35] (PS1) Dzahn: simplelap: fix a typo introduced in a previous change [puppet] - https://gerrit.wikimedia.org/r/732842
[00:29:22] (CR) Dzahn: simplelap: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714155 (owner: RhinosF1)
[00:29:56] (CR) Dzahn: [C: +2] simplelap: fix a typo introduced in a previous change [puppet] - https://gerrit.wikimedia.org/r/732842 (owner: Dzahn)
[00:30:05] (PS1) Ahmon Dancy: docker: Mostly documentation updates [puppet] - https://gerrit.wikimedia.org/r/732844
[00:31:24] (PS2) Dzahn: simplelap: declare httpd::mpm explicitly and use prefork MPM [puppet] - https://gerrit.wikimedia.org/r/731184
[00:32:08] (CR) Dzahn: [C: +2] simplelap: declare httpd::mpm explicitly and use prefork MPM [puppet] - https://gerrit.wikimedia.org/r/731184 (owner: Dzahn)
[00:35:22] (CR) Dzahn: "tested in cloud VPS" [puppet] - https://gerrit.wikimedia.org/r/731184 (owner: Dzahn)
[02:56:31] Two tgr's?
[03:21:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:23:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:28] (CR) Ahmon Dancy: "Note: I have not tested this yet. I will tomorrow unless someone beats me to it. :-)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 (owner: Ahmon Dancy)
[04:05:37] PROBLEM - SSH on puppetmaster1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:20] (PS1) Marostegui: Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732807
[04:38:03] (CR) Marostegui: [C: +2] Revert "db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732807 (owner: Marostegui)
[04:38:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17575 and previous config saved to /var/cache/conftool/dbconfig/20211022-043845-root.json
[04:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:03] !log Deploy schema change on s8 codfw - T291719
[04:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:09] T291719: Remove abuse_filter_log.afl_filter column and adjust schema consequently from Wikimedia production - https://phabricator.wikimedia.org/T291719
[04:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17576 and previous config saved to /var/cache/conftool/dbconfig/20211022-045349-root.json
[04:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:51] RECOVERY - SSH on puppetmaster1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17577 and previous config saved to /var/cache/conftool/dbconfig/20211022-050852-root.json
[05:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17578 and previous config saved to /var/cache/conftool/dbconfig/20211022-052356-root.json
[05:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17579 and previous config saved to /var/cache/conftool/dbconfig/20211022-053900-root.json
[05:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17580 and previous config saved to /var/cache/conftool/dbconfig/20211022-055403-root.json
[05:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:05] (PS4) Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787)
[06:57:24] (PS5) Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211022T0700)
[07:07:43] (CR) Muehlenhoff: [C: +2] Add remaining ownership annotations for ML services [puppet] - https://gerrit.wikimedia.org/r/732268 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:08:29] (CR) Muehlenhoff: [C: +2] Add ownership annotations for Data Engineering services [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:09:56] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[07:11:09] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/732414 (owner: Dzahn)
[07:15:47] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:21] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) (owner: Legoktm)
[07:21:01] SRE, Observability-Logging, Traffic, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema)
[07:21:39] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:31] (CR) Filippo Giunchedi: [C: +1] Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[07:42:27] (CR) Filippo Giunchedi: [C: +1] profile: add enable_relay flag to statsd exporter profile [puppet] - https://gerrit.wikimedia.org/r/732827 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite)
[07:42:51] (CR) Filippo Giunchedi: [C: +1] prometheus: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[07:44:22] (CR) MMandere: [C: +2] prometheus: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[07:45:36] (PS1) Muehlenhoff: Set ganeti2025/2026 to insetup [puppet] - https://gerrit.wikimedia.org/r/732911 (https://phabricator.wikimedia.org/T282603)
[07:46:41] PROBLEM - ganeti-noded running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[07:46:47] PROBLEM - ganeti-confd running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[07:47:27] PROBLEM - ganeti-mond running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[07:49:45] (PS1) Ema: Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879)
[08:00:23] !log deployment-cache-text06: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879
[08:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:31] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[08:01:09] (CR) Muehlenhoff: [C: +2] Set ganeti2025/2026 to insetup [puppet] - https://gerrit.wikimedia.org/r/732911 (https://phabricator.wikimedia.org/T282603) (owner: Muehlenhoff)
[08:09:42] (CR) Jbond: [C: -1] "please hold off merging this i want to discuss it wit mortiz to see if/who we want to track the uid/gid" [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[08:14:32] (CR) Jbond: cumin: add an alias for new pki roles and add to misc-others (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:14:55] (PS3) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:14:59] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:21] (CR) jerkins-bot: [V: -1] cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:18:19] (PS4) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:18:34] mmhh I'll take a look at the grafana sync failure
[08:19:45] (CR) jerkins-bot: [V: -1] Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:20:02] hah, an rsync race while transferring the sqlite journal
[08:20:05] (PS5) Jbond: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:20:37] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[08:21:28] I'll file a task for now, we'll likely need rsync options in quickdatacopy I think to be able to tweak file selection
[08:23:48] !log cp3062: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879
[08:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:55] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[08:24:07] SRE, Observability-Metrics: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (fgiunchedi)
[08:24:10] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:47] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:10] (CR) Muehlenhoff: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[08:28:44] (CR) Ema: [V: +2 C: +2] Add debian/patches/0008-vsl_check_e_inval_assertion.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/732913 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:31:44] (PS1) MMandere: exim: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787)
[08:34:16] (PS1) Majavah: Use most specific prefix for dns record site assignment [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082)
[08:36:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS buster
[08:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:09] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS buster
[08:36:36] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[08:44:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:42] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:51:51] (CR) Btullis: [C: +2] Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:55:11] (PS1) Btullis: Remove the HDFS corrupt blocks check from Icinga [puppet] - https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399)
[08:55:23] (Merged) jenkins-bot: Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[08:56:24] joal: good news, those pendng compactions have started to drop on cassandra aqs1012-b. Not wedged after all.
[08:58:21] (PS1) Muehlenhoff: Prefer mx1001 over mx2001 for weights in MX records [dns] - https://gerrit.wikimedia.org/r/732924
[08:59:19] (PS1) Ema: varnishttfb.mtail: use native histogram type [puppet] - https://gerrit.wikimedia.org/r/732925 (https://phabricator.wikimedia.org/T293879)
[09:00:54] Wrong channel, sorry.
[09:01:32] good news still sounds good ^^
[09:01:59] :)
[09:04:20] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS buster
[09:04:55] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS buster completed: - ganeti2025 (**PASS**) - Downtimed on Ici...
[09:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:02] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:11:12] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/ResubmitChanges.php wikidatawiki --minimum-age $((60*60*12)) # T294029
[09:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:18] T294029: Run ResubmitChanges.php to resubmit stuck changes from 2021-10-21 14:26 UTC - https://phabricator.wikimedia.org/T294029
[09:18:52] (CR) Cathal Mooney: [C: +1] "LGTM but best wait for volans to comment." [software/netbox-extras] - https://gerrit.wikimedia.org/r/732919 (https://phabricator.wikimedia.org/T294082) (owner: Majavah)
[09:20:39] (CR) Muehlenhoff: Add ownership annotations for WMCS services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:25:13] (CR) Filippo Giunchedi: [C: +1] "LGTM, very nice!" [puppet] - https://gerrit.wikimedia.org/r/732925 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[09:25:43] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ayounsi) Thanks! > I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no? That's one...
[09:27:00] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/732924 (owner: Muehlenhoff)
[09:32:14] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: Legoktm)
[09:35:59] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:01:17] (PS1) Jbond: cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930
[10:03:36] (CR) Jbond: [C: -1] Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[10:05:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2026.codfw.wmnet with OS buster
[10:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:15] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster
[10:16:18] (CR) Giuseppe Lavagetto: [C: +2] httpbb: change headers test for /static/current [puppet] - https://gerrit.wikimedia.org/r/732280 (https://phabricator.wikimedia.org/T285298) (owner: Giuseppe Lavagetto)
[10:27:24] (PS1) Elukey: kserve: add network policies [deployment-charts] - https://gerrit.wikimedia.org/r/732939
[10:33:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2026.codfw.wmnet with OS buster
[10:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:09] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster completed: - ganeti2026 (**PASS**) - Downtimed on Ici...
[10:33:37] (PS2) Jbond: cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930
[10:33:52] (CR) Jbond: [V: +2 C: +2] cas 6.4.2: exlucde tomcat-embed-el-9.0.52.jar [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732930 (owner: Jbond)
[10:36:11] (PS2) Majavah: debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288
[10:37:27] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) a:aborrero→ayounsi >>! In T289882#7450242, @ayounsi wrote: > Which means increasing our attack surface as well as...
[10:40:22] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:40:47] (PS1) Jbond: changelog: fix distro [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/732941
[10:46:53] !log upload cas_6.4.2-1+wmf10u1
[10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:52] (CR) Btullis: "Thanks both." [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[10:48:55] (PS2) Lucas Werkmeister (WMDE): Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828)
[10:53:16] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:48] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:00:28] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:15] (CR) Jbond: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[11:08:41] (PS1) Jbond: Revert "P:idp::standalon: remove unused profile" [puppet] - https://gerrit.wikimedia.org/r/732812
[11:08:51] (CR) Jbond: [V: +2 C: +2] Revert "P:idp::standalon: remove unused profile" [puppet] - https://gerrit.wikimedia.org/r/732812 (owner: Jbond)
[11:12:22] (CR) Lucas Werkmeister (WMDE): Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:16:38] (PS1) Jbond: P:idp::standalon: switch to P:base::production [puppet] - https://gerrit.wikimedia.org/r/732945
[11:17:13] (CR) Jbond: [C: +2] P:idp::standalon: switch to P:base::production [puppet] - https://gerrit.wikimedia.org/r/732945 (owner: Jbond)
[11:19:52] (CR) Michael Große: [C: +1] Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE))
[11:29:35] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (EChetty) @Dzahn Hey Dan! Thanks for setting this up. However it seems that some of the sites are not authenticating me correctly when I try to access them: For eg. When I...
[11:34:06] SRE, Cassandra, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability): Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966 (Aklapper)
[11:34:08] (CR) Arturo Borrero Gonzalez: [C: +1] debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:34:18] (CR) Majavah: [C: +2] debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:35:36] (Merged) jenkins-bot: debian: drop help2man [software/tools-manifest] - https://gerrit.wikimedia.org/r/732288 (owner: Majavah)
[11:36:31] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm::kubectl: install kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/732747 (owner: Majavah)
[11:39:15] (PS1) Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604)
[11:39:17] (PS1) Lucas Werkmeister (WMDE): Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604)
[11:39:19] (PS1) Lucas Werkmeister (WMDE): Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604)
[11:40:16] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:40:32] (CR) MMandere: [C: +2] exim: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732917 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[11:43:38] (PS1) Btullis: Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641)
[11:44:28] (PS1) Majavah: kubectl::kubeadm: make kubectl-sudo executable [puppet] - https://gerrit.wikimedia.org/r/732953
[11:49:16] (CR) Lucas Werkmeister (WMDE): "I’m not sure if I want to deploy this on Monday or wait longer, but putting it up for review already." [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:49:21] (CR) Btullis: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[11:50:15] (CR) Lucas Werkmeister (WMDE): "This feels like a riskier change than others, because it touches a MediaWiki core setting – I didn’t find any other uses of this lock mana" [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:53:24] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:55:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:27] Puppet, Infrastructure-Foundations, observability, Patch-For-Review, User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (fgiunchedi) @jbond IIRC for this we went the logstash way, anything else to be done and/or missing ?
[11:58:15] Puppet, Infrastructure-Foundations, observability, Patch-For-Review, User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (jbond) Open→Resolved a:jbond Thats correct all though it still a work in progress, however this one i thin...
[12:00:37] (PS1) MMandere: ntp: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787)
[12:01:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:18] (CR) Btullis: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[12:02:25] (PS2) Btullis: Remove all remaining references to alluxio [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641)
[12:03:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:04:44] (CR) Arturo Borrero Gonzalez: [C: +2] kubectl::kubeadm: make kubectl-sudo executable [puppet] - https://gerrit.wikimedia.org/r/732953 (owner: Majavah)
[12:05:25] SRE, observability, User-jbond: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (fgiunchedi) Open→Resolved a:fgiunchedi Closing as we're in a good place nowadays ` root@prometheus1004:~# apache2ctl graceful root@prometheus1004:~# `
[12:15:05] (CR) MMandere: [C: +2] ntp: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732954 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[12:15:09] (PS1) MVernon: codfw-prod: final weight to ms-be20[62-65] [software/swift-ring] - https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458)
[12:23:27] (PS3) Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: Eric Gardner)
[12:25:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:05] (PS1) MMandere: grafana: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/732959 (https://phabricator.wikimedia.org/T282787)
[12:29:50] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [software/swift-ring] - https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458) (owner: MVernon)
[12:31:32] (CR) Jbond: [C: +1] add centrallog2002 to codfw anycast_neighbors and syslog fw allows [homer/public] - https://gerrit.wikimedia.org/r/731828 (https://phabricator.wikimedia.org/T292196) (owner: Herron)
[12:31:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:03] (CR) Jbond: [C: +1] upgrade-varnish: support frontend instance only [cookbooks] - https://gerrit.wikimedia.org/r/731935 (owner: Ema)
[12:32:36] (CR) Jbond: [C: +1] drmrs initial prep [homer/public] - https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[12:32:50] (PS1) Arturo Borrero Gonzalez: cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086)
[12:33:06] (CR) Jbond: [C: +1] Add drmrs network to monitoring [puppet] - https://gerrit.wikimedia.org/r/732351 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[12:33:16] (CR) Jbond: [C: +1] Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[12:34:33] (PS2) Arturo Borrero Gonzalez: cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086)
[12:35:31] (PS3) Jbond: O:puppetboard::ng: add new role [puppet] - https://gerrit.wikimedia.org/r/732368
[12:35:34] (CR) Muehlenhoff: Remove all remaining references to alluxio (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[12:36:36] (CR) Jbond: [C: +2] O:puppetboard::ng: add new role [puppet] - https://gerrit.wikimedia.org/r/732368 (owner: Jbond)
[12:36:44] SRE, SRE Observability, observability, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi) Open→Resolved a:fgiunchedi Resolving since graphite failover nowadays is much better and documented at https://wikitech.wiki...
[12:36:53] SRE, SRE Observability, observability, Documentation, Graphite: document graphite failover/backfill procedures - https://phabricator.wikimedia.org/T102575 (fgiunchedi) Open→Resolved a:fgiunchedi This is done https://wikitech.wikimedia.org/wiki/Graphite#Operations_manual
[12:37:02] SRE, WMDE-Analytics-Engineering, Graphite, Patch-For-Review, Tracking-Neverending: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (fgiunchedi)
[12:38:56] SRE, Observability-Metrics, observability: grafana access control - https://phabricator.wikimedia.org/T108546 (fgiunchedi) Open→Declined Resolving this as we're moving away from Graphite
[12:39:46] (CR) MVernon: [V: +2 C: +2] codfw-prod: final weight to ms-be20[62-65] [software/swift-ring] - https://gerrit.wikimedia.org/r/732957 (https://phabricator.wikimedia.org/T288458) (owner: MVernon)
[12:40:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:03] SRE, Observability-Metrics, observability: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821 (fgiunchedi) Open→Resolved a:fgiunchedi This happened as part of {T247963} where we recreated whisper files on the reimaged hosts
[12:40:52] SRE, serviceops-radar, Release Pipeline (Blubber), Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (hashar)
[12:41:04] (PS1) Jbond: P:puppetboard::ng: use P:base:::production [puppet] - https://gerrit.wikimedia.org/r/732961
[12:42:15] (CR) Jbond: [C: +2] P:puppetboard::ng: use P:base:::production [puppet] - https://gerrit.wikimedia.org/r/732961 (owner: Jbond)
[12:44:41] SRE, Observability-Alerting: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422 (fgiunchedi) Open→Invalid Tentatively resolving since we're moving away from icinga-based timeseries alerts and onto Alertmanager. For the latter the lack of a dashb...
[12:44:51] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: nfs: refresh exclude pattern for nagios check_disk [puppet] - https://gerrit.wikimedia.org/r/732960 (https://phabricator.wikimedia.org/T294086) (owner: Arturo Borrero Gonzalez)
[12:46:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:04] SRE: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122 (fgiunchedi) -observability for backlog cleanup, unclear whether we want/need this
[12:49:42] (CR) Michael Große: [C: +1] Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[12:51:11] (CR) Michael Große: [C: +1] Remove dispatchChanges.php-related Wikibase settings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[12:51:25] SRE, Observability-Alerting, Patch-For-Review, Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (fgiunchedi) Open→Resolved a:fgiunchedi Given than nowadays all Grafana alerts show up at https://alerts.wikimedia.org and...
[12:51:29] (CR) Michael Große: [C: +1] Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[12:51:34] (PS1) Jbond: O:puppetboard::ng: Add config for cfssl cert [puppet] - https://gerrit.wikimedia.org/r/732963
[12:53:10] (CR) Jbond: [C: +2] O:puppetboard::ng: Add config for cfssl cert [puppet] - https://gerrit.wikimedia.org/r/732963 (owner: Jbond)
[12:53:32] SRE, Observability-Logging, Traffic, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) >>! In T293879#7450109, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations),...
[12:54:36] (03PS1) 10Jbond: O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964 [12:54:48] (03CR) 10Jbond: [C: 03+1] O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964 (owner: 10Jbond) [12:54:53] (03CR) 10Jbond: [C: 03+2] O:puippetboard:ng: fix typo, label [puppet] - 10https://gerrit.wikimedia.org/r/732964 (owner: 10Jbond) [12:56:01] 10SRE, 10Icinga, 10SRE Observability, 10observability: icinga really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Implemented at https://gerrit.wikimedia.org/r/c/operations/alerts/+/710248 [12:57:28] (03PS1) 10Jbond: O:puppetboard::ng: fix typo cfssl vs ssl [puppet] - 10https://gerrit.wikimedia.org/r/732965 [12:58:00] 10SRE, 10SRE Observability, 10observability, 10Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (10fgiunchedi) 05Open→03Invalid No longer the case, graphite hosts (Bullseye) come up fine after a reboot nowadays [12:58:08] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Reedy) Might be worth double checking the wikitech-l and mediawiki-l footers too... [13:00:27] (03CR) 10Jbond: [C: 03+2] O:puppetboard::ng: fix typo cfssl vs ssl [puppet] - 10https://gerrit.wikimedia.org/r/732965 (owner: 10Jbond) [13:01:58] 10SRE, 10Contributors-Team, 10observability, 10Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10hashar) I have filled that one as part of an incident followup task but #release-engineering-team is... 
[13:02:14] (03PS1) 10Urbanecm: Deploy Growth mentor dashboard to phase II wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732967 (https://phabricator.wikimedia.org/T278920) [13:05:29] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/732959 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:07:31] (03PS1) 10Hashar: zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) [13:12:04] (03CR) 10Herron: [C: 03+1] Prefer mx1001 over mx2001 for weights in MX records [dns] - 10https://gerrit.wikimedia.org/r/732924 (owner: 10Muehlenhoff) [13:16:52] (03PS2) 10Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) [13:16:54] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmg variables for dispatchChanges.php Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732950 (https://phabricator.wikimedia.org/T292604) [13:16:56] (03PS2) 10Lucas Werkmeister (WMDE): Remove wikibaseDispatchRedisLockManager config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732951 (https://phabricator.wikimedia.org/T292604) [13:16:58] (03PS1) 10Lucas Werkmeister (WMDE): Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) [13:17:00] (03CR) 10Lucas Werkmeister (WMDE): Remove dispatchChanges.php-related Wikibase settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [13:18:27] (03PS1) 10Ema: Use ats-tls metrics for edge traffic drop alert [alerts] - 10https://gerrit.wikimedia.org/r/732970 (https://phabricator.wikimedia.org/T293879) [13:25:33] 10SRE, 
10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10fgiunchedi) [13:27:52] (03PS1) 10Zabe: Fix some easy codestyle issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732971 [13:30:28] !log upload python3-pypuppetdb_2.4.0-1_all.deb to bullseye [13:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] 10SRE, 10Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10fgiunchedi) [13:39:40] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - 10https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:39:51] (03CR) 10Btullis: [V: 03+2 C: 03+2] Remove unused dummy keytabs and an SSH key for alluxio [labs/private] - 10https://gerrit.wikimedia.org/r/732952 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:41:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've gone ahead and updated https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters adding a lot of information about the cluster plus se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [13:42:10] !log deployment-cache-upload06: restart varnish-frontend, package got upgraded to 6.0.8 T294116 [13:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:17] T294116: Varnish reload failing on deployment-cache-upload06 - https://phabricator.wikimedia.org/T294116 [13:47:15] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Varnish reload failing on deployment-cache-upload06 - https://phabricator.wikimedia.org/T294116 (10ema) 05Open→03Resolved a:03ema I upgraded varnish to 6.0.8 everywhere (see T292290) and forgot about restarting the service on deployment-cache-upload06. I... 
[13:48:29] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [13:49:28] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:50:19] (03PS2) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 [13:51:30] (03PS1) 10Hashar: zuul: gracefully shutdown [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040) [13:54:03] PROBLEM - puppetboard on puppetboard1002 is CRITICAL: connect to address 10.64.48.59 and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [13:58:31] (03CR) 10Michael Große: [C: 03+1] Remove dispatchLagToMaxLagFactor Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732969 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [13:59:14] (03CR) 10Michael Große: [C: 03+1] Remove dispatchChanges.php-related Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732949 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [14:00:15] (03CR) 10Filippo Giunchedi: [C: 03+1] Use ats-tls metrics for edge traffic drop alert [alerts] - 10https://gerrit.wikimedia.org/r/732970 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [14:00:30] (03CR) 10Btullis: Remove all remaining references to alluxio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:12:37] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [14:12:37] (03CR) 10Michael Große: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [14:12:49] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [14:16:51] PROBLEM - puppetboard on puppetboard2002 is CRITICAL: connect to address 10.192.32.30 and port 8001: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [14:21:47] (03PS3) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) [14:24:50] (03PS4) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) [14:25:44] (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [14:32:33] (03CR) 10Lucas Werkmeister (WMDE): Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [14:40:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) Try visiting https://idp.wikimedia.org/logout and then logging back in? 
[14:45:01] (03PS4) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) [14:54:38] (03CR) 10Ahmon Dancy: "No changes reported by PCC" [puppet] - 10https://gerrit.wikimedia.org/r/732844 (owner: 10Ahmon Dancy) [14:57:00] (03PS1) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) [15:06:35] !log upload puppetboard_3.1.0-1_all.deb to bullseye-wikimedia [15:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:41] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:33] (03PS1) 10Majavah: puppetmaster::gitsync: Replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/732991 (https://phabricator.wikimedia.org/T273673) [15:25:17] (03PS1) 10Btullis: Add three more HDFS related checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399) [15:31:35] (03PS5) 10Elukey: kserve: add network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) [15:32:25] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Aklapper) * https://lists.wikimedia.org/postorius/lists/mediawiki-l.lists.wikimedia.org/templates is empty. * https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/templ... 
[15:40:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:17] (03PS1) 10Hashar: zuul: double git-daemon max connections 48 -> 96 [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) [15:50:24] (03CR) 10Hashar: "We have bumped the limit two years ago ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/508408 ). While looking at the log today we " [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [15:53:38] (03PS1) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) [15:54:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31853/console" [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [15:54:59] (03PS2) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) [15:58:09] (03PS3) 10Jbond: P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) [15:58:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31855/console" [puppet] - 10https://gerrit.wikimedia.org/r/733002 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [15:59:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetdb: update puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/733002 
(https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [16:01:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10EChetty) @CDanis Hey Chris, tried that since it seemed to work for Luke - but no dice :( Also tried flushing my cookies/cache, changing browser and the good ol' turning o... [16:10:49] (03PS1) 10Cwhite: logstash: bugfix logstash logEvent json encoding [puppet] - 10https://gerrit.wikimedia.org/r/733007 [16:25:04] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) [16:25:12] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) [16:25:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:36] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) [16:26:12] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) a:03Papaul [16:31:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:29] (03PS9) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) [16:32:48] (03PS8) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) [16:32:56] (03CR) 10jerkins-bot: [V: 04-1] role: add logging::opensearch::data role [puppet] - 
10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:33:05] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) [16:33:07] (03CR) 10jerkins-bot: [V: 04-1] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:33:15] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10RobH) [16:33:36] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH) [16:33:52] (03PS9) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) [16:34:02] (03PS10) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) [16:34:04] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH) [16:34:21] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10RobH) a:03Jclark-ctr [16:37:24] (03CR) 10Jgreen: [C: 03+2] Add frpm1002, frauth1002, pay-lvs1003, pay-lvs1004 [dns] - 10https://gerrit.wikimedia.org/r/732834 (https://phabricator.wikimedia.org/T289812) (owner: 10Dwisehaupt) [16:40:23] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:48:17] (03PS5) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 
10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) [16:48:19] (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [16:49:01] (03PS6) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) [16:49:03] (03CR) 10Michael Große: Regularly resubmit changes that might be stuck in wb_changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [16:53:31] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:02:09] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10Majavah) See also: {T294034} [17:02:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[56] - https://phabricator.wikimedia.org/T293909 (10RobH) [17:04:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10RobH) [17:04:53] (03CR) 10Dzahn: [C: 03+2] zuul: double git-daemon max connections 48 -> 96 [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [17:06:56] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (10Majavah) 05Open→03Resolved This was done at some point. 
[17:09:37] (03CR) 10Herron: [C: 03+1] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:10:01] (03CR) 10Hashar: "Moritz may you review the systemd magic? Pretty sure you have more experience than me on that regard ;) No urgency, the task has been ar" [puppet] - 10https://gerrit.wikimedia.org/r/732978 (https://phabricator.wikimedia.org/T257040) (owner: 10Hashar) [17:10:42] (03CR) 10Brennen Bearnes: [C: 03+1] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn) [17:11:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) Ah, I see the problem, you weren't added to the `wmf` LDAP group. I've added you -- try https://idp.wikimedia.org/logout and then try again please? [17:11:45] (03CR) 10Hashar: "Danke Schon!" [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [17:12:07] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10RobH) [17:12:26] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10RobH) a:03Papaul [17:14:00] (03CR) 10Hashar: "That should not add any spam to our list, the cron job never errored out and the other one is for Zuul smtp reporter which is not used." [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [17:19:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10EChetty) @CDanis Amazing -> That plus a cookie flush seemed to do the trick :) Thank you! 
[17:19:42] (03PS1) 10Majavah: scap: Use service name for logstash-beta [puppet] - 10https://gerrit.wikimedia.org/r/733023 [17:25:07] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:35] (03CR) 10Cwhite: [C: 03+2] logstash: bugfix logstash logEvent json encoding [puppet] - 10https://gerrit.wikimedia.org/r/733007 (owner: 10Cwhite) [17:26:20] (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/732827 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [17:28:40] (03CR) 10Herron: [C: 03+1] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:29:32] (03CR) 10Herron: [C: 03+1] profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:31:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:54] (03CR) 10Herron: [C: 03+1] hiera: add minimal logstash-beta-next hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:36:17] (03CR) 10Herron: [C: 03+1] logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:44:50] (03PS1) 10AOkoth: gitlab: add data for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025 [17:55:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Dwisehaupt) 05Resolved→03Open @Cmjohnson I believe the network config was swapped for all the 
hosts. When attempting to build them I see that the pay-l... [17:56:31] (03PS2) 10AOkoth: gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025 [17:59:40] (03PS3) 10AOkoth: gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025 [18:00:16] (03CR) 10Dzahn: [C: 03+2] gitlab: add default values for cloud test vms [puppet] - 10https://gerrit.wikimedia.org/r/733025 (owner: 10AOkoth) [18:15:00] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (10Dzahn) [18:24:45] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) It is possible to get the... [18:25:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:36] (03CR) 10Dzahn: zuul: double git-daemon max connections 48 -> 96 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/733000 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [18:31:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:35] (03CR) 10Dzahn: [C: 03+2] cumin: add an alias for new pki roles and add to misc-others [puppet] - 10https://gerrit.wikimedia.org/r/732425 (owner: 10Dzahn) [18:32:05] (03CR) 10Dzahn: [C: 03+2] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn) [18:34:08] (03CR) 10Dzahn: "added to "misc-ops" but too late to edit the commit message" 
[puppet] - 10https://gerrit.wikimedia.org/r/732425 (owner: 10Dzahn) [18:40:27] (03PS3) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 [18:43:56] (03CR) 10Dzahn: [C: 03+2] scap: Use service name for logstash-beta [puppet] - 10https://gerrit.wikimedia.org/r/733023 (owner: 10Majavah) [18:45:49] (03PS3) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) [18:45:51] (03CR) 10Ebernhardson: query_service: Add new oauth related configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [18:46:09] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Krinkle) [18:46:26] (03CR) 10jerkins-bot: [V: 04-1] query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [18:51:05] (03PS4) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 [18:51:42] (03CR) 10Dzahn: [C: 03+2] cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 (owner: 10Dzahn) [18:55:15] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:56:00] (03PS2) 10Dzahn: simplelamp2: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731183 [18:56:09] (03PS1) 10Ahmon Dancy: thumbor: Remove conditionalization for stretch [puppet] - 10https://gerrit.wikimedia.org/r/733033 [19:07:43] (03PS4) 10Ebernhardson: query_service: 
Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) [19:08:23] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:11:46] ignoring that based on the word "test" in it [19:14:49] (03PS1) 10Accraze: ml-services: add enwiki-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/733034 (https://phabricator.wikimedia.org/T294141) [19:17:00] !log Start server-side upload of 1 video file (T294134) [19:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:07] T294134: Please upload a 556 MB video file to Wikimedia Commons - https://phabricator.wikimedia.org/T294134 [19:24:21] (03CR) 10Legoktm: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [19:38:19] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:39:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:39:57] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:25] (03PS3) 10Urbanecm: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) [19:46:35] (03CR) 10jerkins-bot: [V: 04-1] Connect foundationwiki to 
SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [19:46:42] (03PS4) 10Urbanecm: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) [19:51:29] re: router alerts - those already have comments about existing Telia trouble tickets [19:51:54] and Telia just mailed a couple hours ago that they saw a flap and are keeping an eye on it ..roughly [19:55:55] ACKNOWLEDGEMENT - puppetboard on puppetboard1002 is CRITICAL: connect to address 10.64.48.59 and port 8001: Connection refused daniel_zahn reimaged per SAL - no ticket though https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [19:55:55] ACKNOWLEDGEMENT - puppetboard on puppetboard2002 is CRITICAL: connect to address 10.192.32.30 and port 8001: Connection refused daniel_zahn reimaged per SAL - no ticket though https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [19:56:20] mutante: https://github.com/wikimedia/puppet/commit/2d1741b2d12935b89c9800b3c5ece38df8e0b223#diff-b2ce9b71fdce7711edb9ccfeb1d69e9974a469bf5d5f7687e65598aa49e9ba8b [19:57:29] Spookreeeno: ACK, thanks. it doesn't have a ticket though [19:57:38] Nope [19:58:41] John is [19:58:52] Probably gone for weekend now [20:02:54] if you are talking about me i am on site now if anything is needed [20:03:18] jclark-ctr: the other John:) thank you very much, we are good [20:06:25] 10SRE, 10ops-eqiad: eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Jclark-ctr) Cable has been run shows link. netbox has not been updated yet #2009 15m. pp219588361 <-> to cr1-eqiad:xe-3/0/6. 
[20:09:36] reports of timeouts from a few users on Discord [20:10:13] oh, I'm not the only one [20:10:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:09] phab isn't loading so can't create a task [20:11:18] dontpanic: feel free to PM, I'll relay [20:12:15] mutante: are SREs on the issue? Or should I use klaxon for the first time? :D [20:13:31] urbanecm: things are working for me but it's suspicious that we see that comment on the eqiad patch right before? [20:13:42] no, we have not been paged [20:13:57] as i said, there are user reports. And NEL reports in logstash also went up significantly. [20:14:17] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 64 probes of 711 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:14:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:14:31] mutante: ^ [20:14:53] can't reproduce [20:15:04] XioNoX: users report issues right after a cable was patched in Eqiad but things work for me [20:15:17] https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 is what I'm looking at btw [20:15:20] jclark-ctr: are you working with someone on that cable thing?
[20:15:32] Seem fine in UK [20:15:38] Tried enwiki + meta [20:15:54] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [20:15:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 247 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:16:07] that's new [20:16:09] the RIPE map looks a lot like the last Telia problem [20:16:09] Can't load any wiki in Canada [20:16:11] looking [20:16:13] urbanecm: we paged [20:16:14] the logstash dashboard shows US and BR as most affected [20:16:26] legoktm: about a week old I think. Chris did it [20:16:27] (03PS1) 10BBlack: Revert "ntp: Add drmrs DC site" [puppet] - 10https://gerrit.wikimedia.org/r/733040 (https://phabricator.wikimedia.org/T282787) [20:16:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service,netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:47] Spookreeeno: I don't understand the "we paged" comment [20:16:52] Telia had issues again [20:16:57] urbanecm: a page just went off [20:17:15] Not sure why I said we as not me obviously [20:17:37] (03CR) 10BBlack: [C: 03+2] Revert "ntp: Add drmrs DC site" [puppet] - 10https://gerrit.wikimedia.org/r/733040 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [20:17:38] what DC are people getting timeouts from? [20:17:45] what kind of timeouts? 
[20:17:46] Telia just mailed us again [20:17:50] legoktm: eqiad [20:17:56] "Suspected Cable fault in St Louis and your circuits are affected " [20:18:06] me, personally (see the private channel for my basic info) [20:18:17] it takes a few minutes for everything to re-converge, even when the link goes down cleanly in an obvious way [20:18:30] I am at eqiad about to leave just want to check if anything is needed at eqiad. [20:18:44] bblack: is your drmrs revert in response to the alert, or is that unrelated? [20:18:51] rzl: unrelated [20:18:53] thanks [20:19:12] [puppet's broken on some core dns/ntp servers from the change I'm reverting] [20:19:18] Gerrit issues from EU/UK [20:19:20] we can temporarily depool eqiad I suppose [20:19:36] ftr, got https://phabricator.wikimedia.org/P17584 from one of the affected users [20:19:48] (dontpanic, to be precise) [20:19:58] depooling eqiad would only make sense if Telia's still erroneously advertising our prefixes with the link to them dead [20:20:01] I'd paste to phab but I can't connect :) [20:20:03] I'm having trouble understanding the logstash dashboard.. what is "NELs by server IP"? is it "where clients are failing to connect" or "where we received the reports"? 
[20:20:04] otherwise we just have to wait for converge [20:20:08] I can provide broken and working traceroutes from Germany if you want [20:20:11] (but not on Phab either ^^) [20:20:29] greg-g: I'm happy to act as a relay if needed :D [20:20:34] :) [20:20:41] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 295 probes of 705 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:20:57] it's Telia wave between eqiad and codfw but that started a while before we started getting the NEL reports [20:21:00] majavah, NEL is https://wikitech.wikimedia.org/wiki/Network_Error_Logging [20:21:20] AntiComposite: I know, the dashboard is just confusing [20:21:31] oh I thought we were talking about transit fail? [20:21:52] I think multiple things are having issues? [20:22:05] bblack: "Suspected Cable fault in St Louis and your circuits are affected" [20:22:07] well, there's users seeing timeouts to eqiad [20:22:25] and that [20:22:52] I can't even get DNS for toolforge tools [20:23:03] SAL is not resolved for me [20:23:39] (03PS1) 10BBlack: Depool eqiad temporarily [dns] - 10https://gerrit.wikimedia.org/r/733043 [20:24:11] https://phabricator.wikimedia.org/P17585 is from greg-g [20:24:25] it...appears to reach xe-0-1-4.cr2-eqord.wikimedia.org ? [20:24:26] I didn't get page but saw the irc tag, getting my laptop [20:24:50] I can't verify what urbanecm says but I did send him my info ;) [20:25:18] is there a TLDR? 
[20:25:23] (klaxon doesn't list any recent pages, iirc it usually does for pages from alerting) [20:25:26] XioNoX: Telia outage [20:25:31] (fiber cut) [20:25:51] mtr is running and I'll keep an eye on it for recovery/changes [20:26:12] impact appears roughly the same as the Telia outage earlier this month [20:26:38] telia interface in eqiad is up, should I take BGP down (still catching up) [20:27:34] XioNoX: Telia reported a new cut and IC-313592 and IC-314534 eqiad -codfw [20:27:36] https://phabricator.wikimedia.org/P17586 is from lucaswerkmeister, ftr. [20:28:47] the transport one (IC-314534) the interface appears to be down, so that's good at that level [20:29:15] mutante: ok, so that's ulsfo-eqord and eqord-eqiad [20:29:34] sorry eqord-codfw, no eqiad [20:30:32] XioNoX: about 50 minutes ago we had Icinga alerts about cr3-ulsfo, cr2-eqord and cr2-codfw [20:30:35] yeah the eqord-eqiad one seems like it's down on both ends, so it must be saturation elsewhere causing issues? [20:30:52] the exact alerts that already had comments on Icinga for ongoing Telia issue [20:31:04] also: I have a patch to dns-depool eqiad, but I'm not clear yet if that will improve the situation or just move problems around [20:31:09] the NEL and user report did not start until a while after that [20:31:14] any informed opinion? [20:31:43] no saturation on our side at least [20:31:51] but it's confusing where the issue is exactly [20:32:05] IIRC weren't we already coming close to capacity on eqiad<-->codfw?
I think we don't want to go over that [20:32:26] saturation on transport links shouldn't be causing these flavors of NELs nor the RIPE Atlas alert [20:32:28] yeah but that shouldn't affect users reaching our edge [20:32:35] something else is going on [20:32:54] legoktm: we're fine on the codfw-eqiad especially if it's an emergency [20:33:07] the reported are about eqiad edge reachability, basically [20:33:22] (I mean in the case we have to depool, the codfw-eqiad links can handle it) [20:33:26] do we have a Telia transit there which is not on their list, but is affected-but-not-actually-down? [20:33:33] s/there/eqiad/ [20:34:12] fyi I reach eqiad through telia without any loss [20:34:23] * legoktm nods [20:34:35] XioNoX: I am suspicious of a return path issue [20:35:01] could be yeah, https://phabricator.wikimedia.org/P17586 might be a HE issue? [20:35:03] wth? ssh to bast1003 works fine [20:35:07] (or return of course) [20:35:10] XioNoX: are we still splitting VRRPs [20:35:26] cdanis: yep [20:35:33] 19:38 - 1 interface down on cr3-ulsfo, 19:39 - 2 interfaces down on cr2-eqord, 1 interface down on cr2-codfw. 19:50 Telia sends mail about new fiber cut 20:09 user reports on IRC [20:35:37] here we go [20:35:40] https://librenms.wikimedia.org/device/device=2/tab=port/port=11600/ [20:35:43] (03PS2) 10Ahmon Dancy: thumbor: Remove conditionalization for stretch [puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148) [20:35:54] ^ equinix peering in eqiad, there's a dropoff in traffic, probably from telia fiber cut impacting other peers there? [20:35:59] XioNoX: so to test return path issues I should try a mtr from a bunch of cache text hosts [20:36:02] maybe turn off the peering port for now? [20:37:00] XioNoX: sane theory above re: peering? [20:37:04] (03CR) 10Ahmon Dancy: [C: 03+1] "Another change which should only have effects in beta cluster (for a new host)." 
[puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [20:37:08] mutante: thx , I'd say the transport links are out of the possible problem so far, things re-rerouted internally [20:37:17] bblack: checking [20:38:34] bblack: does it match a trop of inbound traffic or increase of outbound somewhere else? [20:39:05] it's harder to see the smaller inbound side drop on the same interface (but I think it's there), but the outbound drop there is pretty dramatic. [20:39:45] I think we've got some peers over that exchange which we're still advertising in one or both directions with, but are affected by telia somehow and the peering traffic is borked. [20:40:15] ripe atlas probes: 62/711 failed to codfw v4 70/629 failed to codfw v6, 295/705 failed to eqiad v4, 249/622 failed to eqiad v6 [20:40:58] bblack: could be, disabling peering in eqiad [20:41:01] it's a return path issue for sure [20:41:20] !log disable sessions to equinix eqiad IXP [20:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:59] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 74, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:42:00] it's back [20:42:02] does equinix peering happen via cr1-eqiad ? [20:42:09] cdanis: cr2 [20:42:10] my mtr is happy now [20:42:17] that was the one that was working for me [20:42:20] I think the issue is deeper [20:42:51] greg-g: can you share your previous MTR? 
[20:42:55] (phab, mw.org etc all loading successfully for me now, just to be explicit) [20:42:59] gerrit works for me now [20:43:06] XioNoX: https://phabricator.wikimedia.org/P17585 I believe [20:43:16] my mtr also seems happy now fwiw [20:43:17] thanks paladox, users report things work now [20:43:17] Gerrit + Sal.toolforge fine for me now [20:43:23] I should have asked for a return MTR before taking the sessions down :) [20:43:26] ok, librenms + gerrit also working for me [20:43:32] XioNoX: I was trying to get one as you made the change :) [20:43:38] my home IP was affected [20:43:42] it's back up here [20:43:46] XioNoX: previous traceroute: https://phabricator.wikimedia.org/P17585 [20:43:49] cdanis: blame bblack :) [20:44:03] greg-g: yeah that's not a return path traceroute though [20:44:06] :P [20:44:09] the internet is asymmetric [20:44:17] yeah yeah, that's all I had :/ [20:44:24] ofc [20:44:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 55 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:44:30] not something you can easily get without router or shell access anyway [20:45:05] uh, it's back (the symptoms) [20:45:17] greg-g: PM me your home IP [20:45:17] can't connect to the eqiad lb again [20:45:20] greg-g: can you share your IP? [20:45:48] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [20:45:51] gerrit's down for me :/ [20:45:53] depooling eqiad in dns might alleviate user issues, but it might also rob us of evidence [20:46:04] bblack: go for it [20:46:06] ok [20:46:13] XioNoX: you did commit confirmed but didn't confirm [20:46:15] bblack: we can do specific mtr if needed [20:46:20] (03CR) 10BBlack: [C: 03+2] Depool eqiad temporarily [dns] - 10https://gerrit.wikimedia.org/r/733043 (owner: 10BBlack) [20:46:27] it auto rolled back [20:46:31] per console message on cr2-eqiad [20:47:06] heh, the exchange fix rolled back? [20:47:18] issues are back here too [20:47:24] either way, the dns change is already pushing, and takes ~10 minutes to come into full effect for all [20:47:28] previous tracert https://phabricator.wikimedia.org/P17584 [20:47:28] cdanis: er, yeah commiting for real [20:47:30] (TTL randomness) [20:47:33] Yeah gerrit gone here [20:47:40] is that the only reason it's back, because config change was rolled back? [20:47:42] 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH) [20:47:47] bblack: so we're still going ahead with the eqiad depool? 
[20:48:15] gerrit works for me now [20:48:19] we've waffled on it too long anyways, my vote is stick with it for now a verify manually that we understand problems [20:48:25] (the dns depool) [20:48:38] it will fix most of the impact, hopefully [20:48:44] tools like gerrit and icinga will still be affected [20:48:47] !log bblack has temporarily depooled eqiad https://gerrit.wikimedia.org/r/733043 [20:48:49] (mtr/etc are happy again) [20:48:50] 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH) [20:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:52] +1, having the sites up for most people seems preferred [20:48:59] Yep back for me now [20:49:16] 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10RobH) a:03Jclark-ctr [20:49:38] bah, imma stop using phab now cuz i forgot it spams into here during outage stuff. [20:49:42] it takes several minutes for most to see a real impact from the dns-level depool, so any immediate recoveries are probably from re-committing the exchange fix [20:50:00] ^ [20:50:11] plus, tools like gerrit and icinga will still be affected regardless of eqiad pooledness status [20:50:12] but still, we don't have a firm grip on the issue, the exchange hack could just be moving problems around [20:50:29] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 21 probes of 705 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:50:37] let's wait and see, between NELs and RIPE Atlas we can see if it is working [20:50:56] are we sure this has nothing to do with it? 
"eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (Jclark-ctr) Cable has been run shows link." [20:51:05] just cause that was just minutes before reports started [20:51:20] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&from=now-6h&to=now [20:51:32] according to RIPE Atlas, the issue started at 20:09 [20:51:40] bug post is at 20:06 [20:51:42] 👀 [20:51:51] oh? [20:51:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 42 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:51:53] could be, looking [20:52:11] heh [20:52:17] that would explain a lot of things! :) [20:52:20] hmm! [20:52:22] yeah, that's most likely it [20:52:23] the uh [20:52:28] the NEL reports start at 20:06 exactly [20:52:32] sooooo [20:52:48] and a few minutes of delay in RIPE Atlas's probe result processing is typical [20:52:51] the traffic dropoff on the exchange is from a new second link that isn't fully provisioned stealing half the traffic and breaking it [20:52:58] return traffic from cr1 doesn't try to reach cr2 anymore, and peers don't accept the packets [20:53:04] it's a 2nd IP on the same IXP [20:53:09] I saw that and then wondered if that is scheduled maintenance [20:53:27] XioNoX: that tracks -- cr1-eqiad sees my home IP as a black hole [20:53:58] and greg-g's [20:54:18] I disabled the interface on cr1, going to re-enabled the active on on cr2 [20:54:31] !log I disabled the interface on cr1, going to re-enabled the active on on cr2 [20:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:36] I already pinged jclark but I guess he is onsite but afk [20:54:45] do we need to call him to unplug the new cable? 
[20:54:52] legoktm: thx :) [20:55:00] mutante: he said earlier he was about to leave [20:55:02] mutante: no, I disabled it [20:55:06] ok and ok [20:55:28] let me know if anyone is having any issue anymore [20:56:08] (03PS1) 10BBlack: Revert "Depool eqiad temporarily" [dns] - 10https://gerrit.wikimedia.org/r/733049 [20:56:11] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 711 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:56:20] https://i.imgur.com/DSxHzlg.png :) :) :) [20:56:31] works for me again [20:56:54] No issues here XioNoX [20:57:03] OK for me to consider this incident resolved? [20:57:13] legoktm: yes [20:57:15] yeah, I need to revert the dns depool, but I think that's safe now [20:57:18] that was clearly the case [20:57:21] bblack: yep [20:57:34] I disabled the interface in Netbox as well [20:57:35] (03CR) 10BBlack: [C: 03+2] Revert "Depool eqiad temporarily" [dns] - 10https://gerrit.wikimedia.org/r/733049 (owner: 10BBlack) [20:57:39] thanks all [20:57:51] !log re-pooling eqiad in DNS [20:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:08] OK, can someone else who is more networking savvy take on figuring out action items? [20:58:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 46.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:59:07] ^ expected [20:59:07] it's probably going to be something more meta about being more aware/communicative/loud that we're plugging in new router ports and other such changes in general. [20:59:41] I think if we had all realized that change and its timestamp, we could've figured this out much faster. 
[21:00:16] there was a ticket update in this channel which should've clued us in, but I failed to notice it (mutante eventually brough it up, though!) [21:00:41] yeah, and the Telia stuff put us on the wrong path [21:00:59] yea, I marked the Telia related stuff as "unrelated" in the doc but did not remove it [21:01:00] 13:15:04 XioNoX: users report issues right after a cable was patched in Eqiad but things work for me [21:01:25] put a 20:06 line in there when it actually started [21:02:37] it's 11pm here so I'm going to log off if everything is stable again, I'll follow up next week with action items, at first sight it's a process issue [21:03:07] have a good Friday night [21:03:41] agreed, process issue [21:04:31] thanks everyone! [21:05:25] Have a good weekend all! [21:07:50] ahaha the VO alert just popped [21:08:02] fyi, I only got the victorops page now [21:08:15] for SREs responding because victorops just fired, the problem is resolved already, you can ignore it <3 [21:08:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 78.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:08:39] heh [21:08:48] yup, it shows in klaxon now too :) [21:08:59] add that to the list [21:09:11] ok thx rzl [21:09:20] XioNoX: yeah [21:09:22] because [21:09:29] our outbound connectivity from half of eqiad [21:09:31] was broken [21:09:33] 😔 [21:10:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:47] legoktm: I've added a few meta-AIs (things to investigate more deeply later) but now I have to run off to care for the baby [21:13:57] 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: (Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 
(10RobH) [21:14:05] 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH) [21:14:39] (03CR) 10Dzahn: [C: 03+2] simplelamp2: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [21:14:41] 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH) [21:15:09] 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10RobH) a:03Papaul [21:16:07] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:19] (03CR) 10Dzahn: "tested on existing user skins.reading-web-staging.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [21:21:31] (03CR) 10Dzahn: "done in simplelamp2 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/731183" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [21:21:52] (03CR) 10Dzahn: "@Jaime btw, for your info, this should be the fix for https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 as joe said there" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [21:23:30] 10SRE, 10ops-eqiad: eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10Dzahn) The interface has been disabled because this started a partial outage, which has been resolved now. 
[21:25:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10wiki_willy) Entries updated on the Accounting Spreadsheet to eliminate related Netbox errors [21:27:52] also Telia located the damage is now working on it [21:29:55] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 24.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:31:46] (03CR) 10Dzahn: ".. doesnt really fix it though" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [21:31:57] 10SRE, 10Performance Issue: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10jhsoby) I haven't noticed this happening lately. @Tholme, how about you? If you haven't noticed it for a while either, I think we can close this. [21:36:38] 10ops-codfw, 10DC-Ops: codfw Related Netbox Errors - https://phabricator.wikimedia.org/T294158 (10wiki_willy) [21:37:49] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:41:08] (03CR) 10Cwhite: [C: 03+2] profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:44:56] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:46:43] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10wiki_willy) [21:49:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10wiki_willy) [21:50:04] (03PS1) 10Dzahn: simplelamp2: add a notify->exec to restart apache before changing MPM [puppet] - 10https://gerrit.wikimedia.org/r/733081 [21:51:10] (03PS10) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) [21:52:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10wiki_willy) [21:52:48] (03CR) 10Cwhite: [C: 03+2] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:04:21] (03PS1) 10Dzahn: rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) [22:06:04] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [22:07:20] (03PS2) 10Dzahn: rsync::quickdatacopy: add option to exclude some files [puppet] - 10https://gerrit.wikimedia.org/r/733083 (https://phabricator.wikimedia.org/T294080) [22:09:17] 10SRE, 10Observability-Metrics, 10Patch-For-Review: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (10Dzahn) @fgiunchedi Yea, I think we can add a parameter to just pass through to rsync's --exclude parameter and then use that to 
ignore the file. Patch upload... [22:09:52] 10SRE, 10Observability-Metrics, 10Patch-For-Review: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (10Dzahn) p:05Triage→03Medium [22:10:36] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Dzahn) 05Open→03In progress [22:11:07] 10SRE, 10MediaWiki-extensions-TranslationNotifications, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.5; 2021-10-19): Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found - mediawiki_job_translationnotifications - https://phabricator.wikimedia.org/T293702 (10Dzahn) 05Open→03In progress [22:14:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/31862/skins.reading-web-staging.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/733081 (owner: 10Dzahn) [22:18:08] cdanis: thanks! [22:24:13] (03CR) 10Jforrester: "Not to deploy until wmf.6 is everywhere and won't regress." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [22:24:17] (03PS5) 10Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) [22:25:29] (03PS1) 10Dzahn: simplelamp2: ensure httpd::mpm comes before httpd, revert previous change [puppet] - 10https://gerrit.wikimedia.org/r/733086 [22:26:45] (03CR) 10Dzahn: "not simple enough to make a simple class" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn) [22:27:35] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31863/skins.reading-web-staging.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn) [22:29:30] (03CR) 10Dzahn: "@Jaime with this it works now :)" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn) [22:30:53] (03CR) 10Dzahn: "this should have fixed issue back from https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 now" [puppet] - 10https://gerrit.wikimedia.org/r/733086 (owner: 10Dzahn) [22:44:17] (03PS1) 10Dzahn: simplelap: ensure httpd::mpm before mpm, set purge_manual_config => false [puppet] - 10https://gerrit.wikimedia.org/r/733087 [22:48:40] (03CR) 10Dzahn: [C: 03+2] "just like https://gerrit.wikimedia.org/r/c/operations/puppet/+/733086 and unused" [puppet] - 10https://gerrit.wikimedia.org/r/733087 (owner: 10Dzahn) [22:55:08] (03PS1) 10Jforrester: [BETA CLUSTER] Enable WikibaseLexeme Scribunto access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159) [22:55:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:51] (03CR) 10Jforrester: [C: 03+2] "Beta-Cluster only config change; let's do this today rather than have the train blow up next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159) (owner: 10Jforrester) [23:10:33] (03Merged) 10jenkins-bot: [BETA CLUSTER] Enable WikibaseLexeme Scribunto access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733089 (https://phabricator.wikimedia.org/T294159) (owner: 10Jforrester) [23:13:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:17] 10SRE, 10Wikimedia-Mailing-lists: Multiple people with unsubscribe issues - https://phabricator.wikimedia.org/T294106 (10Reedy) >list:member:digest:footer ` _______________________________________________ $display_name mailing list -- $listname To unsubscribe send an email to ${short_listname}-leave@${domain}... [23:17:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:45] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1285.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:18:56] (03PS1) 10Dzahn: wikistats::httpd: support buster with PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/733091 [23:19:56] (03CR) 10Dzahn: [C: 03+2] "cloud VPS, not analytics" [puppet] - 10https://gerrit.wikimedia.org/r/733091 (owner: 10Dzahn) [23:25:39] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:27:43] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:28:14] hrmm ok [23:28:30] already had that wikitech page open and wondering [23:37:16] (03CR) 10Thcipriani: [C: 03+1] zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [23:41:36] (03PS1) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [23:42:07] (03CR) 10jerkins-bot: [V: 04-1] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 10Dzahn) [23:43:35] (03CR) 10Dzahn: [C: 03+1] "back in mailman2 you would have to add such a sender to a list of allowed_nonsubscriber or so to receive the mails on the list. 
keep in mi" [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [23:43:44] (03CR) 10Dzahn: [C: 03+2] zuul: use releng list rather than jenkins-bot for email [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [23:55:05] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:55:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state