[00:12:43] (03CR) 10BryanDavis: signup: allow blocking of username with regex (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [00:39:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918548 [00:39:27] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918548 (owner: 10TrainBranchBot) [00:58:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918548 (owner: 10TrainBranchBot) [01:43:58] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Dzahn) Fair enough. Though making this task should have been identical to "contacting the securit... [01:47:24] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Dzahn) I mean.. scrolling up in ticket history shows that more than one _previous_ security pers... 
[02:07:55] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:22:55] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [02:27:54] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:14] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [03:02:42] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:35:49] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10Papaul) I took a quick look at lvs2012, the server can ping 10.192.16.1 and 10.192.32.1 but the server can not ping 10.192.0.1 and 10... [04:57:40] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (cloudcontrol2001-dev), Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:07:46] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10jcrespo) [05:08:34] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10jcrespo) [05:09:04] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10jcrespo) ` ------------------------------------------------------------------------------- Record: 59 Date/Time: 05/10/2023 14:58:16 Source: sys... [05:24:14] (03PS1) 10KartikMistry: Update MinT to 2023-05-11-051736-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918627 [05:29:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [05:29:45] (03Merged) 10jenkins-bot: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [05:34:26] Has it already been reported that l10n-bot isn't approving its gerrit patches? (Or am I in the wrong channel?) 
[05:35:49] (03CR) 10Giuseppe Lavagetto: trafficserver: make mw-on-k8s use a config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: 10Giuseppe Lavagetto) [05:39:44] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-11-051736-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918627 (owner: 10KartikMistry) [05:42:32] (03Merged) 10jenkins-bot: Update MinT to 2023-05-11-051736-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918627 (owner: 10KartikMistry) [05:43:31] (03PS1) 10Marostegui: ProductionServices.php: Failover pc2 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918903 [05:44:07] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:44:15] (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/918904 [05:45:53] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:46:09] (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/918904 (owner: 10Marostegui) [05:46:46] kart_: Can I deploy mediawiki? 
[05:46:52] Well, wmf-config [05:47:59] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396 [05:48:03] T335396: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 [05:48:12] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396 [05:53:04] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:55:41] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Failover pc2 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918903 (owner: 10Marostegui) [05:55:54] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [05:56:25] (03Merged) 10jenkins-bot: ProductionServices.php: Failover pc2 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918903 (owner: 10Marostegui) [05:56:47] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [05:57:14] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:918903|ProductionServices.php: Failover pc2 codfw master]] [05:58:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 714 [05:58:45] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:918903|ProductionServices.php: Failover pc2 codfw master]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [05:59:53] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:918903|ProductionServices.php: Failover pc2 codfw master]] [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0600) [06:00:05] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0600). [06:00:44] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:01:26] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:918903|ProductionServices.php: Failover pc2 codfw master]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [06:03:46] (03PS1) 10Marostegui: Revert "pc2014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/918533 [06:05:42] !log Updated MinT to 2023-05-11-051736-production [06:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:07:35] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:918903|ProductionServices.php: Failover pc2 codfw master]] (duration: 07m 42s) [06:08:40] (03PS1) 10Marostegui: Revert "ProductionServices.php: Failover pc2 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918534 [06:11:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 714 [06:12:27] (03CR) 10Marostegui: [C: 03+2] Revert "pc2014: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/918533 (owner: 10Marostegui) [06:12:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:14:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [06:15:00] (03PS2) 10Slyngshede: mgmt module [software/bitu] - 10https://gerrit.wikimedia.org/r/918245 [06:15:43] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 13335 [06:17:58] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Failover pc2 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918534 (owner: 10Marostegui) [06:19:02] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Failover pc2 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918534 (owner: 10Marostegui) [06:21:13] !log Configure/reconfigure 1:1 NAT for new fr-tech hosts (frbast2002, frmon2002) - T336450 [06:21:13] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:918534|Revert "ProductionServices.php: Failover pc2 codfw master"]] [06:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:42] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:918534|Revert "ProductionServices.php: Failover pc2 codfw master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [06:23:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 17676 [06:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [06:24:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 17676 [06:25:22] (03PS1) 10Marostegui: 
pc1014: Make it pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/918921 [06:29:25] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:918534|Revert "ProductionServices.php: Failover pc2 codfw master"]] (duration: 08m 12s) [06:29:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 20940 [06:31:33] (03PS1) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868) [06:31:40] (03PS1) 10Marostegui: ProductionServices.php: Failover pc2 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918923 [06:31:42] (03CR) 10Marostegui: [C: 03+2] pc1014: Make it pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/918921 (owner: 10Marostegui) [06:32:42] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Failover pc2 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918923 (owner: 10Marostegui) [06:34:01] (03Merged) 10jenkins-bot: ProductionServices.php: Failover pc2 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918923 (owner: 10Marostegui) [06:34:31] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:918923|ProductionServices.php: Failover pc2 eqiad master]] [06:36:28] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:918923|ProductionServices.php: Failover pc2 eqiad master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [06:40:32] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) [06:42:55] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:918923|ProductionServices.php: Failover pc2 eqiad master]] (duration: 08m 23s) [06:43:58] (03PS1) 10Marostegui: Revert "ProductionServices.php: Failover pc2 eqiad master" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/918535 [06:44:16] (03PS1) 10Marostegui: Revert "pc1014: Make it pc2 master" [puppet] - 10https://gerrit.wikimedia.org/r/918536 [06:46:23] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:23] (03CR) 10Marostegui: [C: 03+2] Revert "pc1014: Make it pc2 master" [puppet] - 10https://gerrit.wikimedia.org/r/918536 (owner: 10Marostegui) [06:46:49] (03CR) 10Muehlenhoff: [C: 03+2] Remove bastion role from bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/918430 (owner: 10Muehlenhoff) [06:49:54] (03PS1) 10Muehlenhoff: Remove bast2002 from SSH config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/918926 [06:50:38] (03PS1) 10Marostegui: pc2014: Move it to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/918927 [06:51:38] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/918927 (owner: 10Marostegui) [06:52:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2002.wikimedia.org [06:57:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede) [06:58:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2002.wikimedia.org [07:00:04] Amir1, apergos, and jnuche: Dear deployers, time to do the UTC morning backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0700). [07:00:57] morning! no patches are scheduled for deployment during this window. [07:01:42] and no one is signed up for training either, which is just as well, given the absence of patches. [07:02:02] so, have a nice day everyone and we'll see you next time! 
[07:02:47] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove bast2002 from SSH config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/918926 (owner: 10Muehlenhoff) [07:03:18] (03CR) 10Muehlenhoff: [C: 03+2] Add a generic Cassandra reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [07:04:18] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Failover pc2 eqiad master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918535 (owner: 10Marostegui) [07:05:05] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Failover pc2 eqiad master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918535 (owner: 10Marostegui) [07:05:45] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:918535|Revert "ProductionServices.php: Failover pc2 eqiad master"]] [07:07:13] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:918535|Revert "ProductionServices.php: Failover pc2 eqiad master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [07:10:00] (03PS1) 10Marostegui: pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/918930 [07:13:26] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:918535|Revert "ProductionServices.php: Failover pc2 eqiad master"]] (duration: 07m 41s) [07:13:43] (03CR) 10Marostegui: [C: 03+2] pc1014: Move to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/918930 (owner: 10Marostegui) [07:14:10] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 20940 [07:15:42] (03CR) 10Marostegui: [C: 03+1] "Whenever you consider this is ready, it has my +1" [puppet] - 10https://gerrit.wikimedia.org/r/918446 (owner: 10Jcrespo) [07:21:43] (03CR) 10Stevemunene: Configure product analytics airflow instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 
10Stevemunene) [07:30:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [07:32:25] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) The installer is working fine for baremetal and VM installations, but there will be a few more RC releases before the final release, so keeping the task open for now.... [07:36:41] (03CR) 10Stevemunene: [C: 03+2] Configure product analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/909960 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [07:37:58] (03CR) 10Stevemunene: [C: 03+2] Dummy db for new product analytics airflow [labs/private] - 10https://gerrit.wikimedia.org/r/911319 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [07:38:21] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Dummy db for new product analytics airflow [labs/private] - 10https://gerrit.wikimedia.org/r/911319 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [07:38:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 2518 [07:38:50] (03CR) 10Stevemunene: [C: 03+2] Create scap deployment source for product analytics [puppet] - 10https://gerrit.wikimedia.org/r/912834 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [07:39:10] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 2518 [07:41:07] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [07:43:53] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [07:43:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.cassandra.roll-reboot (exit_code=1) rolling reboot on A:cassandra-dev 
[07:47:29] (03CR) 10Jelto: "looks mostly good, but https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/http" [puppet] - 10https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: 10Dzahn) [07:49:31] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [07:50:09] (03PS1) 10Muehlenhoff: Fix imported class [cookbooks] - 10https://gerrit.wikimedia.org/r/919000 [07:53:08] (03PS2) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) [07:53:20] (03CR) 10CI reject: [V: 04-1] Fix imported class [cookbooks] - 10https://gerrit.wikimedia.org/r/919000 (owner: 10Muehlenhoff) [07:56:09] (03CR) 10Volans: "Looks sane, I have some final questions/comments inline." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede) [07:57:45] (03CR) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [07:58:13] (03PS2) 10Muehlenhoff: Fix imported class [cookbooks] - 10https://gerrit.wikimedia.org/r/919000 [07:58:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] CI: Diff scaffold changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/917339 (owner: 10JMeybohm) [07:58:45] (03Merged) 10jenkins-bot: CI: Diff scaffold changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/917339 (owner: 10JMeybohm) [08:00:04] hashar and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0800) [08:00:57] o/ [08:01:03] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/919000 (owner: 10Muehlenhoff) [08:01:10] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919003 (https://phabricator.wikimedia.org/T330214) [08:01:12] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919003 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [08:01:41] (03CR) 10Stevemunene: [C: 03+2] Place airflow1006 in airflow role [puppet] - 10https://gerrit.wikimedia.org/r/918566 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [08:01:57] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919003 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [08:02:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:02:40] (03CR) 10Muehlenhoff: [C: 03+2] Fix imported class [cookbooks] - 10https://gerrit.wikimedia.org/r/919000 (owner: 10Muehlenhoff) [08:05:00] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [08:05:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:22] (03CR) 10Volans: [C: 03+1] "Thanks! 
Ship it" [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [08:06:46] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [08:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:08:51] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.8 refs T330214 [08:08:55] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [08:11:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:13:01] !log installing Linux 4.19.282 updates on Buster systems [08:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:13] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:24:35] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:33:53] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [08:34:57] (03PS1) 10Slyngshede: signup:blocklist Expand blocklist feature [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 [08:37:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:38:29] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:10] !log `apt-get clean` on orespoolcounter nodes to free space in the root partition [08:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:15] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [08:41:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev [08:45:43] (03CR) 10Volans: [C: 04-1] "There are still some things to fix but looks much better, thanks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [08:54:38] <_joe_> jouncebot: nowandnext [08:54:38] For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0800) [08:54:39] In 1 hour(s) and 5 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1000) [08:54:39] In 1 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1000) [08:55:09] <_joe_> uhhh why is the train running at the same time as one mw infra windows? [08:56:31] !log jelto@cumin1001 END (ERROR) - Cookbook sre.gitlab.upgrade (exit_code=97) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [08:58:39] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10jbond) a:03jbond [08:59:14] jouncebot: now [08:59:15] For the next 1 hour(s) and 0 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T0800) [08:59:21] _joe_: the train is now [08:59:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918495 (owner: 10Volans) [08:59:35] but Services and MW infra are at the same time, yes [08:59:44] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [09:00:05] (03CR) 10Jbond: [C: 03+2] dnsquery: bump to v5.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/918580 (owner: 10Jbond) [09:00:16] <_joe_> volans: ahh "for the next" yeah I misread [09:01:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:01:43] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:02:01] (03CR) 10Jbond: "i personally think we should create individual accounts." [puppet] - 10https://gerrit.wikimedia.org/r/918519 (https://phabricator.wikimedia.org/T336357) (owner: 10Btullis) [09:03:23] (03CR) 10Kamila Součková: [C: 03+1] thumbor: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/918494 (owner: 10Hnowlan) [09:04:29] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:23] 10SRE, 10Infrastructure-Foundations, 10LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (10jbond) @SLyngshede-WMF, @MoritzMuehlenhoff this seems like something that fits with the IDM. I'm not sure it needs to be part of the bitu so... 
[09:12:45] (03CR) 10Hnowlan: [C: 03+2] thumbor: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/918494 (owner: 10Hnowlan) [09:13:30] (03Merged) 10jenkins-bot: thumbor: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/918494 (owner: 10Hnowlan) [09:14:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:30] (03PS1) 10Stevemunene: Remove redundant analytics-product group [puppet] - 10https://gerrit.wikimedia.org/r/919018 (https://phabricator.wikimedia.org/T333000) [09:20:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:21:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:50] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41117/console" [puppet] - 10https://gerrit.wikimedia.org/r/919018 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [09:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137', diff saved to https://phabricator.wikimedia.org/P48181 and previous config saved to /var/cache/conftool/dbconfig/20230511-092848-root.json [09:29:09] (03PS9) 10Slyngshede: Django 3.2 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 [09:34:42] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (10MoritzMuehlenhoff) p:05Medium→03Low [09:34:53] 10SRE, 10Bitu, 
Infrastructure-Foundations, LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (MoritzMuehlenhoff) >>! In T271587#8843926, @jbond wrote: > @SLyngshede-WMF, @MoritzMuehlenhoff this seems like something that fits... [09:34:59] (PS7) Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T277083) [09:35:50] !log installing distro-info-data updates on buster [09:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:58] SRE, ops-codfw, Infrastructure-Foundations, Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) >>! In T336428#8843550, @Papaul wrote: > I took a quick look at lvs2012, the server can ping 10.192.16.1 and 10.192.32.1 but... [09:40:05] (CR) Jbond: [C: +1] add tunnelencabulator [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: CDanis) [09:46:21] (CR) Filippo Giunchedi: [C: +1] team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T277083) (owner: Jbond) [09:48:07] (CR) Filippo Giunchedi: [C: +1] hierdata: add swift (thanos) mw-event-enrichment account [puppet] - https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) (owner: Eevans) [09:49:02] (PS1) Aqu: Add Airflow configuration to connect to DataHub [puppet] - https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [09:49:58] (PS2) Aqu: Add Airflow configuration to connect to DataHub [puppet] - https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) [10:00:04] mvolz: My dear minions, it's time we take the moon!
Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1000). [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1000) [10:03:42] (PS1) Jelto: gitlab: make sure letsencrypt extention is disabled [puppet] - https://gerrit.wikimedia.org/r/919022 (https://phabricator.wikimedia.org/T336476) [10:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48182 and previous config saved to /var/cache/conftool/dbconfig/20230511-100628-root.json [10:09:20] (CR) Btullis: [C: +1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/919018 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene) [10:10:25] mvolz: My dear minions, it's time we take the moon! .> [10:10:44] !log installing protobuf security updates [10:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:13:41] (Abandoned) Btullis: Add an ldap_only user for bishopfox [puppet] - https://gerrit.wikimedia.org/r/918519 (https://phabricator.wikimedia.org/T336357) (owner: Btullis) [10:15:17] SRE, ops-codfw, Infrastructure-Foundations, Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) Ok I think I see what the issue is. Looking at the [[ https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt | k...
[10:17:17] !log installing modsecurity-crs security updates [10:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48183 and previous config saved to /var/cache/conftool/dbconfig/20230511-102133-root.json [10:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [10:23:54] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:23:55] (CR) Volans: [C: +2] reports.network: better variable naming [software/netbox-extras] - https://gerrit.wikimedia.org/r/918495 (owner: Volans) [10:23:59] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:25:11] (Merged) jenkins-bot: reports.network: better variable naming [software/netbox-extras] - https://gerrit.wikimedia.org/r/918495 (owner: Volans) [10:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48184 and previous config saved to /var/cache/conftool/dbconfig/20230511-103638-root.json [10:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48185 and previous config saved to /var/cache/conftool/dbconfig/20230511-105142-root.json [10:55:34] SRE, Wikidata, wdwb-tech, Shape Expressions (M2: Linking to EntitySchemas in statements), and 3 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (Arian_Bozorg) [11:04:56] jouncebot: nowandnext [11:04:56] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [11:04:56] In 1 hour(s)
and 55 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [11:04:56] In 1 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [11:06:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48186 and previous config saved to /var/cache/conftool/dbconfig/20230511-110647-root.json [11:06:53] GitLab needs a short maintenance break in one hour (for around 5 minutes) [11:08:06] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [11:10:34] (PS1) Jbond: O:traffic: Add lvs::kernel_config during insetup to allow reimages [puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) [11:11:51] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41118/console" [puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) (owner: Jbond) [11:11:57] SRE, ops-codfw, Infrastructure-Foundations, Traffic, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (jbond) > Ok I think I see what the issue is Nice work on the investigation > I'm also not sure if this config...
[11:12:36] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41119/console" [puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) (owner: Jbond) [11:13:38] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) (owner: Jbond) [11:15:12] SRE, CAS-SSO, Infrastructure-Foundations, serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) a:Jelto Refactoring of omniauth providers looks good on all instances. Changes as expected. Thanks again @jbond for preparin... [11:18:58] (CR) Jbond: [C: +2] team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T277083) (owner: Jbond) [11:20:52] SRE, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Improve alerting for hosts with Puppet disabled for longer periods - https://phabricator.wikimedia.org/T277083 (jbond) > Improve the Icinga alerting so that a single host with Puppet disabled for more than a week becomes a... [11:21:14] SRE-tools, Infrastructure-Foundations, netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) p:Triage→Medium [11:21:38] (PS1) Volans: secrets: add ZTP script for install_server [labs/private] - https://gerrit.wikimedia.org/r/919037 (https://phabricator.wikimedia.org/T336485) [11:21:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48187 and previous config saved to /var/cache/conftool/dbconfig/20230511-112152-root.json [11:24:39] (CR) Cathal Mooney: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) (owner: Jbond) [11:24:39] SRE, ops-codfw, Infrastructure-Foundations, Traffic, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (cmooney) Ah cool John thanks for the explanation. > Seems like it would be an improvement to what we currently... [11:26:41] (CR) Cathal Mooney: [C: +1] "LGTM! Will look on the non-dummy repo and check the code." [labs/private] - https://gerrit.wikimedia.org/r/919037 (https://phabricator.wikimedia.org/T336485) (owner: Volans) [11:27:36] (CR) Volans: [V: +2 C: +2] secrets: add ZTP script for install_server [labs/private] - https://gerrit.wikimedia.org/r/919037 (https://phabricator.wikimedia.org/T336485) (owner: Volans) [11:27:58] SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) I have discussed this with @jbond and @MoritzMuehlenhoff and I can appreciate now that it wo...
[11:29:21] SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) [11:36:24] (PS15) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 [11:36:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48189 and previous config saved to /var/cache/conftool/dbconfig/20230511-113657-root.json [11:37:32] (PS16) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 [11:39:29] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [11:41:18] (PS17) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 [11:44:15] PROBLEM - Check systemd state on prometheus3002 is CRITICAL: CRITICAL - degraded: The following units failed: alerts-deploy@ext.service,alerts-deploy@k8s-mlserve.service,alerts-deploy@k8s-staging.service,alerts-deploy@k8s.service,alerts-deploy@local.service,alerts-deploy@ops.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:09] (CR) David Caro: [C: +2] toolforge_cli: add api gateway url and builds endpoint [puppet] - https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: David Caro) [11:45:31] (CR) Slyngshede: sre.hosts.reimage: merge reimage cookbooks (13 comments) [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede) [11:47:23] (PS10) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 [11:47:55] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:48:33] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:49:34] (CR) Lgaulia: [C: +1] Enable First Input Delay events. [mediawiki-config] - https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) (owner: Phedenskog) [11:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48190 and previous config saved to /var/cache/conftool/dbconfig/20230511-115201-root.json [11:53:09] (PS9) Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - https://gerrit.wikimedia.org/r/918410 [11:53:33] (CR) Jbond: wmcs::firewall: add a way to block addresses in wmcs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/918410 (owner: Jbond) [11:53:39] (CR) Slyngshede: "After suggestion from Riccardo the generated SQL queries was checked. While Python 3.5 and Django 2.2 will indeed give a deprecation warni" [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 (owner: Slyngshede) [11:54:31] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [11:56:28] (CR) Btullis: [C: +1] "Apologies for the delay in reviewing. Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/753479 (owner: Jbond) [11:57:02] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [11:57:52] (CR) Muehlenhoff: "Looks good!
A few final typos inline" [software/bitu] - https://gerrit.wikimedia.org/r/908769 (owner: Slyngshede) [11:58:26] (CR) Slyngshede: [V: +2 C: +2] signup: allow blocking of username with regex (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: Slyngshede) [11:59:22] (CR) Slyngshede: "Followup patch addressing comment from Bryan." [software/bitu] - https://gerrit.wikimedia.org/r/919005 (owner: Slyngshede) [12:00:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:00:44] (PS6) Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) [12:01:18] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [12:03:29] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:10:37] (CR) Atieno: [C: +1] Add discovery records for device-analytics [dns] - https://gerrit.wikimedia.org/r/917306 (https://phabricator.wikimedia.org/T335505) (owner: Hnowlan) [12:12:22] (CR) Volans: [C: +1] "LGTM, thx" [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 (owner: Slyngshede) [12:18:32] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bullseye [12:22:19] (PS1) David Caro: Revert "toolforge_cli: add api gateway url and builds endpoint" [puppet] - https://gerrit.wikimedia.org/r/918542 [12:22:29] (CR) David Caro: [C: +2] Revert "toolforge_cli: add api gateway url and builds endpoint" [puppet] - https://gerrit.wikimedia.org/r/918542 (owner: David Caro) [12:24:57] (CR) Volans: [C: +1] "LGTM, replies inline" [cookbooks]
- https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede) [12:40:49] (PS1) Ladsgroup: Add outreachwiki to wikidataclient dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/919051 (https://phabricator.wikimedia.org/T171140) [12:41:14] !log creating wikidata client tables for outreachwiki (T171140) [12:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:18] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [12:41:30] (CR) CI reject: [V: -1] Add outreachwiki to wikidataclient dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/919051 (https://phabricator.wikimedia.org/T171140) (owner: Ladsgroup) [12:42:55] (PS2) Ladsgroup: Add outreachwiki to wikidataclient dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/919051 (https://phabricator.wikimedia.org/T171140) [12:43:30] jouncebot: nowandnext [12:43:30] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [12:43:30] In 0 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [12:43:30] In 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [12:44:13] SRE, SRE-tools, Infrastructure-Foundations, netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) Thanks for filing this one! I'm happy with the script in the private repo, but I think it would help if @ayounsi also had a quick look...
[12:45:33] (CR) Ladsgroup: [C: +2] Add outreachwiki to wikidataclient dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/919051 (https://phabricator.wikimedia.org/T171140) (owner: Ladsgroup) [12:46:06] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (SLyngshede-WMF) [12:46:18] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (SLyngshede-WMF) p:Triage→Low [12:46:22] (Merged) jenkins-bot: Add outreachwiki to wikidataclient dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/919051 (https://phabricator.wikimedia.org/T171140) (owner: Ladsgroup) [12:47:24] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:919051|Add outreachwiki to wikidataclient dblist (T171140)]] [12:47:28] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [12:48:53] (CR) Filippo Giunchedi: [C: +2] thanos: add thanos-fe[12]004 to memcache and conftool [puppet] - https://gerrit.wikimedia.org/r/918418 (https://phabricator.wikimedia.org/T336348) (owner: Filippo Giunchedi) [12:48:57] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:919051|Add outreachwiki to wikidataclient dblist (T171140)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [12:49:15] RECOVERY - Check systemd state on prometheus3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:44] SRE, SRE-tools, Infrastructure-Foundations, netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) [12:54:15] (PS1) Volans: dhcp: expand support for hostname based match [software/spicerack] - https://gerrit.wikimedia.org/r/919052
(https://phabricator.wikimedia.org/T336485) [12:54:31] SRE, ops-codfw, Infrastructure-Foundations, Traffic, Patch-For-Review: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (Papaul) >>! In T336428#8844009, @cmooney wrote: >>>! In T336428#8843550, @Papaul wrote: >> I took a quick look... [12:54:47] !log roll-restart thanos-fe swift-proxy to apply config changes - T336348 [12:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:51] T336348: Put in service thanos-fe[12]004 - https://phabricator.wikimedia.org/T336348 [12:56:05] SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) [12:58:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:919051|Add outreachwiki to wikidataclient dblist (T171140)]] (duration: 11m 05s) [12:58:33] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [12:58:35] (CR) CI reject: [V: -1] dhcp: expand support for hostname based match [software/spicerack] - https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) (owner: Volans) [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300). [13:00:06] No Gerrit patches in the queue for this window AFAICS.
[13:00:46] (PS1) Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) [13:00:48] (PS2) Volans: dhcp: expand support for hostname based match [software/spicerack] - https://gerrit.wikimedia.org/r/919052 (https://phabricator.wikimedia.org/T336485) [13:01:52] SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) [13:02:23] (PS3) Elukey: service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) [13:04:55] (PS1) David Caro: Revert "Revert "toolforge_cli: add api gateway url and builds endpoint"" [puppet] - https://gerrit.wikimedia.org/r/918544 [13:05:05] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=thanos-fe1004.eqiad.wmnet [13:05:46] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1004.eqiad.wmnet,service=thanos-web [13:06:36] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=thanos-fe2004.eqiad.wmnet [13:06:45] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2004.eqiad.wmnet [13:07:01] !log filippo@cumin1001 conftool action : set/pooled=true; selector: name=thanos-fe2004.eqiad.wmnet [13:07:22] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2004.codfw.wmnet [13:07:36] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2004.codfw.wmnet,service=thanos-web [13:07:52] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=thanos-fe2004.codfw.wmnet [13:09:22] (PS1) Giuseppe Lavagetto: base.meta.pod_annotations: support annotations for prometheus scraping [deployment-charts] - https://gerrit.wikimedia.org/r/919055
(https://phabricator.wikimedia.org/T271822) [13:09:51] (CR) Stevemunene: [V: +1 C: +2] Remove redundant analytics-product group [puppet] - https://gerrit.wikimedia.org/r/919018 (https://phabricator.wikimedia.org/T333000) (owner: Stevemunene) [13:11:57] (PS1) Jelto: gitlab: add check for running backups in the background [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [13:13:27] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye [13:13:30] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host testvm2002.codfw.wmnet with OS bullseye [13:14:44] (PS2) Jelto: gitlab: add check for running backups in the background [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [13:16:44] (CR) CI reject: [V: -1] gitlab: add check for running backups in the background [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: Jelto) [13:16:59] (CR) Volans: [C: -1] gitlab: add check for running backups in the background (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: Jelto) [13:17:33] (PS3) Jelto: gitlab: add check for running backups in the background [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [13:18:06] (CR) CDanis: [C: +2] add tunnelencabulator (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: CDanis) [13:18:14] (CR) CDanis: [V: +2 C: +2] add tunnelencabulator [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/902826
(https://phabricator.wikimedia.org/T266784) (owner: CDanis) [13:19:43] (PS2) Giuseppe Lavagetto: base.meta.pod_annotations: support annotations for prometheus scraping [deployment-charts] - https://gerrit.wikimedia.org/r/919055 (https://phabricator.wikimedia.org/T271822) [13:19:46] (PS1) Giuseppe Lavagetto: mediawiki: update modules, enable named ports [deployment-charts] - https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) [13:19:58] (CR) Jbond: [V: +1 C: +2] O:traffic: Add lvs::kernel_config during insetup to allow reimages [puppet] - https://gerrit.wikimedia.org/r/919034 (https://phabricator.wikimedia.org/T336428) (owner: Jbond) [13:21:57] SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (Joe) a:Joe [13:21:59] !log upload benthos 4.15.0-1 to {buster,bullseye}-wikimedia - T331801 [13:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:03] T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 [13:23:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:23:04] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:23:13] (PS10) Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - https://gerrit.wikimedia.org/r/918410 [13:23:15] (PS1) Jbond: firewall: Remove kafka_brokers_analytics [puppet] - https://gerrit.wikimedia.org/r/919059 [13:23:17] (PS1) Jbond: profile::base::firewall: move to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [13:23:19] (PS1) Jbond: firewall: add basic
firewall class [puppet] - https://gerrit.wikimedia.org/r/919061 [13:23:21] (PS1) Jbond: firewall: migrate ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [13:23:23] (CR) Jbond: [V: +1 C: +2] admin: make admin::kerberos_users more generic [puppet] - https://gerrit.wikimedia.org/r/753479 (owner: Jbond) [13:23:52] (CR) CI reject: [V: -1] firewall: Remove kafka_brokers_analytics [puppet] - https://gerrit.wikimedia.org/r/919059 (owner: Jbond) [13:23:53] SRE, Infrastructure-Foundations: Add support for nftables in profile::base::firewall - https://phabricator.wikimedia.org/T336497 (MoritzMuehlenhoff) [13:24:13] (CR) CI reject: [V: -1] firewall: add basic firewall class [puppet] - https://gerrit.wikimedia.org/r/919061 (owner: Jbond) [13:24:36] (CR) CI reject: [V: -1] profile::base::firewall: move to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: Jbond) [13:24:46] (CR) CI reject: [V: -1] firewall: migrate ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: Jbond) [13:26:22] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [13:26:35] (PS1) Herron: mwlog: rotate api.log hourly [puppet] - https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) [13:26:52] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:26:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:29:05] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host
reimage [13:29:31] (PS1) Elukey: benthos: use kafka_franz for the webrequest_live instance [puppet] - https://gerrit.wikimedia.org/r/919064 (https://phabricator.wikimedia.org/T331801) [13:31:49] (CR) Filippo Giunchedi: [C: +1] "LGTM, nice" [puppet] - https://gerrit.wikimedia.org/r/919064 (https://phabricator.wikimedia.org/T331801) (owner: Elukey) [13:31:55] (PS2) Jbond: firewall: Remove kafka_brokers_analytics [puppet] - https://gerrit.wikimedia.org/r/919059 [13:31:57] (PS2) Jbond: profile::base::firewall: move to profile::firewall [puppet] - https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [13:31:59] (PS2) Jbond: firewall: add basic firewall class [puppet] - https://gerrit.wikimedia.org/r/919061 [13:32:01] (PS2) Jbond: firewall: migrate ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [13:32:50] (CR) jenkins-bot: firewall: add basic firewall class [puppet] - https://gerrit.wikimedia.org/r/919061 (owner: Jbond) [13:34:04] (CR) jenkins-bot: firewall: migrate ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: Jbond) [13:34:47] (CR) Hashar: "Checking on gerrit1001 there is:" [puppet] - https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: Hashar) [13:37:32] (PS1) Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/919066 [13:38:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:38:09] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:40:24] (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/919066 (owner:
Muehlenhoff) [13:40:33] SRE, ops-codfw, Infrastructure-Foundations, Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (ssingh) Thanks @cmooney and @jbond for the extensive debugging! Looking at the above discussion, I think I should have mentioned tha... [13:41:45] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [13:41:49] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host testvm2002.codfw.wmnet with OS bullseye completed: - testvm2002 (**PASS**) - Dow... [13:42:02] jouncebot: next [13:42:03] In 0 hour(s) and 17 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1400) [13:42:09] PROBLEM - Checks that the airflow database for airflow analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:43:30] (PS18) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 [13:44:05] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:45:29] (PS4) Jelto: gitlab: add check for running backups in the background [cookbooks] - https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [13:46:23]
(03PS1) 10Ssingh: hiera: fix dns*.yaml resolving nameservers [puppet] - 10https://gerrit.wikimedia.org/r/919067 (https://phabricator.wikimedia.org/T330670) [13:47:38] (03CR) 10CI reject: [V: 04-1] gitlab: add check for running backups in the background [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: 10Jelto) [13:47:54] (03PS1) 10David Caro: toolsbeta: refresh prometheus cert [puppet] - 10https://gerrit.wikimedia.org/r/919068 (https://phabricator.wikimedia.org/T336495) [13:48:15] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/output/919063/41123/" [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron) [13:48:28] (03PS5) 10Jelto: gitlab: add check for running backups in the background [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) [13:48:45] (03CR) 10Hashar: "With `puppet agent -tv --debug`:" [puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:49:21] !log uploaded wmf-laptop 0.5.7 to component/wmf-sre-laptop [13:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:41] 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10SCherukuwada) [13:50:07] (03PS1) 10Ladsgroup: Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/919070 (https://phabricator.wikimedia.org/T335011) [13:50:16] (03CR) 10Hashar: [C: 03+1] "This is merely a manifest cleanup. The config files have been removed a while ago, we no more need to `ensure => absent` them." 
[puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:50:27] (03PS5) 10Filippo Giunchedi: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [13:50:29] (03PS1) 10Filippo Giunchedi: tox: use python 3.9/3.11 (Bullseye/Bookworm) [alerts] - 10https://gerrit.wikimedia.org/r/919071 [13:51:10] (03CR) 10Jelto: gitlab: add check for running backups in the background (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/919057 (https://phabricator.wikimedia.org/T336490) (owner: 10Jelto) [13:51:27] (03CR) 10Filippo Giunchedi: Add alert for server-side NIC errors (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [13:52:53] (03CR) 10Filippo Giunchedi: [C: 03+2] tox: use python 3.9/3.11 (Bullseye/Bookworm) [alerts] - 10https://gerrit.wikimedia.org/r/919071 (owner: 10Filippo Giunchedi) [13:52:55] (03PS2) 10Filippo Giunchedi: tox: use python 3.9/3.11 (Bullseye/Bookworm) [alerts] - 10https://gerrit.wikimedia.org/r/919071 [13:52:59] (03CR) 10Filippo Giunchedi: [V: 03+2] tox: use python 3.9/3.11 (Bullseye/Bookworm) [alerts] - 10https://gerrit.wikimedia.org/r/919071 (owner: 10Filippo Giunchedi) [13:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2024.codfw.wmnet with reason: Maintenance [13:53:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2024.codfw.wmnet with reason: Maintenance [13:53:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2024 (T335845)', diff saved to https://phabricator.wikimedia.org/P48191 and previous config saved to /var/cache/conftool/dbconfig/20230511-135335-ladsgroup.json [13:54:39] (03PS1) 10Slyngshede: Search: add function for search users. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476) [13:54:58] (03CR) 10CI reject: [V: 04-1] Search: add function for search users. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476) (owner: 10Slyngshede) [13:56:33] 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) a:03Dzahn [13:57:08] !log upgrade benthos (4.9.1 -> 4.15.0) on centrallog nodes - T331801 [13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:12] T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 [13:57:33] (03PS11) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 [13:57:35] (03PS3) 10Jbond: firewall: Remove kafka_brokers_analytics [puppet] - 10https://gerrit.wikimedia.org/r/919059 [13:57:36] (03PS3) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [13:57:38] (03PS3) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [13:57:40] (03PS3) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [13:58:15] jouncebot: next [13:58:15] In 0 hour(s) and 1 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1400) [13:58:25] jouncebot: nowandnext [13:58:26] For the next 0 hour(s) and 1 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [13:58:26] For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1300) [13:58:26] In 0 hour(s) and 1 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1400) [13:58:44] mutante: unlikely we are going to do the LVS maintenance in this window pending T336428 being resolved [13:58:44] T336428: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 [13:58:47] so go for it :) [13:59:08] sukhe: aah:) cool, thank you [13:59:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41128/console" [puppet] - 10https://gerrit.wikimedia.org/r/919059 (owner: 10Jbond) [13:59:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41127/console" [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [14:00:05] sukhe: Dear deployers, time to do the LVS maintenance deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1400). 
[14:00:27] (03CR) 10David Caro: [C: 03+2] toolsbeta: refresh prometheus cert [puppet] - 10https://gerrit.wikimedia.org/r/919068 (https://phabricator.wikimedia.org/T336495) (owner: 10David Caro) [14:01:04] (03CR) 10CI reject: [V: 04-1] firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [14:01:28] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:02:40] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Increase task manager memory alloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) (owner: 10Bking) [14:02:56] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41131/console" [puppet] - 10https://gerrit.wikimedia.org/r/919019 (https://phabricator.wikimedia.org/T333004) (owner: 10Aqu) [14:04:38] (03PS4) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [14:04:40] (03PS4) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [14:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2024 (T335845)', diff saved to https://phabricator.wikimedia.org/P48192 and previous config saved to /var/cache/conftool/dbconfig/20230511-140440-ladsgroup.json [14:04:42] (03PS4) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [14:04:59] (03Merged) 10jenkins-bot: rdf-streaming-updater: Increase task manager memory alloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) (owner: 10Bking) [14:05:17] !log gmodena@deploy1002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:05:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:05:48] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/919074 (https://phabricator.wikimedia.org/T253093) [14:06:02] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:06:04] (03CR) 10CI reject: [V: 04-1] firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [14:06:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41132/console" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:07:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41134/console" [puppet] - 10https://gerrit.wikimedia.org/r/919074 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:07:58] (03PS2) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/919074 (https://phabricator.wikimedia.org/T253093) [14:08:33] !log starting Gerrit Switchover (Take II): The Reckoning [14:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:49] (03PS12) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 [14:08:51] (03PS4) 10Jbond: firewall: Remove kafka_brokers_analytics [puppet] - 10https://gerrit.wikimedia.org/r/919059 [14:08:53] (03PS5) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 
10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [14:08:55] !log bking@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:08:55] (03PS5) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [14:08:57] (03PS5) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [14:09:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41135/console" [puppet] - 10https://gerrit.wikimedia.org/r/919074 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:09:59] !log bking@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:11:21] !log bking@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:11:59] (03CR) 10CI reject: [V: 04-1] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:12:40] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:12:59] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/919074 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [14:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:14:58] !log sudo cumin -b1 -s1200 'A:cp and A:codfw' 'varnish-frontend-restart': T253093 [14:15:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 [14:15:25] !log bking@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:15:57] !log bking@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:16:47] (03CR) 10Elukey: [C: 03+2] benthos: use kafka_franz for the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/919064 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [14:16:57] (03PS1) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [14:17:42] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) [14:17:48] (03CR) 10Marostegui: [C: 03+1] Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/919070 (https://phabricator.wikimedia.org/T335011) (owner: 10Ladsgroup) [14:18:13] 10SRE, 10Infrastructure-Foundations, 10Traffic: LVS servers using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (10cmooney) p:05Triage→03Low [14:19:17] (03CR) 10CI reject: [V: 04-1] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [14:19:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: maintenance [14:19:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: maintenance [14:19:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2024', diff saved to https://phabricator.wikimedia.org/P48194 and previous config saved to 
/var/cache/conftool/dbconfig/20230511-141947-ladsgroup.json [14:19:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: maintenance [14:20:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: maintenance [14:22:10] (03PS1) 10Elukey: benthos::instance: add --skip-env-var-check to lint [puppet] - 10https://gerrit.wikimedia.org/r/919077 (https://phabricator.wikimedia.org/T331801) [14:23:01] (03PS2) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [14:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:23:39] (03CR) 10Thcipriani: [C: 03+1] Revert "Revert "gerrit: switch service IP, turn new into current and current into old"" [dns] - 10https://gerrit.wikimedia.org/r/918529 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:24:01] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "gerrit: switch service IP, turn new into current and current into old"" [dns] - 10https://gerrit.wikimedia.org/r/918529 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:24:39] (03CR) 10Elukey: "elukey@centrallog1002:~$ /usr/bin/benthos lint --skip-env-var-check /etc/benthos/webrequest_live.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/919077 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [14:25:07] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:25:29] (03PS3) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [14:25:33] (03CR) 10CI reject: [V: 04-1] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [14:25:35] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos::instance: add --skip-env-var-check to lint [puppet] - 10https://gerrit.wikimedia.org/r/919077 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [14:26:22] (03PS1) 10Ladsgroup: Prepare for the new release of 0.10 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) [14:26:36] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2012 [14:26:43] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2012 [14:27:44] !log installing avahi security updates [14:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:29] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:51] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:59] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:03] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:25] PROBLEM 
- Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:49] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:55] (JobUnavailable) firing: (3) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:33:33] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:05] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:46] 10SRE, 10wmf-sre-laptop: distribute tunnelencabulator in wmf-sre-laptop - https://phabricator.wikimedia.org/T266784 (10CDanis) 05Open→03Resolved a:03CDanis `13:49 uploaded wmf-laptop 0.5.7 to component/wmf-sre-laptop` [14:34:48] 10SRE: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) [14:34:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2024', diff saved to https://phabricator.wikimedia.org/P48195 and previous config saved to /var/cache/conftool/dbconfig/20230511-143453-ladsgroup.json [14:35:43] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:05] RECOVERY - Check systemd state on 
contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:55] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:33] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:11] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:21] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:07] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:21] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:53] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:55] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:00] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance es2024 (T335845)', diff saved to https://phabricator.wikimedia.org/P48197 and previous config saved to /var/cache/conftool/dbconfig/20230511-144959-ladsgroup.json [15:04:40] Gerrit is back up but CI is not ready yet cause the WMCS VMs are unable to speak to the new host [15:05:28] (ack, was wondering ^) [15:15:58] something something about networking routing between WMCS (which has the CI instances) and production network (which hosts Gerrit) [15:18:35] !log altering image_suggestions schema (generated data platform) — T336424 [15:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:39] T336424: Add section_index and item_id columns to image_suggestions.suggestions table schema - https://phabricator.wikimedia.org/T336424 [15:21:40] !log running homer for CR 919151: resolve connection issues to gerrit.wikimedia.org [15:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:23] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [15:27:37] !log [done] running homer for CR 919151: resolve connection issues to gerrit.wikimedia.org [15:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/919126 (owner: 10Ssingh) [15:29:05] (03CR) 10RLazarus: [C: 03+2] "Thanks for the review!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [15:31:55] (03Abandoned) 10RLazarus: alerting_host: Disable vopsbot in #wikimedia-sre [puppet] - 10https://gerrit.wikimedia.org/r/917915 (https://phabricator.wikimedia.org/T329791) (owner: 10RLazarus) [15:32:26] (03CR) 10Majavah: [C: 03+1] buildservice: set the PORT=8000 env var (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:33:54] (03CR) 10CI reject: [V: 04-1] Revert "benthos: use kafka_franz for the webrequest_live instance" [puppet] - 10https://gerrit.wikimedia.org/r/919166 (owner: 10Elukey) [15:34:16] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/919166 (owner: 10Elukey) [15:35:12] (03PS2) 10Elukey: Revert "benthos: use kafka_franz for the webrequest_live instance" [puppet] - 10https://gerrit.wikimedia.org/r/919166 [15:35:52] we probably need a Zuul restart at this stage [15:35:56] Zuul CI is backlogged and still processing test results [15:36:02] hashar: thanks [15:36:03] I am investigating the slowness [15:37:55] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:17] (03CR) 10David Caro: buildservice: set the PORT=8000 env var (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:39:10] I am guessing it has a bunch of results trying to reach the old host somehow maybe [15:39:14] (03CR) 10Muehlenhoff: Prepare for the new release of 0.10 (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [15:40:40] (03PS2) 10Ladsgroup: Prepare for 
the new release of 0.10 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) [15:41:16] (03CR) 10Ladsgroup: Prepare for the new release of 0.10 (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [15:41:35] Queue lengths: 214 events, 0 results. [15:41:41] so Zuul managed to report everything that was pending [15:42:30] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/919126 (owner: 10Ssingh) [15:42:40] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [15:42:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1110.eqiad.wmnet [15:43:33] (JobUnavailable) firing: (6) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2020.codfw.wmnet with reason: Maintenance [15:45:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2020.codfw.wmnet with reason: Maintenance [15:45:29] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [15:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48198 and previous config saved to /var/cache/conftool/dbconfig/20230511-154533-ladsgroup.json [15:45:55] (03PS1) 10Giuseppe Lavagetto: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) [15:48:00] !log gerrit 
maintenance period ended - gerrit switched to new hardware, IP and distro version [15:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:02] Zuul has Queue lengths: 0 events, 0 results. [15:48:05] so I think it is fine [15:48:09] congrats all [15:48:11] awesome :) [15:48:24] thanks all [15:48:26] !log CI back up and fully operational (after the Gerrit upgrade) [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:30] (03CR) 10Majavah: [C: 03+1] buildservice: set the PORT=8000 env var (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [15:48:43] I am not going to bother investigating why Zuul went severely backlogged [15:49:03] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [15:49:14] guesses the IP was cached somehow [15:49:16] I am guessing it has a large events queue due to the Gerrit replications of all repositories after it restarted [15:49:20] but if it works now, don't bother [15:49:36] even though all those replication events are not handled by Zuul and discarded, they show up in the events queue [15:50:50] (03PS3) 10Ladsgroup: Prepare for the new release of 0.10 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) [15:51:02] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) This has now happened. gerrit.wikimedia.org is now on new hardware, a new IP and a new distro version. 
[15:51:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [15:52:00] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment: use kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/919150 (https://phabricator.wikimedia.org/T331526) (owner: 10Gmodena) [15:52:24] I might well just restart it [15:52:36] (03PS2) 10Gmodena: mediawiki-page-content-change-enrichment: use kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/919150 (https://phabricator.wikimedia.org/T331526) [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:36] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919148 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [15:53:43] (03PS3) 10Gmodena: mediawiki-page-content-change-enrichment: use kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/919150 (https://phabricator.wikimedia.org/T331526) [15:53:45] (03PS2) 10Jgreen: Add dns for new frack codfw bastion [dns] - 10https://gerrit.wikimedia.org/r/918608 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [15:53:47] (03CR) 10Ladsgroup: [C: 03+2] Prepare for the new release of 0.10 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [15:55:24] * hashar checks gerrit1001 Apache2 logs [15:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48199 and previous config saved to /var/cache/conftool/dbconfig/20230511-155607-ladsgroup.json 
[15:56:38] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment: use kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/919150 (https://phabricator.wikimedia.org/T331526) (owner: 10Gmodena) [15:56:44] (03Merged) 10jenkins-bot: remote: Clarify wait_reboot_since output [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [15:57:11] (03CR) 10Ladsgroup: [C: 03+2] "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [15:57:55] (JobUnavailable) resolved: (6) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:48] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1110.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [15:59:26] (03Abandoned) 10Elukey: Revert "benthos: use kafka_franz for the webrequest_live instance" [puppet] - 10https://gerrit.wikimedia.org/r/919166 (owner: 10Elukey) [16:00:04] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1600). nyaa~ [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1110.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [16:00:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1110.eqiad.wmnet [16:01:32] (03CR) 10Ladsgroup: [C: 03+2] Remove db1110 [puppet] - 10https://gerrit.wikimedia.org/r/919070 (https://phabricator.wikimedia.org/T335011) (owner: 10Ladsgroup) [16:01:39] (03PS2) 10Giuseppe Lavagetto: mediawiki: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919058 (https://phabricator.wikimedia.org/T271822) [16:01:41] (03PS2) 10Giuseppe Lavagetto: shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) [16:01:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: add listeners to the tls fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/919156 [16:01:56] !log Removing db1110 from zarcillo T335011 [16:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:00] T335011: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 [16:02:23] (03CR) 10BBlack: [C: 03+1] "The logic/layout looks correct, also verified all the IP addresses actually match up with the intended destination servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/919067 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:03:53] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [16:04:07] (03PS2) 10David Caro: buildservice: set the PORT=8000 env var [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) [16:04:39] (03CR) 10Bking: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919108 (https://phabricator.wikimedia.org/T333464) (owner: 10DCausse) [16:05:04] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [16:06:52] (03PS1) 10Elukey: benthos: change kafka consumer group name for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/919158 (https://phabricator.wikimedia.org/T331801) [16:07:50] (03CR) 10CI reject: [V: 04-1] shellbox: update modules, enable named ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/919155 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [16:08:10] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bullseye [16:11:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2020', diff saved to https://phabricator.wikimedia.org/P48200 and previous config saved to /var/cache/conftool/dbconfig/20230511-161113-ladsgroup.json [16:13:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Bypassing CI since the change is trivial and after gerrit maintenance there is a big backlog." 
[puppet] - 10https://gerrit.wikimedia.org/r/919158 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [16:15:11] (03PS1) 10Volans: Revert "secrets: add ZTP script for install_server" [labs/private] - 10https://gerrit.wikimedia.org/r/919167 [16:16:10] 10SRE, 10Gerrit: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (10bd808) [16:16:39] !log benthos webrequest live instances migrated to kafka-franz (new consumer client, data may have some holes) - T331801 [16:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:44] T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 [16:17:11] (03PS1) 10Volans: installserver: set dummy ZTP temporary root passwd [labs/private] - 10https://gerrit.wikimedia.org/r/919159 (https://phabricator.wikimedia.org/T336485) [16:18:03] 10SRE, 10SRE Observability, 10Patch-For-Review, 10User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10elukey) Had to change the consumer group name since Sarama and Kafka Franz (both go clients) don't play we... [16:19:06] (03PS2) 10Volans: Revert "secrets: add ZTP script for install_server" [labs/private] - 10https://gerrit.wikimedia.org/r/919167 (https://phabricator.wikimedia.org/T336485) [16:22:29] PROBLEM - Host gerrit.wikimedia.org is DOWN: CRITICAL - Destination Unreachable (gerrit.wikimedia.org) [16:22:58] 10SRE, 10SRE Observability, 10Patch-For-Review, 10User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10Volans) Thanks a lot! 
We got a small hole for text and almost nothing for upload AFAICT: {F36992438} [16:23:20] (03CR) 10Volans: [V: 03+2 C: 03+2] Revert "secrets: add ZTP script for install_server" [labs/private] - 10https://gerrit.wikimedia.org/r/919167 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [16:23:37] (03CR) 10Volans: [V: 03+2 C: 03+2] installserver: set dummy ZTP temporary root passwd [labs/private] - 10https://gerrit.wikimedia.org/r/919159 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [16:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2020', diff saved to https://phabricator.wikimedia.org/P48201 and previous config saved to /var/cache/conftool/dbconfig/20230511-162619-ladsgroup.json [16:27:52] 10SRE, 10Data-Engineering, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10BCornwall) [16:30:00] 10SRE, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10BCornwall) [16:30:25] jouncebot: next [16:30:26] In 0 hour(s) and 29 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1700) [16:30:26] In 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1700) [16:32:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) @Volans i have some switches ready for testing. 2 leaves in different rows and the 2 spines lsw1-a8 lsw1-b8 ssw1-... 
[16:33:30] (03CR) 10BryanDavis: gerrit: add host-based Hiera keys for gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [16:33:43] <_joe_> sukhe: there is an unbreak now task that might require a deployment [16:33:52] sukhe: i need to do a train rollback here... yeah, what _joe_ said [16:34:00] all good from my side, no active LVS work [16:34:01] go ahead please! [16:34:02] 10SRE, 10Gerrit: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (10bd808) https://gerrit.wikimedia.org/r/c/operations/puppet/+/909796/2/hieradata/hosts/gerrit1003.yaml has a typo in the IPv6 address. [16:34:06] thanks! [16:34:15] <_joe_> brennen: before rolling back, let me do one more test [16:34:22] _joe_: kk, holding [16:34:37] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-scheduler@analytics_product.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:06] <_joe_> brennen: go ahead [16:37:11] goin' [16:37:50] !log train 1.41.0-wmf.8 (T330214): rolling back to group1 to test for T336504 presence/absence on enwiki [16:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:56] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [16:37:57] T336504: Vector 2022 force-deploying on arbitrary pages - https://phabricator.wikimedia.org/T336504 [16:38:02] (03PS1) 10Hashar: gerrit: fix gerrit1003 ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/919161 (https://phabricator.wikimedia.org/T336524) [16:38:14] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919162 (https://phabricator.wikimedia.org/T330214) [16:38:16] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/919162 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [16:38:56] (03CR) 10Krinkle: [C: 03+2] ResourceLoader: Log when MAXAGE_RECOVER is detected [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915720 (https://phabricator.wikimedia.org/T321394) (owner: 10Krinkle) [16:39:25] (03PS2) 10Krinkle: ResourceLoader: Log when MAXAGE_RECOVER is detected [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915720 [16:39:34] (03Abandoned) 10Krinkle: ResourceLoader: Log when MAXAGE_RECOVER is detected [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915720 (owner: 10Krinkle) [16:40:06] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10hashar) [16:40:12] 10SRE, 10Gerrit, 10Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (10hashar) [16:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48203 and previous config saved to /var/cache/conftool/dbconfig/20230511-164125-ladsgroup.json [16:41:42] (03CR) 10David Caro: buildservice: set the PORT=8000 env var (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:43:21] (03CR) 10Majavah: [C: 03+1] buildservice: set the PORT=8000 env var (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:43:38] (03CR) 10Ssingh: [V: 03+2 C: 03+2] gerrit: fix gerrit1003 ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/919161 (https://phabricator.wikimedia.org/T336524) (owner: 10Hashar) [16:45:31] this rollback somewhat hung up on the above, i suspect? 
^ [16:45:41] RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:46:10] (03Merged) 10jenkins-bot: mediawiki-page-content-change-enrichment: use kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/919150 (https://phabricator.wikimedia.org/T331526) (owner: 10Gmodena) [16:46:35] <_joe_> brennen: ah damn [16:46:44] <_joe_> I finally reproduced on a page [16:47:07] (03CR) 10Hnowlan: [C: 03+2] thumbor: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/919148 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [16:47:15] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919162 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [16:47:28] 10SRE, 10Gerrit, 10Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (10bd808) 05Open→03Resolved a:03hashar ` $ ssh -6 bd808@gerrit.wikimedia.org -p 29418 **** Welcome to Gerrit Code Review **** Hi BryanDavis, you have su... 
[16:47:31] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10bd808) [16:47:39] (zuul's unstuck, so rollback proceeds) [16:47:41] (03Merged) 10jenkins-bot: Prepare for the new release of 0.10 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/919079 (https://phabricator.wikimedia.org/T336174) (owner: 10Ladsgroup) [16:47:53] (03Merged) 10jenkins-bot: thumbor: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/919148 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [16:47:54] <_joe_> annnd I just reproduced in eqiad as well [16:48:24] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:48:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:48:32] (03CR) 10David Caro: [C: 03+2] buildservice: set the PORT=8000 env var [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:49:20] 10SRE, 10Gerrit, 10Patch-For-Review: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 (10hashar) I have removed the faulty IPv6 from /etc/network/interfaces and manually removed it with: ` ip addr del 2620:0:861:2:208:80:154:51/128 dev eno8303 ` [16:49:41] (03Merged) 10jenkins-bot: buildservice: set the PORT=8000 env var [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919147 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [16:50:14] brennen: when you're done with the rollback, could I sneak in a beta-only config change merge? [16:50:25] !log CI / Zuul was slow to report build results back to Gerrit most probably due to lack of IPv6 (T336524) which should be solved now. 
[16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:30] T336524: IPv6 SSH to gerrit.wikimedia.org hangs (blackhole route?) - https://phabricator.wikimedia.org/T336524 [16:50:32] (03PS2) 10Majavah: Enable RealMe on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910769 (https://phabricator.wikimedia.org/T324535) [16:50:43] taavi: yeah, sure thing. i'll ping. [16:50:53] (03CR) 10Bking: [C: 03+2] flink-operator: only deploy it to wikikube@stagings [deployment-charts] - 10https://gerrit.wikimedia.org/r/919108 (https://phabricator.wikimedia.org/T333464) (owner: 10DCausse) [16:51:05] thank you! [16:53:23] (03Merged) 10jenkins-bot: flink-operator: only deploy it to wikikube@stagings [deployment-charts] - 10https://gerrit.wikimedia.org/r/919108 (https://phabricator.wikimedia.org/T333464) (owner: 10DCausse) [16:54:21] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:54:26] k8s image build/push is not moving fast on this one. [16:54:28] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:55:59] !log bking@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:56:08] !log bking@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:56:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:56:18] !log bking@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:56:23] !log bking@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:56:30] !log bking@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:57:49] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:58:06] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:58:42] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:58:52] <_joe_> brennen: it should be done now, right? [16:59:13] it _should_ be. 
[16:59:28] i... is the press enter to continue bug somehow actually rearing its head again? [16:59:46] [profane muttering] [17:00:04] bd808: That opportune time is upon us again. Time for a Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1700). [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1700) [17:00:24] train rollback still underway, ought to be done momentarily. [17:00:52] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [17:01:28] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [17:01:54] * bd808 has nothing to deploy in the Technical Engagement window [17:02:57] i think this train rollback should have taken about 5-6 minutes; it's currently at like 25. [17:03:37] slight increase from CI slowness early on, but something in here seems like it's periodically out of whack. [17:04:41] oof [17:04:48] > 12020 ______▇ 1702 ◍ 1704 ● MissingCategory..... 
.7 e/L/i/CategoryManager:219 Cannot find id for 'large-tables' [17:04:49] brennen: wild guess: it's adding in i18n for wmf.7 in the docker image that was removed with the group2 promotion [17:05:10] how did that not trigger an alert [17:05:41] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:06:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:06:18] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.8 refs T330214 [17:06:22] (03CR) 10Jgreen: [C: 03+2] Add dns for new frack codfw bastion [dns] - 10https://gerrit.wikimedia.org/r/918608 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [17:06:22] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [17:06:42] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:06:53] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [17:07:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:07:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:08:04] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [17:08:27] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10KFrancis) Hi 
@Dzahn, NDA's for Robert Timm and Loren Johnson are complete. Please proceed with the access request. The NDA for Adee Ritma... [17:10:00] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:11:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:11:35] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [17:12:19] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [17:12:33] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.8 refs T330214 (duration: 06m 14s) [17:12:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:12:37] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [17:17:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:24:05] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [17:25:58] taavi: all yours [17:26:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910769 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [17:27:03] (03Merged) 10jenkins-bot: Enable RealMe on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910769 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [17:27:54] and done 
[17:31:45] (03PS1) 10BryanDavis: Allow http://localhost callback URL [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) [17:33:14] (03PS1) 10Herron: profile::webperf::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) [17:36:21] (03CR) 10BryanDavis: "deployment of backport requested via https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2076616" [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) (owner: 10BryanDavis) [17:38:30] (Device rebooted) firing: Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:38:56] (03PS1) 10Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) [17:41:53] bd808: I was confused why I was getting “patch set not permitted” errors when *I* tried to create the backport :D :D [17:42:00] thanks for the backport :) [17:42:11] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41141/console" [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:42:37] lucaswerkmeister: np. having this stay broken until the hackathon seems like a poor idea [17:42:56] (03CR) 10Lucas Werkmeister: "+1 backport would be very appreciated! 
(Apparently I don’t have permissions to give an actual +1 on wmf branches o_O)" [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) (owner: 10BryanDavis) [17:43:26] if you want I can test it on mwdebug using the consumer I meant to request earlier, I still have the tab open ^^ [17:43:30] (Device rebooted) resolved: Device scs-a1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:43:52] (03PS2) 10Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) [17:46:22] oops. I put it in the wrong backport window... will fix [17:46:43] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-airflow1006.eqiad.wmnet with reason: Silence error notifications/alerts during setup [17:46:57] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-airflow1006.eqiad.wmnet with reason: Silence error notifications/alerts during setup [17:47:52] (03CR) 10Herron: "this one is meant to be an effective noop" [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:49:00] (03CR) 10BryanDavis: Allow http://localhost callback URL (031 comment) [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) (owner: 10BryanDavis) [17:49:49] (03CR) 10Krinkle: [C: 03+1] profile::webperf::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:50:38] lucaswerkmeister: it will be a few hours, unless I guess you can sweet talk b.rennen into doing it sooner [17:52:50] (03CR) 10Krinkle: [C: 03+1] profile::webperf::redis: introduce/move 
arclamp redis config to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [17:57:22] (03PS3) 10Herron: profile::arclamp::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) [17:58:16] (03PS3) 10Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) [17:58:32] (03PS4) 10Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) [18:00:05] hashar and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T1800). [18:01:50] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41143/console" [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [18:03:56] (03CR) 10Herron: [V: 03+1] profile::arclamp::redis: introduce/move arclamp redis config to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [18:04:21] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41145/console" [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [18:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale 
[18:14:59] (03CR) 10Krinkle: profile::arclamp::redis: introduce/move arclamp redis config to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [18:15:01] (03PS4) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [18:17:51] (03CR) 10CI reject: [V: 04-1] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [18:23:27] (03PS5) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [18:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:26:28] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [18:27:47] 10SRE, 10Traffic-Icebox: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10BCornwall) @Vgutierrez Can you recall whether this has happened anytime since the few years this was reported? 
[18:28:01] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (10phaultfinder) [18:28:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) [18:34:29] (03PS1) 10Andrew Bogott: update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/919210 (https://phabricator.wikimedia.org/T336236) [18:35:20] (03CR) 10CI reject: [V: 04-1] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/919210 (https://phabricator.wikimedia.org/T336236) (owner: 10Andrew Bogott) [18:35:43] (03PS1) 10Brennen Bearnes: WIP: gitlab: block auto created users [puppet] - 10https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) [18:36:37] (03CR) 10Bking: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [18:37:44] (03PS2) 10Andrew Bogott: update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/919210 (https://phabricator.wikimedia.org/T336236) [18:41:31] (03PS4) 10Ryan Kemper: wdqs: Activate wdqs2021 [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [18:41:37] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [18:42:32] (03PS6) 10Volans: installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) [18:43:56] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: Activate wdqs2021 [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [18:43:58] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: Activate 
wdqs2021 [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [18:44:36] (03CR) 10Gmodena: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [18:46:49] (03CR) 10Brennen Bearnes: [C: 03+1] "This risks being a "reviewer has working mouse" type of +1, but I don't want to be a blocker here. Seems legit to me on skimming through b" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [18:52:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:55:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:56:14] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:56:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:59:17] (03CR) 10Andrew Bogott: [C: 03+2] update codfw1dev rabbitmq01 cname for new cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/919210 (https://phabricator.wikimedia.org/T336236) (owner: 10Andrew Bogott) [19:00:00] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-categories.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:28] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:06:00] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:06:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:07:46] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:08:28] RECOVERY - Check systemd state on wdqs2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:21] (03PS2) 10Andrea Denisse: prometheus: Decommission prometheus3001 in esams [puppet] - 10https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) [19:28:26] (03CR) 10Dzahn: [C: 03+1] prometheus: Decommission prometheus3001 in esams [puppet] - 10https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) (owner: 10Andrea Denisse) [19:28:45] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) 05Open→03In progress [19:33:36] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Decommission prometheus3001 in esams [puppet] - 10https://gerrit.wikimedia.org/r/913249 (https://phabricator.wikimedia.org/T33558) (owner: 10Andrea Denisse) [19:35:20] (03PS1) 10Ebernhardson: wcqs: Configure webproxy for federated 
queries [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) [19:36:02] (03PS1) 10Andrew Bogott: Temporarily mark out refs to cloudrabbit01 in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/919217 (https://phabricator.wikimedia.org/T336236) [19:37:03] (03CR) 10Andrew Bogott: [C: 03+2] Temporarily mark out refs to cloudrabbit01 in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/919217 (https://phabricator.wikimedia.org/T336236) (owner: 10Andrew Bogott) [19:37:24] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41146/console" [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson) [19:39:53] (03CR) 10Ebernhardson: [V: 03+1 C: 04-1] wcqs: Configure webproxy for federated queries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson) [19:41:06] (03PS2) 10Andrea Denisse: prometheus: Decommission prometheus4001 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) [19:42:55] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:43:57] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) Done! I just added you as owner of wikisource.org as described. Also confirmed you are author of https://wikitech.wi... 
[19:44:52] 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) 05In progress→03Resolved Please let me know if everything works as expected. Cheers! [19:47:12] (03CR) 10Andrea Denisse: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: 10Andrea Denisse) [19:47:59] (03PS1) 10Urbanecm: Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919175 (https://phabricator.wikimedia.org/T336300) [19:48:11] (03PS1) 10Urbanecm: Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/919176 (https://phabricator.wikimedia.org/T336300) [19:48:38] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) [19:51:40] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus3001.esams.wment [19:51:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (10Dzahn) a:05roti_WMDE→03Dzahn per T335941#8845514 the NDA has been signed. moving forward [19:51:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) a:05lojo_wmde→03Dzahn per T335941#8845514 the NDA has been signed. 
moving forward [19:52:30] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) out for signature (T335941#8845514) [19:52:45] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [19:53:54] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Dwisehaupt) 05Open→03Resolved Host completed. Set to active in netbox. Closing. [19:54:51] (03PS2) 10Ebernhardson: wcqs: Configure webproxy for federated queries [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) [19:55:29] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [19:55:34] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:56:03] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41147/console" [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson) [19:56:42] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:43] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus3001.esams.wment [20:00:06] brennen and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230511T2000). [20:00:06] bd808 and Urbanecm: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:15] (03PS1) 10Dzahn: admin: add Robert Timm to LDAP-only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/919224 (https://phabricator.wikimedia.org/T336435) [20:02:08] o/ [20:02:50] I can deploy if there are no other takers [20:03:36] ...guess it's me :) [20:05:25] thcipriani: thank you :) [20:05:29] bd808: note that this will only apply to group1 and group0 wikis at the moment (given the state of train) is that fine? [20:05:38] (03PS1) 10Dzahn: admin: add Loren Johnson to LDAP-only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/919225 (https://phabricator.wikimedia.org/T335858) [20:05:43] i'm here, but didn't see the ping [20:05:54] (03CR) 10Ebernhardson: [V: 03+1] wcqs: Configure webproxy for federated queries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson) [20:05:58] i can self-deploy, or let you deploy my backports thcipriani, up2you [20:06:12] thcipriani: maybe... remind me which group metawiki is in? [20:06:20] group1 [20:06:27] meta, ... ^ :D [20:06:42] urbanecm: I can take care of bd808 and then get out of your way :) [20:06:47] perfect. I need to fix meta :) [20:06:54] sounds good to me! waiting for the ping then. [20:07:00] cool, will do [20:07:07] bd808: alright, going ahead :) [20:07:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) (owner: 10BryanDavis) [20:11:13] (03CR) 10Dzahn: "oh man.. my typo. thanks so much for fixing it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/919161 (https://phabricator.wikimedia.org/T336524) (owner: 10Hashar) [20:11:43] (03PS1) 10Dzahn: acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts [puppet] - 10https://gerrit.wikimedia.org/r/919226 (https://phabricator.wikimedia.org/T326368) [20:11:59] (03PS2) 10Dzahn: acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts [puppet] - 10https://gerrit.wikimedia.org/r/919226 (https://phabricator.wikimedia.org/T326368) [20:12:16] (03Merged) 10jenkins-bot: Allow http://localhost callback URL [extensions/OAuth] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919168 (https://phabricator.wikimedia.org/T299737) (owner: 10BryanDavis) [20:12:48] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:919168|Allow http://localhost callback URL (T299737)]] [20:12:52] T299737: Warn users during OAuth 2 app creation when they provide a callback URL that's just the domain - https://phabricator.wikimedia.org/T299737 [20:14:14] (03CR) 10Dzahn: [C: 03+1] prometheus: Decommission prometheus4001 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) (owner: 10Andrea Denisse) [20:14:23] !log thcipriani@deploy1002 bd808 and thcipriani: Backport for [[gerrit:919168|Allow http://localhost callback URL (T299737)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:14:40] bd808: ^ your change is on mwdebug machines, check please [20:14:43] (03CR) 10Dzahn: [C: 03+1] prometheus: Decommission prometheus5001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/913251 (https://phabricator.wikimedia.org/T335587) (owner: 10Andrea Denisse) [20:14:55] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus3001.esams.wment [20:15:18] thcipriani: \o/ it works [20:15:54] (03CR) 10Dzahn: "Hmm.. 
so for drmrs you set it to "absent" but for ulsfo and eqsin you just remove the code without setting to absent?" [puppet] - 10https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: 10Andrea Denisse) [20:16:33] bd808: cool, thanks for checking, going live [20:17:18] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) Thank you @KFrancis ! Uploaded code changes for the 2 users that are done. In progress in subtasks :) [20:17:44] !log manually remove prometheus3001.esams.wmnet from the ganeti master after a failed step in the decommission cookbook. [20:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:55] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:39] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:19:17] (03CR) 10Andrea Denisse: prometheus: Decommission prometheus6001 in drmrs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: 10Andrea Denisse) [20:20:46] (03CR) 10Dzahn: [V: 03+1] "makes sense! compiler shows us the actual file names:" [puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:20:50] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3001.esams.wment decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [20:21:28] (03CR) 10Dzahn: [C: 03+1] "gotcha! +1, but you don't want to just set to absent for all 3 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/913256 (https://phabricator.wikimedia.org/T335588) (owner: 10Andrea Denisse) [20:21:40] (03PS21) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [20:21:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gerrit: remove leftover absent http config [puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:21:59] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus3001.esams.wment decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [20:21:59] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:22:00] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus3001.esams.wment [20:22:25] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:919168|Allow http://localhost callback URL (T299737)]] (duration: 09m 37s) [20:22:29] T299737: Warn users during OAuth 2 app creation when they provide a callback URL that's just the domain - https://phabricator.wikimedia.org/T299737 [20:22:55] (JobUnavailable) firing: Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:22:59] (03PS22) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [20:23:22] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:24:29] (03CR) 10Dzahn: [C: 03+2] "thanks to sukhe for merging the fix and 
to you for reporting!" [puppet] - 10https://gerrit.wikimedia.org/r/909796 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [20:24:31] bd808: should be live now! [20:24:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudswift1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:44] urbanecm: you're all clear for your patches [20:24:50] thanks! [20:25:00] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [20:25:20] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919175 (https://phabricator.wikimedia.org/T336300) (owner: 10Urbanecm) [20:25:26] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/919176 (https://phabricator.wikimedia.org/T336300) (owner: 10Urbanecm) [20:25:34] (03CR) 10Dzahn: "you are absolutely right! just like for the annual change. thanks! will amend" [puppet] - 10https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: 10Dzahn) [20:26:37] 10ops-knams, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus3001 - https://phabricator.wikimedia.org/T335584 (10andrea.denisse) 05In progress→03Open [20:26:47] thcipriani: thanks. verified that it worked post deploy too because superstitious ;) [20:27:12] lucaswerkmeister: you should be able to register your localhost OAuth2 consumer now. 
[20:28:41] (03PS3) 10Urbanecm: [Growth] Remove config variables provided by extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 [20:29:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 (owner: 10Urbanecm) [20:29:20] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Decommission prometheus4001 in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/913250 (https://phabricator.wikimedia.org/T335585) (owner: 10Andrea Denisse) [20:29:49] (03Merged) 10jenkins-bot: [Growth] Remove config variables provided by extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 (owner: 10Urbanecm) [20:30:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:912310|[Growth] Remove config variables provided by extension]] [20:31:59] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:912310|[Growth] Remove config variables provided by extension]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:32:25] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus4001.ulsfo.wment [20:32:40] gah, sorry, i got sidelined by my going out in the rain impulses [20:33:16] bd808: sorry, I was distracted, I’ll try it now [20:33:42] now I got “OAuth 2 apps must use an exact callback URL. 
A bare domain is probably not what you want.” but that was a warning I could skip, yay \o/ [20:36:28] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:37:11] (03PS23) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [20:37:38] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:37:39] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus4001.ulsfo.wment [20:38:40] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [20:42:55] (JobUnavailable) firing: (2) Reduced availability for job envoy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:44:18] (03Merged) 10jenkins-bot: Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/919175 (https://phabricator.wikimedia.org/T336300) (owner: 10Urbanecm) [20:44:38] urbanecm: you around? [20:44:42] yeah [20:44:49] what's up? [20:45:07] (03Merged) 10jenkins-bot: Personalized praise: Do not suggest users with Homepage disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/919176 (https://phabricator.wikimedia.org/T336300) (owner: 10Urbanecm) [20:45:20] urbanecm: see -releng. I think a beta scap error might be you! [20:46:39] ty [20:46:40] looking [20:47:02] it works in prod at least [20:48:57] (03PS1) 10Urbanecm: Fix lookup of wgGERestbaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919235 [20:48:59] RhinosF1: ^^ should fix this [20:49:49] Amazing! 
[20:50:20] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:912310|[Growth] Remove config variables provided by extension]] (duration: 20m 04s) [20:50:35] (03CR) 10RhinosF1: [C: 03+1] "Thanks for the quick look!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919235 (owner: 10Urbanecm) [20:50:59] 10SRE, 10Infrastructure-Foundations, 10netbox: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (10Jclark-ctr) [20:51:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:919175|Personalized praise: Do not suggest users with Homepage disabled (T336300)]], [[gerrit:919176|Personalized praise: Do not suggest users with Homepage disabled (T336300)]] [20:51:15] T336300: Personalized praise includes users with homepage disabled - https://phabricator.wikimedia.org/T336300 [20:51:15] 10SRE, 10Infrastructure-Foundations, 10netbox: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (10Jclark-ctr) [20:51:25] (03CR) 10Brennen Bearnes: "Confirmed at https://gitlab.devtools.wmcloud.org/ that users will see a message about needing approval from an administrator, and there's " [puppet] - 10https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) (owner: 10Brennen Bearnes) [20:52:02] (03CR) 10Urbanecm: [C: 03+2] Fix lookup of wgGERestbaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919235 (owner: 10Urbanecm) [20:52:42] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:919175|Personalized praise: Do not suggest users with Homepage disabled (T336300)]], [[gerrit:919176|Personalized praise: Do not suggest users with Homepage disabled (T336300)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:52:48] (03PS2) 10Brennen Bearnes: gitlab: block auto created users [puppet] - 10https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) [20:52:50] (03Merged) 
10jenkins-bot: Fix lookup of wgGERestbaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919235 (owner: 10Urbanecm) [20:53:37] (03PS3) 10Brennen Bearnes: gitlab: block auto created users [puppet] - 10https://gerrit.wikimedia.org/r/919211 (https://phabricator.wikimedia.org/T334958) [20:55:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Papaul) @Jclark-ctr i check again those servers from the switch side see below. Those are using NON-JNPR compatible cables. that i... [20:57:49] (03CR) 10Eevans: [C: 03+2] hierdata: add swift (thanos) mw-event-enrichment account [puppet] - 10https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) (owner: 10Eevans) [20:58:41] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:919175|Personalized praise: Do not suggest users with Homepage disabled (T336300)]], [[gerrit:919176|Personalized praise: Do not suggest users with Homepage disabled (T336300)]] (duration: 07m 30s) [20:58:46] T336300: Personalized praise includes users with homepage disabled - https://phabricator.wikimedia.org/T336300 [21:01:35] okay, i think that's all for today. 
[21:02:27] (03PS24) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [21:03:23] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [21:03:28] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mw-event-enrichment:prod.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:16] !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [21:06:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1225.eqiad.wmnet [21:07:12] !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [21:07:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts db1225.eqiad.wmnet [21:07:29] !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [21:08:28] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:02] (03CR) 10LSobanski: [C: 03+1] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts [puppet] - 10https://gerrit.wikimedia.org/r/919226 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [21:10:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [21:13:47] (03PS2) 10Urbanecm: [Growth] Add mediawiki.mentor_dashboard.interaction [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/918500 (https://phabricator.wikimedia.org/T325117) [21:14:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Jclark-ctr) @Marostegui i have cleared logs again. If error returns I would like to perform flea power drain. Otherwise it looks like serv... [21:16:46] (03PS16) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [21:18:09] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) Part has been received. @jcrespo I can swap this part out anytime in the next two hours or anytime tomorrow after 13:00 UTC tracking: 398150002935 [21:19:53] (03CR) 10David Caro: "@raymond I think you told me something at some point about this review? do you remember? I think it was something about deleting account o" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [21:24:37] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) a:03Eevans Ok, this is setup and has been tested. I acreated the t... [21:27:18] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [21:27:28] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Jhancock.wm) @jcrespo this part has been received. Is it currently safe to replace this DIMM? if not I can take care of it tomorrow after 13:00 UTC (or in... 
[21:30:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) Replaced both cables. they were newer wave2wave dac cables [21:30:35] (03PS6) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [21:31:30] (03PS3) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [21:31:32] (03CR) 10David Caro: maintain_dbusers: add prometheus stats (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [21:31:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41150/console" [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [21:34:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41151/console" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [21:39:53] (03PS25) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [21:40:41] (03PS13) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 [21:40:43] (03PS5) 10Jbond: firewall: Remove kafka_brokers_analytics [puppet] - 10https://gerrit.wikimedia.org/r/919059 [21:40:45] (03PS6) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [21:40:47] (03PS6) 10Jbond: firewall: add basic firewall class [puppet] - 
10https://gerrit.wikimedia.org/r/919061 [21:40:49] (03PS7) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [21:41:08] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [21:42:20] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) Apologies for the ridiculous delay. I have Wikisource search console access now and am looking at it. [21:43:08] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:43:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:44:23] (03CR) 10CI reject: [V: 04-1] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [21:45:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudswift1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:45:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudswift1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:46:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Papaul) ` Xcvr 31 REV 01 740-030077 H70824500300 SFP+-10G-CU3M Xcvr 5 REV 01 740-030077 G1807123036-1... 
[21:51:22] (03PS7) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [21:51:24] (03PS7) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [21:51:26] (03PS8) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [21:52:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:31] (03PS8) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [21:53:34] (03PS8) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [21:53:35] (03PS9) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [21:54:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41153/console" [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [21:56:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudswift1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:02:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes 
(http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:07:01] (CR) Andrea Denisse: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/919225 (https://phabricator.wikimedia.org/T335858) (owner: Dzahn)
[22:07:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:07:13] (CR) Andrea Denisse: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/919224 (https://phabricator.wikimedia.org/T336435) (owner: Dzahn)
[22:07:35] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[22:08:01] (CR) Andrea Denisse: [C: +1] "LGTM, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[22:08:53] (PS1) Dwisehaupt: Add frav1003 dns and rdns entries [dns] - https://gerrit.wikimedia.org/r/919239 (https://phabricator.wikimedia.org/T334400)
[22:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:22:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[22:23:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudswift1001']
[22:25:18] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[22:27:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:41:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudswift1001']
[22:42:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:46:28] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudswift1002']
[22:51:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:52:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:52:31] looking
[22:56:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:56:31] that's likely to keep flapping
[22:58:42] high load on the videoscalers, starting 19:00-19:30 or so, could easily just be a big pile of encoding jobs, still digging
[23:01:42] yeah it's just videoscaler CPUs maxed out on ffmpeg -- I could slide another couple of jobrunner machines over to videoscaling but I don't think it'd be enough to make an appreciable difference in completion time
[23:02:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:02:47] so the question is why are the *jobrunners* also failing probes
[23:07:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:07:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudswift1002']
[23:12:13] *oh* they're all double-pooled right now, every host is a jobrunner and a videoscaler -- man, I keep losing track of which way we have that :)
[23:12:30] okay, everything makes sense again, and I'm going to split the cluster to chug through these videos without impacting other jobs
[23:19:23] I'm going to leave the bottom 7 hosts as jobrunners, top 10 as videoscalers, and see where that gets us
[23:19:36] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul)
[23:19:41] for later: they're all weighted equally, but shouldn't be, the CPU loads are super uneven
[23:22:05] !log rzl@cumin2002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw14(3[789]|4[056]57)\.eqiad\.wmnet
[23:22:41] !log rzl@cumin2002 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw14(5[89]|6[016789]|9[45])\.eqiad\.wmnet
[23:24:13] !log rzl@cumin2002 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw14(3[789]|4[056]|57)\.eqiad\.wmnet
[23:24:41] first regex was missing a | so only matched 3 hosts instead of the intended 7
[23:26:12] now killing ffmpeg on the hosts depooled as videoscalers
[23:27:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:27:25] (CR) Tim Starling: [C: +1] "Looks good to me. I didn't know about that !(pattern) feature -- eventually I found the documentation in man fnmatch."
[puppet] - https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: Herron)
[23:28:57] done
[23:32:40] rzl: alerts is complaining about degraded: The following units failed: mediawiki_job_purge_parsercache_pc2.service as well
[23:32:48] (CR) Dzahn: [C: +2] admin: add Robert Timm to LDAP-only admins (wmde, nda) [puppet] - https://gerrit.wikimedia.org/r/919224 (https://phabricator.wikimedia.org/T336435) (owner: Dzahn)
[23:33:15] not sure if it's just now or has been failing in the last days
[23:33:21] hauskater: unrelated but I can take a look in a few :)
[23:33:32] sure, thanks :)
[23:33:57] (CR) Dzahn: [C: +2] admin: add Loren Johnson to LDAP-only admins (wmde, nda) [puppet] - https://gerrit.wikimedia.org/r/919225 (https://phabricator.wikimedia.org/T335858) (owner: Dzahn)
[23:34:03] (PS2) Dzahn: admin: add Loren Johnson to LDAP-only admins (wmde, nda) [puppet] - https://gerrit.wikimedia.org/r/919225 (https://phabricator.wikimedia.org/T335858)
[23:39:13] !log LDAP - added uid roti to groups wmde and nda T336435
[23:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:17] T336435: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435
[23:39:30] !log LDAP - added uid lorenjohnson to groups wmde nda T335858
[23:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:34] T335858: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858
[23:39:46] wikibugs is suspiciously quiet, but filed T336554 for the followup
[23:39:47] T336554: Repool jobrunners and videoscalers - https://phabricator.wikimedia.org/T336554
[23:41:03] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (Dzahn) In progress→Resolved @lojo_wmde You have been added to groups "wmde" and "nda" as requested. Everything should work now.
Let us know if there are any issues.
[23:42:00] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (Dzahn)
[23:42:04] okay, the videoscalers will need to keep chewing for however long it takes, but that's taken care of
[23:42:07] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) @Andrew yes we can still do the os install part and resolve this task when we will will be ready to do network changes we c...
[23:42:18] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (Dzahn) Robert Timm and Loren Johnson have been added to the groups as requested.
[23:42:23] as incident coordinator and sole responder I am unanimously declaring the incident resolved
[23:42:39] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (Dzahn) In progress→Resolved @roti_WMDE You have been added to groups "wmde" and "nda" as requested. Everything should work now. Let us know if you run in...
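Note on the two videoscaler selectors logged at 23:22:05 and 23:24:13: they differ by a single `|`, and a small sketch can show why the first one matched only 3 hosts. The host list below is hypothetical (a contiguous range made up for illustration, not the real pool), and treating the selector as a full-match regular expression is an assumption about how conftool evaluates `name=`.

```python
import re

# Hypothetical host list for illustration only; the real pool comes from conftool.
hosts = [f"mw{n}.eqiad.wmnet" for n in range(1430, 1460)]

# Selector as first logged: "4[056]57" is one alternative, so it can only
# match names like mw144057, which are not in this list.
first = re.compile(r"mw14(3[789]|4[056]57)\.eqiad\.wmnet")

# Selector as re-logged: "4[056]" and "57" are separate alternatives.
fixed = re.compile(r"mw14(3[789]|4[056]|57)\.eqiad\.wmnet")


def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if pattern.fullmatch(name)]


print(len(matching(first, hosts)), matching(first, hosts))
print(len(matching(fixed, hosts)), matching(fixed, hosts))
```

Dry-running a selector against a known host list like this, before applying `set/pooled=no`, is one way to catch a count mismatch early.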
[23:45:14] (PS2) Dzahn: microsites: change rewrite rule for https://transparency.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301)
[23:46:38] (CR) Dzahn: [C: +2] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts [puppet] - https://gerrit.wikimedia.org/r/919226 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[23:48:18] hauskater: okay, it looks like that service stoppage is related to the pc2 master failover from 06:00 UTC (about 16h ago) -- I'll ping the folks involved and make sure nothing got missed, but just in case it's intended I'm not going to make any changes :) thanks for raising it
[23:48:39] ack :)
[23:49:12] (due to time zone spread, don't expect any further updates before UTC morning)
[23:50:53] no probs, I'm 'downtiming' myself in 5
[23:51:02] (sleep time-9
[23:51:05] )*
[23:51:35] 👍
[23:53:09] (CR) Dzahn: [C: +2] "on acmechief hosts this allowed that a gerrit server could request this. but gerrit1001 currently has disabled puppet, so it's expected th" [puppet] - https://gerrit.wikimedia.org/r/919226 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)