[00:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928909 [00:39:31] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928909 (owner: 10TrainBranchBot) [00:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928909 (owner: 10TrainBranchBot) [01:30:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:14] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:00] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:54] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:27] !log ladsgroup@mwmaint1002:~$ mwscript maintenance/storage/moveToExternal.php --wiki=enwiki --end 32000000 --undo /home/ladsgroup/T128151.undo.sql --iconv DB cluster27 (T128151) [02:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:31] T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151 [03:00:08] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:28] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:24] PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:56] I bring back db1139, I can ssh into it but needs a data check and such [03:28:22] RECOVERY - MariaDB read only s2 on db1139 is OK: Version 10.4.25-MariaDB, Uptime 36s, read_only: True, event_scheduler: True, 20.53 QPS, connection latency: 0.003834s, query latency: 0.000428s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [03:28:24] RECOVERY - MariaDB read only s1 on db1139 is OK: Version 10.4.25-MariaDB, Uptime 96s, read_only: True, event_scheduler: True, 1798.92 QPS, connection latency: 0.004415s, query latency: 0.000292s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [03:28:48] RECOVERY - MariaDB Replica SQL: s1 on db1139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:29:10] RECOVERY - MariaDB Replica IO: s1 on db1139 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:29:14] RECOVERY - MariaDB Replica SQL: s2 on db1139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:29:16] RECOVERY - mysqld 
processes on db1139 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [03:29:50] RECOVERY - MariaDB Replica IO: s2 on db1139 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:45:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_netflow.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [04:12:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [04:15:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [04:16:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [04:32:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [04:32:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [05:10:10] RECOVERY - MariaDB Replica Lag: s2 on db1139 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:17:33] (03PS1) 10Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) [05:23:07] (03PS1) 
10KartikMistry: Update cxserver to 2023-06-12-051618-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929036 (https://phabricator.wikimedia.org/T338146) [05:33:36] (03PS1) 10KartikMistry: Update MinT to 2023-06-10-124931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905) [05:36:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10phaultfinder) [05:41:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:12:38] * kart_ updating MinT now.. 
[06:12:56] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-06-10-124931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905) (owner: 10KartikMistry) [06:14:04] (03Merged) 10jenkins-bot: Update MinT to 2023-06-10-124931-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905) (owner: 10KartikMistry) [06:16:43] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:25:36] Service update seems to be taking more time than usual :/ [06:27:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:01] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:36:05] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [06:37:33] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/929019 (owner: 10Majavah) [06:41:44] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:44:12] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org [06:45:13] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:48:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:50:57] !log upgrading bookworm pilot installations to final/released bookworm package state T330495 [06:54:26] !log Updated MinT to 2023-06-10-124931-production (T284905) [06:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:30] T284905: Softcatalà translator - requested for integration as an MT service for CX - https://phabricator.wikimedia.org/T284905 [07:00:05] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T0700) [07:00:05] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:18] o/ [07:01:11] !log upgrading bookworm netboot images to final/released bookworm images T330495 [07:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:16] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [07:01:19] Superpes: around? [07:03:22] (03PS3) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) [07:04:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41666/console" [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [07:09:26] (03PS6) 10Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - 10https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [07:09:28] (03PS4) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) [07:10:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) 05Open→03Resolved All done, bookworm has been released (https://lists.debian.org/debian-announce/2023/msg00001.html) and our installer/base... 
[07:10:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41667/console" [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [07:27:30] (03PS7) 10Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - 10https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [07:27:32] (03PS5) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) [07:28:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41668/console" [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:08:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/928644 (owner: 10JHathaway) [08:13:07] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) >>! In T330884#8863775, @MoritzMuehlenhoff wrote: > @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage... [08:14:24] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ayounsi) Awesome! In place is fine. [08:19:52] (03CR) 10Btullis: [C: 03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:23:47] (03PS1) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:25:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929015 (owner: 10Majavah) [08:25:57] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) [08:26:05] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:26:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [08:28:24] (03CR) 10Elukey: [C: 03+2] profile::cache::kafka::certificate: fix client PKI config [puppet] - 10https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:29:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929018 (owner: 10Majavah) [08:29:59] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:30:18] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [08:30:29] @Taavi Sorry I had a sudden commitment and was not able to be present! 
Will reschedule them for another window :) [08:30:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [08:30:38] jouncebot: nowandnext [08:30:38] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [08:30:38] In 1 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1000) [08:30:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [08:30:46] Superpes: or we can just push it out now if you have time [08:31:03] taavi Yep, if you can, many thanks :) [08:32:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [08:32:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928992 (https://phabricator.wikimedia.org/T338621) (owner: 10Superpes15) [08:32:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:33:03] (03Merged) 10jenkins-bot: [knwiki] Add a temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [08:33:06] (03Merged) 10jenkins-bot: [lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928992 (https://phabricator.wikimedia.org/T338621) (owner: 10Superpes15) [08:33:42] !log taavi@deploy1002 Started scap: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the 
Portal talk translation (T338621)]] [08:33:47] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [08:33:48] T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621 [08:36:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Clement_Goubert) p:05Triage→03Medium a:03Jclark-ctr [08:38:09] (03PS2) 10Majavah: P:toolforge::apt_pinning: remove unused pinnings [puppet] - 10https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494) [08:38:11] (03PS1) 10Majavah: P:toolforge::apt_pinning: remove sssd pinning [puppet] - 10https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494) [08:38:28] (03PS3) 10Jbond: dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:39:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow4002.ulsfo.wmnet with OS bookworm [08:39:34] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm [08:40:49] Superpes: looks like the container build is again taking ages :/ I'll ping you when they are available for testing [08:41:05] (03CR) 10CI reject: [V: 04-1] dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:42:36] (03PS4) 10Jbond: dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:42:38] (03PS1) 10Jbond: wmflib.is_container: add mocked fact [puppet] - 
10https://gerrit.wikimedia.org/r/929161 (https://phabricator.wikimedia.org/T337972) [08:42:46] !log taavi@deploy1002 superpes and taavi: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation (T338621)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:42:51] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [08:42:51] T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621 [08:43:20] Testing [08:44:07] Everything is fine :) Thanks @taavi! [08:44:33] syncing! [08:45:33] (03CR) 10Jbond: [C: 03+1] "lgtm, CI error was related to the missing mocked fact, see preceding CR (i think i saw you add this fact some where else so feel free to r" [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:48:08] (03CR) 10Jbond: "nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:49:47] (03PS1) 10Btullis: Deploy the new eventgate-wikimedia container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) [08:50:27] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation (T338621)]] (duration: 16m 44s) [08:50:32] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [08:50:32] T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621 [08:51:22] (03PS1) 10Muehlenhoff: Add Bookworm to debdeploy config and remove Stretch [puppet] - 
10https://gerrit.wikimedia.org/r/929165 [08:52:16] Thanks taavi and Superpes , logo working on wiki now [08:56:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [08:56:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [08:56:17] Thanks taavi :) [08:56:43] (03CR) 10Jbond: "LGTM but it would be good to also include https://gerrit.wikimedia.org/r/c/operations/puppet/+/929161 in this patch" [puppet] - 10https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:56:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:57:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow4002.ulsfo.wmnet with reason: host reimage [08:58:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::apt_pinning: remove unused pinnings [puppet] - 10https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [09:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:08] (03PS2) 10Arturo Borrero Gonzalez: P:toolforge::apt_pinning: remove sssd pinning [puppet] - 10https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [09:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow4002.ulsfo.wmnet with reason: host reimage [09:02:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:04:13] (03CR) 10Jbond: [C: 
03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929165 (owner: 10Muehlenhoff) [09:04:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::apt_pinning: remove sssd pinning [puppet] - 10https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [09:04:27] (03CR) 10Muehlenhoff: [C: 03+2] Add Bookworm to debdeploy config and remove Stretch [puppet] - 10https://gerrit.wikimedia.org/r/929165 (owner: 10Muehlenhoff) [09:05:09] moritzm: we just hit submit at the same time @ gerrit :-P [09:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:33] moritzm: ok to merge your patch? Add Bookworm to debdeploy config and remove Stretch (051e74420a) [09:06:48] yeah, I was trying to puppet-merge it, but you had the lock already :-) [09:07:02] cool, merging [09:08:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptly::client: configure unattended-upgrades [puppet] - 10https://gerrit.wikimedia.org/r/929018 (owner: 10Majavah) [09:09:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:bookworm: prepare apt repo for toolforge [puppet] - 10https://gerrit.wikimedia.org/r/929015 (owner: 10Majavah) [09:10:28] (03CR) 10Elukey: [C: 03+1] "LGTM, we can also limit the deployment to event-gate main IIUC, but rolling it to all the instances will be more consistent. 
As you prefer" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: 10Btullis) [09:13:09] (03PS1) 10Muehlenhoff: Build Bookworm base image [puppet] - 10https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560) [09:19:24] (03CR) 10Jbond: [C: 03+1] "lgtm obviously depends on https://gerrit.wikimedia.org/r/c/operations/puppet/+/928903" [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [09:25:48] (03CR) 10Jbond: [C: 03+2] P:puppet::agent: add script to locate unmanaged files [puppet] - 10https://gerrit.wikimedia.org/r/928857 (owner: 10Majavah) [09:26:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [09:26:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [09:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [09:33:50] (03CR) 10Btullis: [C: 03+2] Deploy the new eventgate-wikimedia container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: 10Btullis) [09:34:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow4002.ulsfo.wmnet with OS bookworm [09:34:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:50] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: 
Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm completed: - netflow4002 (**PASS**) -... [09:34:58] (03Merged) 10jenkins-bot: Deploy the new eventgate-wikimedia container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: 10Btullis) [09:38:33] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [09:39:26] (03CR) 10Majavah: [C: 03+1] "Looks correct to me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560) (owner: 10Muehlenhoff) [09:44:51] (03PS1) 10Elukey: Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - 10https://gerrit.wikimedia.org/r/928934 [09:45:30] (03CR) 10Elukey: [C: 03+2] Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - 10https://gerrit.wikimedia.org/r/928934 (owner: 10Elukey) [09:48:28] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [09:56:06] (03PS1) 10Slyngshede: Enable password reset and fix wording. [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 [09:57:41] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [09:58:53] (03PS2) 10Slyngshede: Enable password reset and fix wording. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/929171 [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1000) [10:07:25] (03CR) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [10:07:41] (03PS2) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [10:08:18] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [10:09:59] (03CR) 10Jbond: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:13:58] (03CR) 10Jbond: [C: 03+1] "LGTM, not sure what the CI issue is i rebased locally fine, tried to push and got a no changes error. its also nice to see such a saving " [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:15:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:31:25] (03CR) 10Jbond: "Thanks for the feedback but its probably best to move this to a task" [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 
(owner: 10Ssingh) [10:40:54] !log mwscript maintenance/storage/moveToExternal.php --wiki=enwiki --start 31000000 --end 110000000 --undo /home/ladsgroup/T128151.undo.sql --iconv DB cluster27 (T128151) [10:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151 [10:42:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow5002.eqsin.wmnet with OS bookworm [10:43:00] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm [10:46:36] (03PS1) 10Ayounsi: Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 [10:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:49:09] (03PS1) 10Ayounsi: Remove call to now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 [10:51:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi) [10:51:18] (03CR) 10Jbond: [C: 03+1] Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi) [10:52:01] (03CR) 10Muehlenhoff: [C: 03+2] Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:52:46] (03CR) 10Ayounsi: [C: 03+2] Remove call to 
now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi) [10:53:23] (03Merged) 10jenkins-bot: Remove call to now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi) [10:56:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:56:10] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:56:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:56:44] (03PS1) 10Slyngshede: Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 [10:56:59] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:57:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:59:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [11:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:30] (03PS2) 10Slyngshede: Captcha: Allow users to request a new captcha. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/929180 [11:07:17] (03CR) 10Ayounsi: [C: 03+2] Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi) [11:07:59] (03Merged) 10jenkins-bot: Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi) [11:08:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: codfw1dev: use new recursor address [puppet] - 10https://gerrit.wikimedia.org/r/928589 (https://phabricator.wikimedia.org/T338433) (owner: 10Arturo Borrero Gonzalez) [11:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:45] (03PS4) 10Jbond: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [11:17:04] (03CR) 10Muehlenhoff: [C: 03+2] Build Bookworm base image [puppet] - 10https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560) (owner: 10Muehlenhoff) [11:18:37] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [11:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:20] (03PS1) 10Ayounsi: Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201) [11:23:59] 
!log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow5002.eqsin.wmnet with reason: host reimage [11:25:37] (03PS2) 10Ayounsi: Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201) [11:27:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow5002.eqsin.wmnet with reason: host reimage [11:28:46] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [11:29:06] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [11:30:38] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [11:31:35] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [11:32:06] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [11:32:36] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [11:32:38] (03PS2) 10Stevemunene: analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) [11:40:04] (03PS2) 10Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) [11:47:31] jouncebot: nowandnexr [11:47:32] jouncebot: nowandnext [11:47:33] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [11:47:33] In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300) [11:47:43] cooool [11:47:48] (03PS3) 10Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) [11:47:51] (03CR) 10Ladsgroup: [C: 03+2] Set small wikis to read new for 
externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [11:48:49] (03Merged) 10jenkins-bot: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [11:49:09] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]] [11:49:13] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [11:49:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [11:50:33] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:51:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow5002.eqsin.wmnet with OS bookworm [11:51:13] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm completed: - netflow5002 (**PASS**) -... 
[11:51:29] 10SRE, 10Infrastructure-Foundations: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (10MoritzMuehlenhoff) [11:51:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:59] (03PS1) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:52:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:22] (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:53:37] (03PS1) 10Ayounsi: Remove cloudsw-loopback.pol (folded into common-loopback) [homer/public] - 10https://gerrit.wikimedia.org/r/929316 [11:54:02] (03PS2) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:55:08] (03PS1) 10David Martin: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) [11:55:22] (03CR) 10Btullis: [C: 03+1] analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [11:56:50] (03PS3) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 
(https://phabricator.wikimedia.org/T336497) [11:57:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:31] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]] (duration: 12m 22s) [12:01:35] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [12:05:56] (03PS2) 10Daimona Eaytoy: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) [12:08:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:08:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:09:18] (03PS4) 10Daimona Eaytoy: Remove references to $wgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 [12:10:42] (03PS5) 10Daimona Eaytoy: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 [12:11:53] (03PS1) 10Daimona Eaytoy: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 [12:20:04] (03CR) 10Ladsgroup: [C: 03+1] Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy) [12:20:44] (03CR) 10Daimona Eaytoy: "(Note, I've scheduled this and the other patch for today's late backport window)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 
10Daimona Eaytoy) [12:22:02] (03CR) 10Ladsgroup: [C: 03+1] Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [12:27:19] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Write a cookbook to roll reboot cache hosts - https://phabricator.wikimedia.org/T338783 (10Volans) p:05Triage→03Medium [12:28:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow6001.drmrs.wmnet with OS bookworm [12:28:17] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm [12:29:03] (03CR) 10Stevemunene: [C: 03+2] analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [12:29:13] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10Ottomata) > but nobody has complained about any specific errors IIRC< That's because th... [12:31:10] (03CR) 10Muehlenhoff: Captcha: Allow users to request a new captcha. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede) [12:34:50] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) 05Resolved→03Open > Now that we can specify a port range, we should, and we... 
[12:34:56] (03PS1) 10Jbond: homer: update tests for graphQL [software/homer] - 10https://gerrit.wikimedia.org/r/929324 [12:34:57] (03CR) 10CI reject: [V: 04-1] homer: update tests for graphQL [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [12:36:15] (03CR) 10Muehlenhoff: "Few more nits/typos/comments" [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede) [12:36:23] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:36:57] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:37:28] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10MoritzMuehlenhoff) >>! In T111433#8922642, @BTullis wrote: >> Now that we can specify a... 
[12:38:12] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) [12:38:31] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:45:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [12:47:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow6001.drmrs.wmnet with reason: host reimage [12:47:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) Server will not boot Unable to pull tsr report. Troubleshooted steps already perfromed Flea power Drain Minimum configuration diabling power button. led light status on ma... 
[12:49:13] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:50:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow6001.drmrs.wmnet with reason: host reimage [12:52:41] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:02] (03PS1) 10Ayounsi: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) [12:55:10] (03CR) 10CI reject: [V: 04-1] Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [12:59:33] (03PS1) 10Ayounsi: Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) [12:59:40] (03CR) 10CI reject: [V: 04-1] Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [13:00:07] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300). [13:00:07] mfossati, duesen, Sohom_Datta, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:27] o/ I'm around but would prefer let someone else deploy [13:00:32] o/ [13:00:35] o/ [13:00:36] p/ [13:00:36] I can deploy [13:00:38] (03PS1) 10Arturo Borrero Gonzalez: cloud: cleanup labsdnsconfig usage [puppet] - 10https://gerrit.wikimedia.org/r/929334 [13:00:52] Lucas_WMDE: cool, was gonna ask if you could [13:00:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Dwisehaupt anything else we can do to help with this task? [13:01:02] Lucas_WMDE: cool. I'm here. [13:01:14] Amir1: are you around to keep an eye on x2 as well? [13:01:37] looks like no backports, that saves CI time ^^ [13:01:41] I'm around but for half an hour only [13:01:41] ~~Oops! All Backports~~ [13:01:52] let’s start with duesen then? [13:02:24] hi folks, here I am! [13:02:39] Lucas_WMDE: i'm ready whenever [13:02:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10MoritzMuehlenhoff) The driver is is simply not present in the Linux kernel present in Buster, so the problem isn't in the Buster installer per se :-)... [13:02:48] (03PS2) 10Lucas Werkmeister (WMDE): Switch VisualEditor to not use RESTbase on English Wikipedia. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:03:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:03:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10Papaul) @ssingh any update on this? [13:03:21] * Lucas_WMDE takes a look at the other changes [13:03:51] (03Merged) 10jenkins-bot: Switch VisualEditor to not use RESTbase on English Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:04:08] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]] [13:04:12] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:05:28] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and daniel: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:05:57] duesen: anything to test on mwdebug? [13:06:31] (03CR) 10Samtar: [C: 03+1] "surprised this is valid, but the wisdom of S.O. says it is..! 
🤷‍♀️" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta) [13:06:33] Lucas_WMDE: on it [13:07:06] ok [13:07:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, though I can’t help but notice that all pages except one are linked to Q4618557 (the idwiki one is linked to Q6618850 instead), so I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [13:07:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) Hi @papaul: We are all good from dc-ops side, this is on Traffic now. We wanted to get a few NTP changes out of the way before reimaging the next batch and therefore it's blocke... [13:08:15] Sohom_Datta: at least https://gerrit.wikimedia.org/r/929000 will stop that mildly annoying warning when developing.. [13:09:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow6001.drmrs.wmnet with OS bookworm [13:09:20] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm completed: - netflow6001 (**PASS**) -... [13:09:30] Lucas_WMDE: looks good [13:09:32] (03CR) 10Lucas Werkmeister (WMDE): "Not sure I feel comfortable deploying this tbh… who’s normally responsible for the CSP? Security team?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta) [13:09:35] ok, syncing [13:09:38] moritzm: did you trigger a manual build of the bullseye image? [13:09:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [13:09:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [13:10:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster [13:10:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... [13:11:53] taavi: you mean bookworm? 
not yet, do you need it on short notice, then I can kick it off manually [13:12:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [13:12:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [13:13:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster [13:13:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... [13:13:09] moritzm: yeah, bookworm. I'm not in a particular hurry, but I'd also prefer not to wait for a week for it as the timer runs every Sunday [13:14:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [13:14:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [13:14:37] (03CR) 10Sohom Datta: Add localhost:* to the beta wiki CSP (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta) [13:14:59] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]] (duration: 10m 51s) [13:15:03] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:15:34] taavi: sure thing, I've kicked it off manually now [13:15:35] (03PS1) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/929335 (https://phabricator.wikimedia.org/T328899) [13:15:41] thank you! [13:17:46] duesen, Amir1: deployed [13:17:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10Papaul) @ssingh thanks [13:18:31] also, Gerrit feels super slow whenever I open a new tab o_O [13:18:39] (but moving around in an existing gerrit tab is fine) [13:18:52] (03PS2) 10Lucas Werkmeister (WMDE): ImageSuggestions: add help link to 4 new languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [13:19:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [13:19:45] Amir: stash writes are going up [13:20:22] (03Merged) 10jenkins-bot: ImageSuggestions: add help link to 4 new languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [13:20:38] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]] [13:20:42] T331036: [S] Add help link to article level image suggestions notifications for four additional languages - 
https://phabricator.wikimedia.org/T331036 [13:20:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [13:21:01] @Lucas_WMDE Should I open a phab task for the Beta wiki CSP task ? [13:21:17] (and add the Security Team) [13:21:19] Sohom_Datta: I think that would be a good idea [13:21:33] not 100% sure we need to block this on security team [13:21:40] but at least a phab task seems like a better place to discuss this in general [13:21:58] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and mfossati: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:22:07] mfossati: can you test on mwdebug? [13:22:51] Lucas_WMDE: give me a sec [13:22:54] Amir1: stash writes went from ~20 per minute to ~40 per minute... the original prediction was that per *second*. I am starting to think that we got our time units mixed up during the initial estimation. [13:23:22] If that is the case, we are looking at less than 2GB of data. [13:23:24] :D [13:23:46] otoh, the USA are still largely asleep. [13:24:16] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/929335 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [13:24:24] ok, hitting 80 writes/minute now [13:24:25] duesen: I'm very skeptical of 140GB for VE data [13:25:02] East coast should be awake by now but US is not that big in edits flows and traffic [13:25:28] (03PS1) 10Slyngshede: Wikimedia account link, clear banner on linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/929338 [13:26:07] Lucas_WMDE: you can go ahead, no testing on mwdebug is needed. 
[13:26:38] ok [13:28:07] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) a:03Jclark-ctr [13:28:34] parser cache writes are up, from 10k to 15k per minute [13:28:57] stash writes hovering at around 60 per minute [13:29:11] * duesen goes to double-check the unit [13:29:21] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) [13:29:39] * duesen confirms that it's per minute [13:30:25] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) [13:31:51] Amir1: network utilization seems to be going up a bit on db2142, but not dramatically [13:32:02] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]] (duration: 11m 23s) [13:32:10] T331036: [S] Add help link to article level image suggestions notifications for four additional languages - https://phabricator.wikimedia.org/T331036 [13:32:12] let me check [13:32:44] (03PS1) 10Muehlenhoff: Remove netflow2002 from Kafka config [puppet] - 10https://gerrit.wikimedia.org/r/929340 (https://phabricator.wikimedia.org/T330884) [13:33:54] can someone maybe +1 my config changes before I deploy them? [13:34:00] so they don’t have no review at all ^^ [13:34:08] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300) [13:34:25] (03CR) 10Krinkle: [C: 03+1] Remove old origin-with-crossorigin referrer policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: 10TheDJ) [13:35:43] maybe TheresNoTime? 🥺 [13:35:54] looking [13:36:01] Lucas_WMDE: I'm looking at them now.
I have no idea what they mean :) [13:36:09] thanks :) [13:36:17] But I can +1 as "looks harmless" ;) [13:36:25] Lucas_WMDE: same :-) [13:36:30] I’m doing a grep -r for the config cleanups to confirm there are no references to them in wmf.21 [13:36:31] *.12 [13:36:45] (03CR) 10Daniel Kinzler: [C: 03+1] "looks fine to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE)) [13:36:51] * TheresNoTime defers to others to break prod [13:36:54] (03PS2) 10Lucas Werkmeister (WMDE): [wikidatawiki] Add pagelang to wikidata-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) [13:37:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE)) [13:37:06] (03CR) 10Daniel Kinzler: [C: 03+1] "Don't know what this means for Wikidata, but shouldn't break anything else at least!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE)) [13:37:10] (03CR) 10Daniel Kinzler: [C: 03+1] "Don't know what this means for Wikidata, but shouldn't break anything else at least!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE)) [13:37:15] thanks!
[13:37:20] the permissions change is straightforward to test at least [13:37:24] * Lucas_WMDE prepares the curl command [13:38:28] (03Merged) 10jenkins-bot: [wikidatawiki] Add pagelang to wikidata-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE)) [13:38:42] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]] [13:38:46] T337760: Wikidata: Add pagelang right to wikidata-staff group - https://phabricator.wikimedia.org/T337760 [13:40:03] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:41:29] (03CR) 10Lucas Werkmeister (WMDE): "Tested on mwdebug with:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE)) [13:45:31] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) [13:46:10] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]] (duration: 07m 27s) [13:46:13] T337760: Wikidata: Add pagelang right to wikidata-staff group - https://phabricator.wikimedia.org/T337760 [13:46:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE)) [13:47:02] Lucas_WMDE: Created https://phabricator.wikimedia.org/T338790 [13:47:28] (03Merged) 10jenkins-bot: Remove 
wmgWikibaseTmpWbsubscribersSensibleOutput feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE)) [13:47:44] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]] [13:49:05] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:49:26] (03CR) 10Lucas Werkmeister (WMDE): "Tested by loading https://www.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=L1-S1&format=xmlfm and checking that the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE)) [13:49:52] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) [13:50:48] Amir1: ok, I'm calling this a success. I see no impact at all on x2. [13:51:10] awesome, I still think this needs to be compressed :P [13:51:11] \o/ [13:51:39] (03PS3) 10Slyngshede: Enable password reset and fix wording. [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 [13:51:46] (03CR) 10Slyngshede: Enable password reset and fix wording. 
(036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede) [13:52:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [13:52:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [13:52:19] (03PS2) 10Hashar: Add localhost:* to the beta wiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta) [13:52:34] 10SRE: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317 (10CDanis) [13:54:39] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]] (duration: 06m 54s) [13:54:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE)) [13:54:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:32] (03Merged) 10jenkins-bot: Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE)) [13:55:42] (03PS1) 10AikoChou: ml-services: update outlink isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929342 [13:55:47] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]] [13:55:50] T335107: Remove temporary feature flag for Entity Labels in parsed edit summaries in API requests again - https://phabricator.wikimedia.org/T335107 [13:56:10] (03PS3) 10Slyngshede: Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 [13:57:02] * duesen goes afk for an hour [13:57:08] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:57:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede) [13:57:25] (03CR) 10Lucas Werkmeister (WMDE): "Tested by loading https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&revids=1810599589&formatversion=2&rvprop=comm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE)) [13:58:08] (03CR) 10Elukey: [C: 03+2] ml-services: update outlink isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929342 (owner: 10AikoChou) [13:58:12] (03CR) 10Slyngshede: Captcha: Allow users to request a new captcha. 
(032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede) [13:59:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:16] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:01:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede) [14:02:06] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede) [14:02:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]] (duration: 06m 49s) [14:02:40] !log UTC afternoon backport+config window done [14:02:41] T335107: Remove temporary feature flag for Entity Labels in parsed edit summaries in API requests again - https://phabricator.wikimedia.org/T335107 [14:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:02] Lucas_WMDE: thanks for deploying! 
[14:03:15] Sohom_Datta: let’s see what happens on https://phabricator.wikimedia.org/T338790 – if no one else shares my concerns then I’m okay with this being deployed after all, but I’d like to leave some opportunity for feedback [14:03:30] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [14:03:33] Sure :) [14:04:19] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:32] (03CR) 10Jbond: "lgtm but see comments" [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:05:06] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [14:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:46] (03CR) 10Hashar: [C: 04-1] "I have amended the commit message to point to T338790." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta) [14:11:59] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:59] (03CR) 10Jbond: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [14:15:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [14:15:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [14:16:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [14:16:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [14:16:34] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [14:16:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [14:17:32] jouncebot: nowandnext [14:17:32] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [14:17:32] In 1 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1530) [14:17:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:43] (03CR) 10Jforrester: [C: 03+2] Correct the value for schema_title for stream 
wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [14:17:48] (03PS2) 10Jforrester: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [14:17:54] (03CR) 10Jforrester: "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [14:18:50] (03Merged) 10jenkins-bot: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [14:22:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [14:23:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [14:23:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [14:24:54] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Enable password reset and fix wording. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede) [14:26:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [14:26:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bullseye [14:26:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [14:26:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye [14:28:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster [14:28:38] (03PS2) 10Slyngshede: Wikimedia account link, clear banner on linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/929338 [14:28:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... 
[14:29:11] !log Deployed updated mitigations for T336027 [14:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:32] (03CR) 10Arturo Borrero Gonzalez: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:35:20] (03CR) 10Ayounsi: [C: 03+1] Remove netflow2002 from Kafka config [puppet] - 10https://gerrit.wikimedia.org/r/929340 (https://phabricator.wikimedia.org/T330884) (owner: 10Muehlenhoff) [14:38:44] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage [14:41:35] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [14:41:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage [14:42:56] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6001.drmrs.wmnet [14:44:02] !log rebooting cp6001.drmrs.wmnet for upgrade [14:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:37] 10Puppet, 10Analytics-Radar, 10Data-Engineering-Icebox: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10joanna_borun) [14:44:43] (03PS1) 10Ilias Sarantopoulos: ores: enable per wiki deployment of Ores deprecation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) [14:45:15] wow --^ [14:45:39] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059 (10joanna_borun) [14:47:38] 10Puppet, 
10Infrastructure-Foundations: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054 (10jbond) 05Open→03In progress I believe we now have this in puppet please re-open if i missed something [14:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:50:14] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm but let's make sure the pcc thinks this doesn't change anything in eqiad1." [puppet] - 10https://gerrit.wikimedia.org/r/929334 (owner: 10Arturo Borrero Gonzalez) [14:50:16] 10Puppet, 10Infrastructure-Foundations, 10PostgreSQL, 10User-jbond: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) 05Open→03Resolved a:03jbond unfortunately i forgot what this relates to and general performance is improved now [14:50:22] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [14:50:29] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10joanna_borun) [14:51:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6001.drmrs.wmnet [14:51:59] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Puppet-Core, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10joanna_borun) [14:52:37] (03CR) 10Ahmon Dancy: [C: 03+1] scap3: stop defaulting deployment_group to 
'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:53:42] (03PS1) 10Effie Mouzeli: shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354 [14:54:10] (03CR) 10Ladsgroup: ores: enable per wiki deployment of Ores deprecation (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:55:50] 10Puppet, 10Infrastructure-Foundations: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054 (10jbond) 05In progress→03Resolved a:03jbond [14:55:56] (03PS1) 10Andrew Bogott: base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 [14:56:02] (03PS1) 10Arturo Borrero Gonzalez: openstack: puppetmaster: frontend: remove IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/929357 [14:56:10] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10Performance-Team (Radar): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10joanna_borun) [14:56:28] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6009.drmrs.wmnet [14:56:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott) [14:56:41] !log reboot cp6009.drmrs.wmnet for upgrade [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:29] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:58:42] (03PS2) 10Andrew Bogott: base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 [14:58:45] 10Puppet, 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: PKI server don't reimage cleanly - https://phabricator.wikimedia.org/T270269 (10joanna_borun) [14:59:27] 10Puppet, 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: PKI server don't reimage cleanly - https://phabricator.wikimedia.org/T270269 (10joanna_borun) p:05Medium→03Low [15:00:00] (03PS2) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) [15:00:11] (03CR) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [15:00:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:00:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1011.eqiad.wmnet with OS bullseye [15:00:29] 10Puppet, 10SRE, 10Infrastructure-Foundations: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10jbond) 05Open→03Resolved a:03jbond [15:00:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye completed: - back... 
[15:01:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10joanna_borun) [15:01:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10jbond) @Volans do you know if this is still an issue [15:01:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott) [15:01:39] (03CR) 10AOkoth: [C: 03+2] vrts: use variables in rsyncquickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/928136 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [15:03:40] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/929334/41669/" [puppet] - 10https://gerrit.wikimedia.org/r/929334 (owner: 10Arturo Borrero Gonzalez) [15:04:27] (03CR) 10Ladsgroup: ores: override Beta cluster liftwing URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [15:04:44] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6009.drmrs.wmnet [15:06:09] (03PS1) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) [15:06:11] (03PS1) 10Andrew Bogott: openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 [15:06:14] (03CR) 10Andrew Bogott: [C: 03+2] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott) [15:07:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 (owner: 10Andrew Bogott) [15:08:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:09:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:10:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10Volans) @jbond , no idea if this is still happening, I guess we could look at a bunch of puppet run logs from the reimages and see if there... 
[15:10:32] (03PS2) 10Andrew Bogott: openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 [15:12:48] (03CR) 10Andrew Bogott: [C: 03+2] openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 (owner: 10Andrew Bogott) [15:14:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6002.drmrs.wmnet [15:17:24] !log reboot cp6002.drmrs.wmnet for upgrade [15:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:37] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41671/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:18:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10aborrero) [15:18:32] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:18:38] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) 05Open→03Stalled This is done for codfw1dev DNS servers. I'll mark this task as stalled unti... 
[15:19:36] PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bullseye [15:21:04] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Gehel) Removing Search Platform, our work here is done. [15:23:07] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:23:23] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6002.drmrs.wmnet [15:24:08] (ProbeDown) firing: Service vrts2001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts2001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6010.drmrs.wmnet [15:25:36] !log rebooting cp6010.drmrs.wmnet for upgrade [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:25] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/929357/41673/" [puppet] - 10https://gerrit.wikimedia.org/r/929357 (owner: 10Arturo Borrero Gonzalez) [15:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1530). 
[15:32:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354 (owner: 10Effie Mouzeli) [15:32:54] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) [15:34:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6010.drmrs.wmnet [15:40:14] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546) [15:41:45] (03CR) 10BCornwall: [C: 03+2] Remove leftover TODO item [dns] - 10https://gerrit.wikimedia.org/r/928900 (https://phabricator.wikimedia.org/T309074) (owner: 10BCornwall) [15:42:29] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond) [15:42:59] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6003.drmrs.wmnet [15:43:21] !log reboot cp6003.drmrs.wmnet for upgrade [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:06] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond) [15:44:17] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:45:16] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:51:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6003.drmrs.wmnet [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:02] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10Vgutierrez) I've replicated a successful mTLS handshake with openssl s_client using the following CMD: ` vgutierrez@cp4037:~$ sudo openssl s_client -connect kafka-jumbo1001.eqi... [15:57:20] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond) p:05Triage→03Medium [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: puppetmaster: frontend: remove IPv6 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929357 (owner: 10Arturo Borrero Gonzalez) [15:59:08] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6011.drmrs.wmnet [15:59:28] !log reboot cp6011.drmrs.wmnet for upgrade [15:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:18] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:929378| Bumping portals to master (T128546)]] (duration: 14m 21s) [16:01:22] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:01:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [16:02:06] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 
0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [16:07:22] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:929378| Bumping portals to master (T128546)]] (duration: 06m 03s) [16:07:26] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:08:23] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6011.drmrs.wmnet [16:10:57] (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: remove keystoneadmin section [puppet] - 10https://gerrit.wikimedia.org/r/923696 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:11:35] (03CR) 10Andrew Bogott: [C: 03+2] labs_boostrapvz: Remove class [puppet] - 10https://gerrit.wikimedia.org/r/892944 (owner: 10Majavah) [16:11:43] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 3 others: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (10jbond) [16:12:33] 10Puppet, 10SRE, 10Traffic, 10User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (10jbond) [16:12:55] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) [16:13:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: Rake tasks: add colours and buffer output - https://phabricator.wikimedia.org/T237508 (10jbond) [16:13:34] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (10jbond) [16:14:04] 10Puppet, 10SRE, 10Traffic, 10User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (10Vgutierrez) 
05Open→03Resolved a:03Vgutierrez @jbond I think we can close this one [16:14:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985 (10jbond) [16:14:46] 10Puppet, 10SRE, 10Infrastructure-Foundations: Why doesn't profile::mediawiki::nutcracker create /var/run/nutcracker/ ? - https://phabricator.wikimedia.org/T204450 (10jbond) 05Open→03Resolved a:03jbond we no longer have this profile [16:15:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10jbond) [16:15:32] (03CR) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [16:15:42] (03PS1) 10Slyngshede: C:IDM Minor tweak to captcha. [puppet] - 10https://gerrit.wikimedia.org/r/929381 [16:15:46] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034 (10jbond) [16:16:32] (03CR) 10Slyngshede: [C: 03+2] C:IDM Minor tweak to captcha. 
[puppet] - 10https://gerrit.wikimedia.org/r/929381 (owner: 10Slyngshede) [16:17:18] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet CI should fail over CRLF line endings (sometimes) - https://phabricator.wikimedia.org/T182641 (10jbond) [16:20:10] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230 (10jbond) [16:21:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692 (10jbond) [16:22:18] 10Puppet: Module uwsgi doesn't allow passing multiple config params of same name - https://phabricator.wikimedia.org/T123809 (10jbond) [16:23:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Mail: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933 (10jbond) @MoritzMuehlenhoff should this be closed [16:24:57] 10Puppet, 10SRE, 10Infrastructure-Foundations: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246 (10jbond) 05Open→03Resolved a:03jbond Im going to close this im pretty sure its fixed now but please re-open if not [16:25:20] 10Puppet, 10Toolforge, 10Documentation: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (10jbond) [16:25:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10jbond) [16:26:34] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10jbond) [16:27:03] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: filter large factsets - https://phabricator.wikimedia.org/T287674 (10jbond) 05Open→03Resolved a:03jbond We added some filters and puppetdb performance seems to have settled [16:27:10] 
10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [16:28:16] 10Puppet, 10SRE, 10Observability-Alerting, 10Puppet-Infrastructure: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (10jbond) [16:28:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10conftool: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10jbond) [16:29:43] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: create puppetboard debian package - https://phabricator.wikimedia.org/T292523 (10jbond) 05In progress→03Resolved [16:29:53] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond) [16:30:30] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: Update python3-pypuppetdb package to 2.4.0 - https://phabricator.wikimedia.org/T292525 (10jbond) 05Open→03Resolved [16:30:34] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: create puppetboard debian package - https://phabricator.wikimedia.org/T292523 (10jbond) [16:30:45] (03CR) 10Vgutierrez: "looks good overall. I'll get to merge it tomorrow EU morning." [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) (owner: 10Majavah) [16:33:51] (03PS1) 10Jbond: tlsproxy::localssl: drop class [puppet] - 10https://gerrit.wikimedia.org/r/929383 (https://phabricator.wikimedia.org/T191393) [16:36:18] 10Puppet, 10Infrastructure-Foundations: Puppetdb: not refreshed on config change? - https://phabricator.wikimedia.org/T291540 (10jbond) 05Open→03Resolved a:03jbond @volans im going to reject this and say its better to manually disable puppet fleet wide to roll out theses changes but please re-open if you... 
[16:36:21] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [16:36:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [16:36:53] (03CR) 10Andrew Bogott: [C: 03+2] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:37:18] (03CR) 10Andrew Bogott: [C: 03+2] wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [16:39:08] (03CR) 10Andrew Bogott: [C: 04-2] "we aren't ready for this until https://storyboard.openstack.org/#!/story/2010784 is resolved" [puppet] - 10https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577) (owner: 10Andrew Bogott) [16:39:50] (03PS13) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [16:39:52] (03PS21) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [16:39:54] (03PS2) 10Andrew Bogott: Set OS_CLOUD in wmcs-openstack.sh [puppet] - 10https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577) [16:43:22] (03PS1) 10Elukey: profile::cache::kafka::certificate: fix pki cert path [puppet] - 10https://gerrit.wikimedia.org/r/929384 [16:46:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41675/console" [puppet] - 10https://gerrit.wikimedia.org/r/929384 (owner: 10Elukey) [16:46:49] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::cache::kafka::certificate: fix pki 
cert path [puppet] - 10https://gerrit.wikimedia.org/r/929384 (owner: 10Elukey) [16:47:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm) [16:47:25] (03CR) 10Andrew Bogott: [C: 03+2] ldap: inline yamlconfig [puppet] - 10https://gerrit.wikimedia.org/r/924984 (owner: 10Majavah) [16:48:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [16:48:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003'] [16:49:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [16:50:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:52:19] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:52:37] (03CR) 10Andrew Bogott: [C: 04-1] ldap::client::sssd: use strongly typed parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah) [16:54:45] (03PS1) 10BBlack: geo-maps: Move default to the top for visibility [dns] - 10https://gerrit.wikimedia.org/r/929386 (https://phabricator.wikimedia.org/T337535) [16:55:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003'] [16:55:19] (03CR) 10Ladsgroup: "shouldn't we exclude wikidata and commons explicitly?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [16:55:26] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [16:55:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967 (10jbond) 05Open→03Resolved a:03jbond closing this it must be done now [16:55:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm) [16:55:44] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) p:05Medium→03Low [16:56:05] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10jbond) [16:56:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [16:59:01] (03CR) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1700) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1700). 
[17:00:55] (03PS1) 10Hnowlan: Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649) [17:03:28] !log creating ganeti VM people1004 with os==bookworm passed to makevm cookbook to test bookworm and because this is traditionally an early adoptor of new distro releases [17:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:47] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet [17:03:50] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [17:03:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) a:05Jclark-ctr→03Jhancock.wm @BTullis hi, can you give me more information on what type of hardware raid we are using on these se... [17:07:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: puppet (systemd::service) attempts to start manually masked units - https://phabricator.wikimedia.org/T211027 (10jbond) > Looks like this is working as intended for systemd provider (/usr/lib/ruby/vendor_ruby/puppet/provider/service/systemd.rb) although if... 
[17:07:48] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [17:08:23] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [17:08:34] PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:08:36] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186 (10jbond) [17:08:47] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [17:08:49] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186 (10jbond) update the list in the description [17:09:01] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix regex.yaml single-regex issue - https://phabricator.wikimedia.org/T183565 (10jbond) [17:09:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [17:09:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:51] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [17:09:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [17:10:19] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001" [17:10:59] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Use multiple puppetdbs on puppet masters - https://phabricator.wikimedia.org/T169318 (10jbond) Im curious how puppetdb failed? do you rember? As the postgress write master is always on the primary puppetdb server im not sure we would get much of a win her... [17:11:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001" [17:12:23] (03PS2) 10Dzahn: phabricator: "Automate yearly metrics for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [17:12:50] (03CR) 10Dzahn: [C: 03+2] "nitpick: in puppet repo, commit message should start with name of module followed by :" [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [17:14:35] (03PS1) 10Hnowlan: poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) [17:14:48] 10Puppet, 10Infrastructure-Foundations: Add check for puppetboard - https://phabricator.wikimedia.org/T296304 (10jbond) 05Open→03Resolved a:03jbond [17:15:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people1004.eqiad.wmnet with OS bookworm [17:17:30] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425 (10jbond) [17:18:21] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 2 others: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) [17:18:31] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - 
https://phabricator.wikimedia.org/T292522 (10jbond) 05In progress→03Resolved [17:18:46] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond) [17:18:49] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) 05In progress→03Resolved [17:19:47] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10jbond) for the records with puppet 7 i plan to explore using srv records which may help with this [17:20:07] 10Puppet, 10Infrastructure-Foundations: update hiera order in production environment - https://phabricator.wikimedia.org/T301349 (10jbond) 05Open→03Resolved a:03jbond [17:21:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Maps, 10netbox: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) @hnowlan i did some patches to add support for this with the puppetdb upgrade. it no longer suports password changes but it dose all... 
[17:21:45] 10SRE, 10Infrastructure-Foundations, 10Maps, 10Puppet-Infrastructure, and 2 others: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) [17:22:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [17:22:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [17:23:28] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-jbond: puppetmasters: update the puppet masters so they use them self for the puppet run - https://phabricator.wikimedia.org/T238093 (10jbond) [17:23:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) [17:24:10] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10jbond) [17:24:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10jbond) [17:24:37] 10Puppet, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond) [17:25:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10jbond) 05Open→03Resolved a:03jbond [17:25:14] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) 
[17:26:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) 05Open→03Resolved a:03jbond [17:26:56] 10SRE, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10jbond) [17:27:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) 05Open→03Resolved a:03jbond [17:27:21] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10observability, 10User-jbond: Add monitoring for the puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10jbond) [17:28:16] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [17:28:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [17:28:48] RECOVERY - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:28:54] 10Puppet, 10SRE, 10Infrastructure-Foundations: Usual git mechanism for aborting commit does not work on the private puppet repo - https://phabricator.wikimedia.org/T211121 (10jbond) 05Open→03Resolved a:03jbond closing, but please re-open if its still an issue [17:29:47] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10observability: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228 (10jbond) 05Open→03Declined closing due to lack of response [17:30:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10observability, 10Patch-For-Review: 
Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10jbond) [17:31:24] 10SRE-tools, 10Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (10jbond) [17:31:48] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10jbond) 05Open→03Resolved a:03jbond closing this we have hopefully made it past the puppetdb issues [17:31:53] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [17:32:16] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10jbond) [17:33:36] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10User-jbond: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10jbond) 05In progress→03Resolved a:03jbond going to resolve this i think the original question was answered [17:33:45] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond) [17:35:21] 10Puppet, 10Puppet-Infrastructure, 10cloud-services-team: Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10jbond) [17:38:05] (03CR) 10Ladsgroup: [C: 03+1] poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [17:38:15] (03CR) 10Ladsgroup: [C: 03+1] Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) 
[17:41:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:46:21] (03CR) 10BBlack: [C: 03+2] "This is a functional no-op, just moving and commenting on this "default" entry for clarity and visibility." [dns] - 10https://gerrit.wikimedia.org/r/929386 (https://phabricator.wikimedia.org/T337535) (owner: 10BBlack) [17:50:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:03:53] (03CR) 10Dzahn: [C: 03+2] "tested and looks good to me now:" [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:04:06] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host people1004.eqiad.wmnet with OS bookworm [18:04:07] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:06:13] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:09:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" 
[18:09:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:09:27] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [18:09:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [18:09:31] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host people1004.eqiad.wmnet [18:14:17] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet [18:14:18] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bullseye [18:20:44] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:21:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:21:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:21:47] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [18:21:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [18:21:52] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:24:28] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:24:55] I have no idea why it first adds records and then removes them again [18:24:58] in the same cookbook run [18:25:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:25:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:32] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [18:25:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [18:25:41] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people1004.eqiad.wmnet [18:26:50] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet [18:26:51] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:28:21] mutante: on failure it rollbacks the new assigned IP and related DNS records [18:31:51] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "checked compiler output -> full catalog. looks good to me. this will add the rsync on 1003 to push to 2002 and it looks absented on 2002 i" [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [18:32:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) @Papaul Your side is all set. We have some switch overs scheduled for the end of the month to finish up our side of the task too. Thanks fo... 
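Note on the exchange at 18:24-18:28 above: the "Add records" followed by "Remove records" in one makevm run matches the explanation given, that a failed run rolls back the newly assigned IP and its DNS records. A minimal sketch of that rollback-on-failure pattern follows. It is not the actual sre.ganeti.makevm code; the function name `make_vm`, the `fail_at` parameter, and the step strings are hypothetical and only illustrate the ordering seen in the log.

```python
from contextlib import ExitStack


def make_vm(hostname, steps_log, fail_at=None):
    """Hypothetical sketch: every step registers its own undo action.

    If a later step raises, ExitStack runs the registered undo actions
    in reverse order, which is why the log shows records being added
    and then removed within the same run.
    """
    with ExitStack() as undo:
        steps_log.append(f"allocate IP for {hostname}")
        undo.callback(steps_log.append, f"release IP for {hostname}")

        steps_log.append(f"add DNS records for {hostname}")
        undo.callback(steps_log.append, f"remove DNS records for {hostname}")

        if fail_at == "create":
            # Assumed failure point, standing in for the FAIL exits in the log.
            raise RuntimeError(f"VM creation failed for {hostname}")
        steps_log.append(f"create VM {hostname}")

        # Success: drop the undo actions so nothing is rolled back.
        undo.pop_all()
    return steps_log


if __name__ == "__main__":
    log = []
    try:
        make_vm("people1004.eqiad.wmnet", log, fail_at="create")
    except RuntimeError as exc:
        log.append(f"error: {exc}")
    for entry in log:
        print(entry)
```

On the failing path the undo actions run last-in, first-out, so the DNS records are removed before the IP is released; on success `pop_all()` detaches them and the allocation is kept.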
[18:32:46] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:33:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:33:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:46] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [18:33:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [18:33:51] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:35:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:36:04] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:36:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:37:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [18:37:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:05] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [18:37:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [18:37:14] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people1004.eqiad.wmnet [18:37:56] (03Abandoned) 10Sohom Datta: Add localhost:* to the beta wiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta) [18:39:25] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people2003.codfw.wmnet [18:39:26] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:41:50] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001" [18:42:20] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [18:42:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [18:42:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001" [18:42:47] !log dzahn@cumin1001 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [18:42:47] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people2003.codfw.wmnet on all recursors [18:42:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2003.codfw.wmnet on all recursors [18:43:16] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001" [18:44:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001" [18:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:54:20] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) Replaced Failed Dimm DIMM_B6 [18:54:29] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) 05Open→03Resolved [18:55:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) The Engineer is expected to arrive on 06/13/2023 09:00 AM to 06:00 PM [19:03:54] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:05:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people2003.codfw.wmnet with OS bookworm [19:08:32] PROBLEM - 
Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:11:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [19:11:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [19:11:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [19:11:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... 
[19:14:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [19:14:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [19:14:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [19:15:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... [19:15:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:44] (03PS1) 10Chad: deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) [19:16:13] (03CR) 10CI reject: [V: 04-1] deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: 10Chad) [19:18:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:32] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:10] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:42] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10MoritzMuehlenhoff) [19:28:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [19:32:12] (03PS3) 10Samtar: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [19:33:08] (03PS6) 10Samtar: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy) [19:33:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [19:33:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [19:33:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [19:33:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS 
bullseye executed with err... [19:33:42] (03PS2) 10Samtar: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [19:34:13] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 136 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:34:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Papaul) [19:35:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Papaul) @Jhancock.wm you can proceed with the OS install [19:35:44] (03PS1) 10Aklapper: phabricator: "Automate yearly metrics for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/929402 (https://phabricator.wikimedia.org/T337388) [19:35:44] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@fb9dba3]: repoint drafttopic ingestion to model specific stream [19:35:54] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@fb9dba3]: repoint drafttopic ingestion to model specific stream (duration: 00m 10s) [19:38:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1149'] [19:38:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [19:38:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [19:38:29] (03CR) 10Dzahn: [C: 03+2] phabricator: "Automate yearly metrics 
for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/929402 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [19:39:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:41:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [19:46:22] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41678/console" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [19:47:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149'] [19:49:44] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41679/console" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [19:50:46] (03CR) 10BCornwall: [C: 03+1] "Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [19:51:44] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [19:54:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [19:57:52] (03CR) 10BCornwall: [V: 03+1 C: 03+1] fifo_log_demux: Fix systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T2000). [20:00:05] Daimona and Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:28] o/ [20:00:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [20:00:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye [20:01:37] i can deploy today [20:01:37] * TheresNoTime will assume urbanecm will be doing the deploy window given their patches ^^ [20:01:43] * TheresNoTime assumed correctly [20:01:51] * taavi was just about to assume that [20:02:25] (03CR) 10Urbanecm: [C: 03+2] prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [20:02:36] * Daimona thanks Urbanecm [20:02:40] (03CR) 10Urbanecm: [C: 03+2] Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy) [20:02:45] (03CR) 10Urbanecm: [C: 03+2] Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [20:03:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [20:03:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy) [20:03:12] (03CR) 10BCornwall: [V: 03+1 C: 03+2] service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:03:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using 
scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [20:03:16] (03Merged) 10jenkins-bot: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy) [20:03:25] (03Merged) 10jenkins-bot: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy) [20:03:35] !log Roll restarting pybal on lvs2014 then lvs2013 - T863380 [20:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:40] (03PS3) 10Urbanecm: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [20:03:44] (03CR) 10TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [20:04:31] (03Merged) 10jenkins-bot: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy) [20:04:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]] [20:04:51] T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364 [20:04:52] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [20:05:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:05:19] (03PS2) 10Chad: deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) [20:06:15] !log urbanecm@deploy1002 daimona and urbanecm: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:06:26] Daimona: can you test your patches at mwdebug1001? [20:07:38] Hmmmm... First one should be a noop, so I can try and make sure that nothing explodes. No idea for the other two, though... [20:08:49] the other two seems no-ops too to me? [20:08:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:08:55] testing nothing explodes makes sense to me :) [20:08:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:47] Yeah, they should all be noop actually [20:10:15] And it's looking good to me on mwdebug1001 [20:10:42] good, syncing [20:11:55] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:14:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:14:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1010.eqiad.wmnet with OS bullseye [20:14:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye completed: - back... 
[20:15:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) [20:15:40] (03PS1) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 [20:16:07] (03CR) 10CI reject: [V: 04-1] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (owner: 10BCornwall) [20:16:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) a:03Jclark-ctr [20:16:21] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]] (duration: 11m 33s) [20:16:25] T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364 [20:16:28] Daimona: deployed :) [20:16:30] anything else? 
[20:16:42] (03PS2) 10Urbanecm: [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) [20:16:46] (03CR) 10Urbanecm: [C: 03+2] [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:16:52] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.25:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:16:56] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 77 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [20:17:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:17:24] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) mw1492 T338566 Server down to failed Mainboard pending replacement [20:17:42] (03Merged) 10jenkins-bot: [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:17:57] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]] [20:18:00] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [20:18:37] Amazing, thank you :) [20:18:49] any time [20:19:25] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:19:46] (03PS1) 10Dzahn: Revert "Revert 
"Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943 [20:20:29] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) 05Open→03Resolved Replaced cable on ganeti1031 [20:21:40] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 80 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [20:22:34] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host people2003.codfw.wmnet with OS bookworm [20:22:34] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:23:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:23:58] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) @ayounsi Tomorrow i would like you assistance if available to clean fiber /replace optic [20:24:07] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) a:03Jclark-ctr [20:24:50] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]] (duration: 06m 53s) [20:25:00] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [20:25:48] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:26:42] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.25:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:28:02] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people2003.codfw.wmnet - dzahn@cumin1001" [20:28:10] (03PS1) 10BCornwall: service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 [20:28:29] (03CR) 10BBlack: [C: 03+1] service: 
Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (owner: 10BCornwall) [20:28:34] (03PS2) 10Urbanecm: [Growth] Enable new Impact module for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) [20:28:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:28:40] (03CR) 10CI reject: [V: 04-1] service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (owner: 10BCornwall) [20:28:53] !log Run extensions/GrowthExperiments/maintenance/refreshUserImpactData.php for rowiki (T336203) [20:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people2003.codfw.wmnet - dzahn@cumin1001" [20:29:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:29:06] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people2003.codfw.wmnet on all recursors [20:29:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2003.codfw.wmnet on all recursors [20:29:10] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people2003.codfw.wmnet [20:29:34] (03Merged) 10jenkins-bot: [Growth] Enable new Impact module for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [20:29:50] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]] [20:29:52] (03PS2) 10BCornwall: service: Expect 302 response for prometheus-https [puppet] - 
10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) [20:30:15] (03CR) 10BBlack: service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall) [20:30:28] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.25:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:30:38] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 126 connections established with conf1007.eqiad.wmnet:4001 (min=127) https://wikitech.wikimedia.org/wiki/PyBal [20:31:10] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41680/console" [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall) [20:31:10] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:31:17] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [20:31:27] (03CR) 10BCornwall: [V: 03+1 C: 03+2] service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall) [20:31:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:28] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:36:57] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]] (duration: 07m 06s) [20:37:01] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [20:38:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:45] * urbanecm done [20:44:15] (03PS2) 10Dzahn: Revert "Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) [20:46:19] (03CR) 10Dzahn: "probably we should just use a single config file instead of repeating the same mysql metrics user for each script.. but let's do that sepa" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [20:48:20] (03CR) 10Ladsgroup: [C: 03+1] Switch VisualEditor to bypass RESTbase on all wikis. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [20:50:58] (03PS1) 10Ebernhardson: cirrus: Enable analysis chain deduplication for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929411 (https://phabricator.wikimedia.org/T334194) [20:56:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS bullseye [20:56:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye executed with errors: - srete... [20:58:21] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Ladsgroup) Thanks! I'm setting the mysql up and making sure it's getting replicated. [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T2100) [21:03:48] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:05:04] (03CR) 10Dzahn: [C: 03+2] "Failed to parse calendar specification '*-1, 4, 7, 10-1 0:0:00': Invalid argument :/" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:09:55] (03PS1) 10Dzahn: phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) [21:10:12] (03CR) 10CI reject: [V: 04-1] phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:16:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [21:16:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [21:20:06] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [21:20:32] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [21:23:25] (03PS1) 10Papaul: Add sretest1003 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/929418 (https://phabricator.wikimedia.org/T334393) [21:24:24] (03CR) 10Papaul: [C: 03+2] Add sretest1003 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/929418 (https://phabricator.wikimedia.org/T334393) (owner: 10Papaul) [21:28:34] (03PS2) 10Dzahn: phabricator: try to fix month 
range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) [21:28:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [21:29:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye [21:30:00] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:30:14] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:31:24] (03PS1) 10BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T863380) [21:32:09] (03CR) 10Dzahn: [C: 03+2] phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:33:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [21:33:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) @ArielGlenn i am still working on those servers after @MoritzMuehlenhoff show me the fix on installing Buster on those servers i tried it o... 
[21:34:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:11] (03CR) 10Dzahn: [C: 03+2] "[phab1004:~] $ sudo systemctl status phabricator_stats_job_quarterly_metrics.timer" [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:38:51] (03CR) 10Dzahn: [C: 03+2] "@Aklapper: got it mostly solved but some numbers are missing:" [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10MoritzMuehlenhoff) In the busybox shell, what does "uname -a" show as the running kernel version? 
[21:42:09] (03PS1) 10Dzahn: phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) [21:43:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) @MoritzMuehlenhoff ` (initramfs) uname -a Linux (none) 4.19.0-24-amd64 #1 SMP Debian 4.19.282-1 (2023-04-29) x86_64 GNU/Linux (initramfs) [21:44:02] (03PS2) 10BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T301944) [21:44:59] Phab admin or acl*userdisable https://phabricator.wikimedia.org/p/Rule34Enjoyer/ [21:48:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338904 (10phaultfinder) [21:50:31] Disabled the account [21:51:08] (03CR) 10Dzahn: [C: 03+2] phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [21:51:14] (03PS2) 10Dzahn: phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) [22:05:20] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:18] (03CR) 10Dzahn: "works now after this:" [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn) [22:13:00] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:14] (03PS2) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 
(https://phabricator.wikimedia.org/T301944) [22:15:26] (03CR) 10CI reject: [V: 04-1] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [22:16:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster [22:16:54] (03PS3) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) [22:17:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... [22:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:18:51] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41681/console" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [22:20:43] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [22:21:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [22:22:32] !log Roll restarting pybal on lvs2014 to revert prometheus service rollout - T326657 [22:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:36] T326657: Add 
prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [22:22:40] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:23:06] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 80 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [22:23:44] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:23:48] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 77 connections established with conf2004.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [22:24:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [22:26:11] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:26:17] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 126 connections established with conf1007.eqiad.wmnet:4001 (min=126) https://wikitech.wikimedia.org/wiki/PyBal [22:27:17] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.25:443]) https://wikitech.wikimedia.org/wiki/PyBal [22:31:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:34:38] (03CR) 10EoghanGaffney: admin: reserve gerrit uid/gid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar) [22:37:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:40:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:42:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:42:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye [22:42:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye completed: - sretest1003 (**P... [22:43:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm) [22:46:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm) 05Open→03Resolved @jbond or @Volans finished this. all yours. [22:46:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:49:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:54:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:57:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) 05Open→03Resolved [23:03:31] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [23:05:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [23:08:09] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:31] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [23:14:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [23:17:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [23:17:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [23:36:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [23:36:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [23:46:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:49:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye [23:49:20] (03Abandoned) 10Chad: WIP: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/915701 (https://phabricator.wikimedia.org/T320390) (owner: 10Chad) [23:52:34] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [23:52:40] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye