[00:00:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/928909
[00:39:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/928909 (owner: TrainBranchBot)
[00:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:10] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/928909 (owner: TrainBranchBot)
[01:30:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:14] <icinga-wm>	 RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:00] <icinga-wm>	 RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:54] <icinga-wm>	 PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:15:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:27] <Amir1>	 !log ladsgroup@mwmaint1002:~$ mwscript maintenance/storage/moveToExternal.php --wiki=enwiki --end 32000000 --undo /home/ladsgroup/T128151.undo.sql --iconv DB cluster27 (T128151)
[02:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:31] <stashbot>	 T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151
[03:00:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:05:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:28] <icinga-wm>	 PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:24] <icinga-wm>	 PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:56] <Amir1>	 I brought db1139 back; I can ssh into it, but it needs a data check and such
[03:28:22] <icinga-wm>	 RECOVERY - MariaDB read only s2 on db1139 is OK: Version 10.4.25-MariaDB, Uptime 36s, read_only: True, event_scheduler: True, 20.53 QPS, connection latency: 0.003834s, query latency: 0.000428s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[03:28:24] <icinga-wm>	 RECOVERY - MariaDB read only s1 on db1139 is OK: Version 10.4.25-MariaDB, Uptime 96s, read_only: True, event_scheduler: True, 1798.92 QPS, connection latency: 0.004415s, query latency: 0.000292s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[03:28:48] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s1 on db1139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:29:10] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s1 on db1139 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:29:14] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on db1139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:29:16] <icinga-wm>	 RECOVERY - mysqld processes on db1139 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:29:50] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on db1139 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:45:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_netflow.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[04:12:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[04:15:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[04:16:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[04:32:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[04:32:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[05:10:10] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on db1139 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:17:33] <wikibugs>	 (PS1) Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343)
[05:23:07] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-06-12-051618-production [deployment-charts] - https://gerrit.wikimedia.org/r/929036 (https://phabricator.wikimedia.org/T338146)
[05:33:36] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2023-06-10-124931-production [deployment-charts] - https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905)
[05:36:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:15] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (phaultfinder)
[05:41:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:12:38] * kart_ updating MinT now..
[06:12:56] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2023-06-10-124931-production [deployment-charts] - https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905) (owner: KartikMistry)
[06:14:04] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2023-06-10-124931-production [deployment-charts] - https://gerrit.wikimedia.org/r/929038 (https://phabricator.wikimedia.org/T284905) (owner: KartikMistry)
[06:16:43] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:25:36] <kart_>	 Service update seems to be taking more time than usual :/
[06:27:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:36:01] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:36:05] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[06:37:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/929019 (owner: Majavah)
[06:41:44] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:44:12] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org
[06:45:13] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:48:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:48:37] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:50:57] <moritzm>	 !log upgrading bookworm pilot installations to final/released bookworm package state T330495
[06:54:26] <kart_>	 !log Updated MinT to 2023-06-10-124931-production (T284905)
[06:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:30] <stashbot>	 T284905: Softcatalà translator - requested for integration as an MT service for CX - https://phabricator.wikimedia.org/T284905
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T0700)
[07:00:05] <jouncebot>	 Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:18] <taavi>	 o/
[07:01:11] <moritzm>	 !log upgrading bookworm netboot images to final/released bookworm images T330495
[07:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:16] <stashbot>	 T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495
[07:01:19] <taavi>	 Superpes: around?
[07:03:22] <wikibugs>	 (PS3) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825)
[07:04:54] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41666/console" [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[07:09:26] <wikibugs>	 (PS6) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825)
[07:09:28] <wikibugs>	 (PS4) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825)
[07:10:37] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (MoritzMuehlenhoff) Open→Resolved All done, bookworm has been released (https://lists.debian.org/debian-announce/2023/msg00001.html) and our installer/base...
[07:10:48] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41667/console" [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[07:27:30] <wikibugs>	 (PS7) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825)
[07:27:32] <wikibugs>	 (PS5) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825)
[07:28:44] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41668/console" [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:08:56] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, two nits inline" [puppet] - https://gerrit.wikimedia.org/r/928644 (owner: JHathaway)
[08:13:07] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (MoritzMuehlenhoff) >>! In T330884#8863775, @MoritzMuehlenhoff wrote: > @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage...
[08:14:24] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ayounsi) Awesome! In place is fine.
[08:19:52] <wikibugs>	 (CR) Btullis: [C: +1] "Nice." [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:23:47] <wikibugs>	 (PS1) Slyngshede: P:hive::client move beeline script to files. [puppet] - https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480)
[08:25:56] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/929015 (owner: Majavah)
[08:25:57] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (MoritzMuehlenhoff)
[08:26:05] <wikibugs>	 (CR) CI reject: [V: -1] P:hive::client move beeline script to files. [puppet] - https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:26:16] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494) (owner: Majavah)
[08:28:24] <wikibugs>	 (CR) Elukey: [C: +2] profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:29:15] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/929018 (owner: Majavah)
[08:29:59] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:30:18] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[08:30:29] <Superpes>	 @Taavi Sorry I had a sudden commitment and was not able to be present! Will reschedule them for another window :)
[08:30:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[08:30:38] <taavi>	 jouncebot: nowandnext
[08:30:38] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 29 minute(s)
[08:30:38] <jouncebot>	 In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1000)
[08:30:43] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[08:30:46] <taavi>	 Superpes: or we can just push it out now if you have time
[08:31:03] <Superpes>	 taavi Yep, if you can, many thanks :)
[08:32:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) (owner: Superpes15)
[08:32:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/928992 (https://phabricator.wikimedia.org/T338621) (owner: Superpes15)
[08:32:55] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:33:03] <wikibugs>	 (Merged) jenkins-bot: [knwiki] Add a temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) (owner: Superpes15)
[08:33:06] <wikibugs>	 (Merged) jenkins-bot: [lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation [mediawiki-config] - https://gerrit.wikimedia.org/r/928992 (https://phabricator.wikimedia.org/T338621) (owner: Superpes15)
[08:33:42] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation (T338621)]]
[08:33:47] <stashbot>	 T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136
[08:33:48] <stashbot>	 T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621
[08:36:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Clement_Goubert) p:Triage→Medium a:Jclark-ctr
[08:38:09] <wikibugs>	 (PS2) Majavah: P:toolforge::apt_pinning: remove unused pinnings [puppet] - https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494)
[08:38:11] <wikibugs>	 (PS1) Majavah: P:toolforge::apt_pinning: remove sssd pinning [puppet] - https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494)
[08:38:28] <wikibugs>	 (PS3) Jbond: dev env: avoid kernel tweaks when in a container [puppet] - https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:39:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow4002.ulsfo.wmnet with OS bookworm
[08:39:34] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm
[08:40:49] <taavi>	 Superpes: looks like the container build is again taking ages :/ I'll ping you when they are available for testing
[08:41:05] <wikibugs>	 (CR) CI reject: [V: -1] dev env: avoid kernel tweaks when in a container [puppet] - https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:42:36] <wikibugs>	 (PS4) Jbond: dev env: avoid kernel tweaks when in a container [puppet] - https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:42:38] <wikibugs>	 (PS1) Jbond: wmflib.is_container: add mocked fact [puppet] - https://gerrit.wikimedia.org/r/929161 (https://phabricator.wikimedia.org/T337972)
[08:42:46] <logmsgbot>	 !log taavi@deploy1002 superpes and taavi: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation (T338621)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:42:51] <stashbot>	 T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136
[08:42:51] <stashbot>	 T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621
[08:43:20] <Superpes>	 Testing
[08:44:07] <Superpes>	 Everything is fine :) Thanks @taavi!
[08:44:33] <taavi>	 syncing!
[08:45:33] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm, CI error was related to the missing mocked fact, see preceding CR (i think i saw you add this fact some where else so feel free to r" [puppet] - https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:48:08] <wikibugs>	 (CR) Jbond: "nice :)" [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:49:47] <wikibugs>	 (PS1) Btullis: Deploy the new eventgate-wikimedia container image [deployment-charts] - https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716)
[08:50:27] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:928504|[knwiki] Add a temporary logo for the 20th anniversary (T338136)]], [[gerrit:928992|[lmowiki] Removing the Purtaal namespace and fixing the Portal talk translation (T338621)]] (duration: 16m 44s)
[08:50:32] <stashbot>	 T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136
[08:50:32] <stashbot>	 T338621: lmo.wiki namespaces change - https://phabricator.wikimedia.org/T338621
[08:51:22] <wikibugs>	 (PS1) Muehlenhoff: Add Bookworm to debdeploy config and remove Stretch [puppet] - https://gerrit.wikimedia.org/r/929165
[08:52:16] <aanzx>	 Thanks taavi and Superpes, the logo is working on the wiki now
[08:56:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[08:56:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[08:56:17] <Superpes>	 Thanks taavi :)
[08:56:43] <wikibugs>	 (CR) Jbond: "LGTM but it would be good to also include https://gerrit.wikimedia.org/r/c/operations/puppet/+/929161 in this patch" [puppet] - https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:56:55] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[08:57:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow4002.ulsfo.wmnet with reason: host reimage
[08:58:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::apt_pinning: remove unused pinnings [puppet] - https://gerrit.wikimedia.org/r/929016 (https://phabricator.wikimedia.org/T290494) (owner: Majavah)
[09:00:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:08] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: P:toolforge::apt_pinning: remove sssd pinning [puppet] - https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494) (owner: Majavah)
[09:01:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow4002.ulsfo.wmnet with reason: host reimage
[09:02:04] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Create additional nftables directories [puppet] - https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[09:04:13] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/929165 (owner: Muehlenhoff)
[09:04:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge::apt_pinning: remove sssd pinning [puppet] - https://gerrit.wikimedia.org/r/929159 (https://phabricator.wikimedia.org/T290494) (owner: Majavah)
[09:04:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Bookworm to debdeploy config and remove Stretch [puppet] - https://gerrit.wikimedia.org/r/929165 (owner: Muehlenhoff)
[09:05:09] <arturo>	 moritzm: we just hit submit at the same time @ gerrit :-P
[09:05:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:33] <arturo>	 moritzm: ok to merge your patch? Add Bookworm to debdeploy config and remove Stretch (051e74420a)
[09:06:48] <moritzm>	 yeah, I was trying to puppet-merge it, but you had the lock already :-)
[09:07:02] <arturo>	 cool, merging
[09:08:05] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptly::client: configure unattended-upgrades [puppet] - https://gerrit.wikimedia.org/r/929018 (owner: Majavah)
[09:09:00] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:bookworm: prepare apt repo for toolforge [puppet] - https://gerrit.wikimedia.org/r/929015 (owner: Majavah)
[09:10:28] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM, we can also limit the deployment to event-gate main IIUC, but rolling it to all the instances will be more consistent. As you prefer" [deployment-charts] - https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: Btullis)
[09:13:09] <wikibugs>	 (PS1) Muehlenhoff: Build Bookworm base image [puppet] - https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560)
[09:19:24] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm obviously depends on https://gerrit.wikimedia.org/r/c/operations/puppet/+/928903" [puppet] - https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[09:25:48] <wikibugs>	 (CR) Jbond: [C: +2] P:puppet::agent: add script to locate unmanaged files [puppet] - https://gerrit.wikimedia.org/r/928857 (owner: Majavah)
[09:26:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[09:26:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk
[09:30:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:55] <jinxer-wm>	 (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[09:33:50] <wikibugs>	 (CR) Btullis: [C: +2] Deploy the new eventgate-wikimedia container image [deployment-charts] - https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: Btullis)
[09:34:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow4002.ulsfo.wmnet with OS bookworm
[09:34:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:50] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm completed: - netflow4002 (**PASS**)   -...
[09:34:58] <wikibugs>	 (Merged) jenkins-bot: Deploy the new eventgate-wikimedia container image [deployment-charts] - https://gerrit.wikimedia.org/r/929164 (https://phabricator.wikimedia.org/T335716) (owner: Btullis)
[09:38:33] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (phaultfinder)
[09:39:26] <wikibugs>	 (CR) Majavah: [C: +1] "Looks correct to me, thanks!" [puppet] - https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560) (owner: Muehlenhoff)
[09:44:51] <wikibugs>	 (PS1) Elukey: Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - https://gerrit.wikimedia.org/r/928934
[09:45:30] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - https://gerrit.wikimedia.org/r/928934 (owner: Elukey)
[09:48:28] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[09:56:06] <wikibugs>	 (PS1) Slyngshede: Enable password reset and fix wording. [software/bitu] - https://gerrit.wikimedia.org/r/929171
[09:57:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[09:58:53] <wikibugs>	 (PS2) Slyngshede: Enable password reset and fix wording. [software/bitu] - https://gerrit.wikimedia.org/r/929171
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1000)
[10:07:25] <wikibugs>	 (03CR) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[10:07:41] <wikibugs>	 (03PS2) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205)
[10:08:18] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[10:09:59] <wikibugs>	 (03CR) 10Jbond: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[10:13:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, not sure what the CI issue is i rebased locally fine, tried to push and got a no changes error.  its also nice to see such a saving " [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[10:15:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:31:25] <wikibugs>	 (03CR) 10Jbond: "Thanks for the feedback but its probably best to move this to a task" [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh)
[10:40:54] <Amir1>	 !log mwscript maintenance/storage/moveToExternal.php --wiki=enwiki --start 31000000 --end 110000000 --undo /home/ladsgroup/T128151.undo.sql --iconv DB cluster27 (T128151)
[10:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:58] <stashbot>	 T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151
[10:42:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow5002.eqsin.wmnet with OS bookworm
[10:43:00] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm
[10:46:36] <wikibugs>	 (03PS1) 10Ayounsi: Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177
[10:48:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:49:09] <wikibugs>	 (03PS1) 10Ayounsi: Remove call to now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178
[10:51:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi)
[10:51:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi)
[10:52:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:52:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove call to now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi)
[10:53:23] <wikibugs>	 (03Merged) 10jenkins-bot: Remove call to now gone _validate_vm function [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929178 (owner: 10Ayounsi)
[10:56:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:56:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[10:56:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[10:56:44] <wikibugs>	 (03PS1) 10Slyngshede: Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180
[10:56:59] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[10:57:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[10:59:55] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[11:00:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:10] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:30] <wikibugs>	 (03PS2) 10Slyngshede: Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180
[11:07:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi)
[11:07:59] <wikibugs>	 (03Merged) 10jenkins-bot: Add Python 3.11 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/929177 (owner: 10Ayounsi)
[11:08:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:13:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: codfw1dev: use new recursor address [puppet] - 10https://gerrit.wikimedia.org/r/928589 (https://phabricator.wikimedia.org/T338433) (owner: 10Arturo Borrero Gonzalez)
[11:15:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:45] <wikibugs>	 (03PS4) 10Jbond: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[11:17:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Build Bookworm base image [puppet] - 10https://gerrit.wikimedia.org/r/929168 (https://phabricator.wikimedia.org/T335560) (owner: 10Muehlenhoff)
[11:18:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[11:19:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:20] <wikibugs>	 (03PS1) 10Ayounsi: Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201)
[11:23:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow5002.eqsin.wmnet with reason: host reimage
[11:25:37] <wikibugs>	 (03PS2) 10Ayounsi: Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201)
[11:27:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow5002.eqsin.wmnet with reason: host reimage
[11:28:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[11:29:06] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[11:30:38] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[11:31:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[11:32:06] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[11:32:36] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[11:32:38] <wikibugs>	 (03PS2) 10Stevemunene: analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408)
[11:40:04] <wikibugs>	 (03PS2) 10Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343)
[11:47:31] <Amir1>	 jouncebot: nowandnexr
[11:47:32] <Amir1>	 jouncebot: nowandnext
[11:47:33] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[11:47:33] <jouncebot>	 In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300)
[11:47:43] <Amir1>	 cooool
[11:47:48] <wikibugs>	 (03PS3) 10Ladsgroup: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343)
[11:47:51] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup)
[11:48:49] <wikibugs>	 (03Merged) 10jenkins-bot: Set small wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929035 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup)
[11:49:09] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]]
[11:49:13] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[11:49:55] <jinxer-wm>	 (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[11:50:33] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[11:51:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow5002.eqsin.wmnet with OS bookworm
[11:51:13] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm completed: - netflow5002 (**PASS**)   -...
[11:51:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (10MoritzMuehlenhoff)
[11:51:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497)
[11:52:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:53:37] <wikibugs>	 (03PS1) 10Ayounsi: Remove cloudsw-loopback.pol (folded into common-loopback) [homer/public] - 10https://gerrit.wikimedia.org/r/929316
[11:54:02] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497)
[11:55:08] <wikibugs>	 (03PS1) 10David Martin: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722)
[11:55:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene)
[11:56:50] <wikibugs>	 (03PS3) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497)
[11:57:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:01:31] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929035|Set small wikis to read new for externallinks (T335343)]] (duration: 12m 22s)
[12:01:35] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[12:05:56] <wikibugs>	 (03PS2) 10Daimona Eaytoy: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364)
[12:08:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[12:08:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[12:09:18] <wikibugs>	 (03PS4) 10Daimona Eaytoy: Remove references to $wgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894
[12:10:42] <wikibugs>	 (03PS5) 10Daimona Eaytoy: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894
[12:11:53] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318
[12:20:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[12:20:44] <wikibugs>	 (03CR) 10Daimona Eaytoy: "(Note, I've scheduled this and the other patch for today's late backport window)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[12:22:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[12:27:19] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Write a cookbook to roll reboot cache hosts - https://phabricator.wikimedia.org/T338783 (10Volans) p:05Triage→03Medium
[12:28:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow6001.drmrs.wmnet with OS bookworm
[12:28:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm
[12:29:03] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene)
[12:29:13] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10Ottomata) > but nobody has complained about any specific errors  IIRC< That's because th...
[12:31:10] <wikibugs>	 (03CR) 10Muehlenhoff: Captcha: Allow users to request a new captcha. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede)
[12:34:50] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) 05Resolved→03Open > Now that we can specify a port range, we should, and we...
[12:34:56] <wikibugs>	 (03PS1) 10Jbond: homer: update tests for graphQL [software/homer] - 10https://gerrit.wikimedia.org/r/929324
[12:34:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] homer: update tests for graphQL [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond)
[12:36:15] <wikibugs>	 (03CR) 10Muehlenhoff: "Few more nits/typos/comments" [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede)
[12:36:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:36:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:37:28] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10MoritzMuehlenhoff) >>! In T111433#8922642, @BTullis wrote: >> Now that we can specify a...
[12:38:12] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff)
[12:38:31] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:45:55] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[12:47:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow6001.drmrs.wmnet with reason: host reimage
[12:47:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) Server will not boot  Unable to pull tsr report.  Troubleshooting steps already performed  Flea power Drain  Minimum configuration  disabling power button.  led light status on ma...
[12:49:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:50:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow6001.drmrs.wmnet with reason: host reimage
[12:52:41] <icinga-wm>	 PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:02] <wikibugs>	 (03PS1) 10Ayounsi: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082)
[12:55:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[12:59:33] <wikibugs>	 (03PS1) 10Ayounsi: Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082)
[12:59:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Replace Capirca with Aerleon [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[13:00:07] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300).
[13:00:07] <jouncebot>	 mfossati, duesen, Sohom_Datta, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:27] <taavi>	 o/ I'm around but would prefer let someone else deploy
[13:00:32] <Lucas_WMDE>	 o/
[13:00:35] <Sohom_Datta>	 o/
[13:00:36] <duesen>	 o/
[13:00:36] <Lucas_WMDE>	 I can deploy
[13:00:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: cleanup labsdnsconfig usage [puppet] - 10https://gerrit.wikimedia.org/r/929334
[13:00:52] <TheresNoTime>	 Lucas_WMDE: cool, was gonna ask if you could
[13:00:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Dwisehaupt anything else we can do to help with this task?
[13:01:02] <duesen>	 Lucas_WMDE: cool. I'm here. 
[13:01:14] <duesen>	 Amir1: are you around to keep an eye on x2 as well?
[13:01:37] <Lucas_WMDE>	 looks like no backports, that saves CI time ^^
[13:01:41] <Amir1>	 I'm around but for half an hour only
[13:01:41] <Lucas_WMDE>	 ~~Oops! All Backports~~
[13:01:52] <Lucas_WMDE>	 let’s start with duesen then?
[13:02:24] <mfossati>	 hi folks, here I am!
[13:02:39] <duesen>	 Lucas_WMDE: i'm ready whenever
[13:02:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10MoritzMuehlenhoff) The driver is simply not present in the Linux kernel present in Buster, so the problem isn't in the Buster installer per se :-)...
[13:02:48] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Switch VisualEditor to not use RESTbase on English Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:03:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:03:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10Papaul) @ssingh any update on this?
[13:03:21] * Lucas_WMDE takes a look at the other changes
[13:03:51] <wikibugs>	 (03Merged) 10jenkins-bot: Switch VisualEditor to not use RESTbase on English Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:04:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]]
[13:04:12] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:05:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and daniel: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[13:05:57] <Lucas_WMDE>	 duesen: anything to test on mwdebug?
[13:06:31] <wikibugs>	 (03CR) 10Samtar: [C: 03+1] "surprised this is valid, but the wisdom of S.O. says it is..! 🤷‍♀️" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta)
[13:06:33] <duesen>	 Lucas_WMDE: on it
[13:07:06] <Lucas_WMDE>	 ok
[13:07:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, though I can’t help but notice that all pages except one are linked to Q4618557 (the idwiki one is linked to Q6618850 instead), so I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati)
[13:07:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) Hi @papaul: We are all good from dc-ops side, this is on Traffic now. We wanted to get a few NTP changes out of the way before reimaging the next batch and therefore it's blocke...
[13:08:15] <TheresNoTime>	 Sohom_Datta: at least https://gerrit.wikimedia.org/r/929000 will stop that mildly annoying warning when developing..
[13:09:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow6001.drmrs.wmnet with OS bookworm
[13:09:20] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm completed: - netflow6001 (**PASS**)   -...
[13:09:30] <duesen>	 Lucas_WMDE: looks good
[13:09:32] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Not sure I feel comfortable deploying this tbh… who’s normally responsible for the CSP? Security team?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta)
[13:09:35] <Lucas_WMDE>	 ok, syncing
[13:09:38] <taavi>	 moritzm: did you trigger a manual build of the bullseye image?
[13:09:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster
[13:09:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster
[13:10:02] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster
[13:10:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e...
[13:11:53] <moritzm>	 taavi: you mean bookworm? not yet, do you need it on short notice, then I can kick it off manually
[13:12:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster
[13:12:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster
[13:13:02] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster
[13:13:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e...
[13:13:09] <taavi>	 moritzm: yeah, bookworm. I'm not in a particular hurry, but I'd also prefer not to wait for a week for it as the timer runs every Sunday
[13:14:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:14:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster
[13:14:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster
[13:14:37] <wikibugs>	 (03CR) 10Sohom Datta: Add localhost:* to the beta wiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (owner: 10Sohom Datta)
[13:14:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928590|Switch VisualEditor to not use RESTbase on English Wikipedia. (T320529)]] (duration: 10m 51s)
[13:15:03] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[13:15:34] <moritzm>	 taavi: sure thing, I've kicked it off manually now
[13:15:35] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/929335 (https://phabricator.wikimedia.org/T328899)
[13:15:41] <taavi>	 thank you!
[13:17:46] <Lucas_WMDE>	 duesen, Amir1: deployed
[13:17:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10Papaul) @ssingh thanks
[13:18:31] <Lucas_WMDE>	 also, Gerrit feels super slow whenever I open a new tab o_O
[13:18:39] <Lucas_WMDE>	 (but moving around in an existing gerrit tab is fine)
[13:18:52] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): ImageSuggestions: add help link to 4 new languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati)
[13:19:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati)
[13:19:45] <duesen>	 Amir1: stash writes are going up
[13:20:22] <wikibugs>	 (03Merged) 10jenkins-bot: ImageSuggestions: add help link to 4 new languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati)
[13:20:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]]
[13:20:42] <stashbot>	 T331036: [S] Add help link to article level image suggestions notifications for four additional languages - https://phabricator.wikimedia.org/T331036
[13:20:55] <jinxer-wm>	 (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[13:21:01] <Sohom_Datta>	 @Lucas_WMDE Should I open a phab task for the Beta wiki CSP task ?
[13:21:17] <Sohom_Datta>	 (and add the Security Team)
[13:21:19] <Lucas_WMDE>	 Sohom_Datta: I think that would be a good idea
[13:21:33] <Lucas_WMDE>	 not 100% sure we need to block this on security team
[13:21:40] <Lucas_WMDE>	 but at least a phab task seems like a better place to discuss this in general
[13:21:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and mfossati: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:22:07] <Lucas_WMDE>	 mfossati: can you test on mwdebug?
[13:22:51] <mfossati>	 Lucas_WMDE: give me a sec
[13:22:54] <duesen>	 Amir1: stash writes went from ~20 per minute to ~40 per minute... the original prediction was that per *second*. I am starting to think that we got our time units mixed up during the initial estimation.
[13:23:22] <duesen>	 If that is the case, we are looking at less than 2GB of data.
[13:23:24] <Lucas_WMDE>	 :D
[13:23:46] <duesen>	 otoh, the USA are still largely asleep.
[13:24:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update docker images for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/929335 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[13:24:24] <duesen>	 ok, hitting 80 writes/minute now
[13:24:25] <Amir1>	 duesen: I'm very skeptical of 140GB for VE data
[13:25:02] <Amir1>	 East coast should be awake by now but US is not that big in edit flows and traffic
[13:25:28] <wikibugs>	 (03PS1) 10Slyngshede: Wikimedia account link, clear banner on linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/929338
[13:26:07] <mfossati>	 Lucas_WMDE: you can go ahead, no testing on mwdebug is needed.
[13:26:38] <Lucas_WMDE>	 ok
[13:28:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) a:03Jclark-ctr
[13:28:34] <duesen>	 parser cache writes are up, from 10k to 15k per minute
[13:28:57] <duesen>	 stash writes hovering at around 60 per minute
[13:29:11] * duesen goes to double-check the unit
[13:29:21] <wikibugs>	 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh)
[13:29:39] * duesen confirms that it's per minute
[13:30:25] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff)
[13:31:51] <duesen>	 Amir1: network utilization seems to be going up a bit on db2142, but not dramatically
[13:32:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928801|ImageSuggestions: add help link to 4 new languages (T331036)]] (duration: 11m 23s)
[13:32:10] <stashbot>	 T331036: [S] Add help link to article level image suggestions notifications for four additional languages - https://phabricator.wikimedia.org/T331036
[13:32:12] <Amir1>	 let me check
[13:32:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove netflow2002 from Kafka config [puppet] - 10https://gerrit.wikimedia.org/r/929340 (https://phabricator.wikimedia.org/T330884)
[13:33:54] <Lucas_WMDE>	 can someone maybe +1 my config changes before I deploy them?
[13:34:00] <Lucas_WMDE>	 so they don’t have no review at all ^^
[13:34:08] <Lucas_WMDE>	 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1300)
[13:34:25] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Remove old origin-with-crossorigin referrer policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: 10TheDJ)
[13:35:43] <Lucas_WMDE>	 maybe TheresNoTime? 🥺
[13:35:54] <TheresNoTime>	 looking
[13:36:01] <duesen>	 Lucas_WMDE: I'm looking at them now. I have no idea what they mean :)
[13:36:09] <Lucas_WMDE>	 thanks :)
[13:36:17] <duesen>	 But I can +1 as "looks harmless" ;)
[13:36:25] <mfossati>	 Lucas_WMDE: same :-)
[13:36:30] <Lucas_WMDE>	 I’m doing a grep -r for the config cleanups to confirm there are no references to them in wmf.21
[13:36:31] <Lucas_WMDE>	 *.12
[13:36:45] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+1] "looks fine to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE))
[13:36:51] * TheresNoTime defers to others to break prod
[13:36:54] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): [wikidatawiki] Add pagelang to wikidata-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760)
[13:37:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE))
[13:37:06] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+1] "Don't know what this means for Wikidata, but shouldn't break anything else at least!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE))
[13:37:10] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+1] "Don't know what this means for Wikidata, but shouldn't break anything else at least!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE))
[13:37:15] <Lucas_WMDE>	 thanks!
[13:37:20] <Lucas_WMDE>	 the permissions change is straightforward to test at least
[13:37:24] * Lucas_WMDE prepares the curl command
[13:38:28] <wikibugs>	 (03Merged) 10jenkins-bot: [wikidatawiki] Add pagelang to wikidata-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE))
[13:38:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]]
[13:38:46] <stashbot>	 T337760: Wikidata: Add pagelang right to wikidata-staff group - https://phabricator.wikimedia.org/T337760
[13:40:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:41:29] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Tested on mwdebug with:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) (owner: 10Lucas Werkmeister (WMDE))
[13:45:31] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783)
[13:46:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:928798|[wikidatawiki] Add pagelang to wikidata-staff (T337760)]] (duration: 07m 27s)
[13:46:13] <stashbot>	 T337760: Wikidata: Add pagelang right to wikidata-staff group - https://phabricator.wikimedia.org/T337760
[13:46:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE))
[13:47:02] <Sohom_Datta>	 Lucas_WMDE: Created https://phabricator.wikimedia.org/T338790
[13:47:28] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE))
[13:47:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]]
[13:49:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:49:26] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Tested by loading https://www.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=L1-S1&format=xmlfm and checking that the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) (owner: 10Lucas Werkmeister (WMDE))
[13:49:52] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107)
[13:50:48] <duesen>	 Amir1: ok, I'm calling this a success. I see no impact at all on x2.
[13:51:10] <Amir1>	 awesome, I still think this needs to be compressed :P
[13:51:11] <Lucas_WMDE>	 \o/
[13:51:39] <wikibugs>	 (03PS3) 10Slyngshede: Enable password reset and fix wording. [software/bitu] - 10https://gerrit.wikimedia.org/r/929171
[13:51:46] <wikibugs>	 (03CR) 10Slyngshede: Enable password reset and fix wording. (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede)
[13:52:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[13:52:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[13:52:19] <wikibugs>	 (03PS2) 10Hashar: Add localhost:* to the beta wiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta)
[13:52:34] <wikibugs>	 10SRE: compare Probenet data w/ NEL data - https://phabricator.wikimedia.org/T337317 (10CDanis)
[13:54:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:923619|Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag (T335783)]] (duration: 06m 54s)
[13:54:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE))
[13:54:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:55:32] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE))
[13:55:42] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update outlink isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929342
[13:55:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]]
[13:55:50] <stashbot>	 T335107: Remove temporary feature flag for Entity Labels in parsed edit summaries in API requests again - https://phabricator.wikimedia.org/T335107
[13:56:10] <wikibugs>	 (03PS3) 10Slyngshede: Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180
[13:57:02] * duesen goes afk for an hour
[13:57:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:57:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede)
[13:57:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Tested by loading https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&revids=1810599589&formatversion=2&rvprop=comm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) (owner: 10Lucas Werkmeister (WMDE))
[13:58:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update outlink isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929342 (owner: 10AikoChou)
[13:58:12] <wikibugs>	 (03CR) 10Slyngshede: Captcha: Allow users to request a new captcha. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede)
[13:59:33] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:01:16] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:01:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede)
[14:02:06] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Captcha: Allow users to request a new captcha. [software/bitu] - 10https://gerrit.wikimedia.org/r/929180 (owner: 10Slyngshede)
[14:02:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:923623|Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag (T335107)]] (duration: 06m 49s)
[14:02:40] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:02:41] <stashbot>	 T335107: Remove temporary feature flag for Entity Labels in parsed edit summaries in API requests again - https://phabricator.wikimedia.org/T335107
[14:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:02] <mfossati>	 Lucas_WMDE: thanks for deploying!
[14:03:15] <Lucas_WMDE>	 Sohom_Datta: let’s see what happens on https://phabricator.wikimedia.org/T338790 – if no one else shares my concerns then I’m okay with this being deployed after all, but I’d like to leave some opportunity for feedback
[14:03:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr)
[14:03:33] <Sohom_Datta>	 Sure :)
[14:04:19] <icinga-wm>	 RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:32] <wikibugs>	 (03CR) 10Jbond: "lgtm but see comments" [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:05:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr)
[14:07:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:46] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I have amended the commit message to point to T338790." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta)
[14:11:59] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:59] <wikibugs>	 (03CR) 10Jbond: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh)
[14:15:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1010.eqiad.wmnet']
[14:15:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet']
[14:16:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1010.eqiad.wmnet']
[14:16:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1011.eqiad.wmnet']
[14:16:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1010.eqiad.wmnet']
[14:16:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet']
[14:17:32] <zabe>	 jouncebot: nowandnext
[14:17:32] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[14:17:32] <jouncebot>	 In 1 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1530)
[14:17:34] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:43] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[14:17:48] <wikibugs>	 (03PS2) 10Jforrester: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[14:17:54] <wikibugs>	 (03CR) 10Jforrester: "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[14:18:50] <wikibugs>	 (03Merged) 10jenkins-bot: Correct the value for schema_title for stream wikifunctions.ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929317 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[14:22:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup1010.eqiad.wmnet']
[14:23:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup1011.eqiad.wmnet']
[14:23:34] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder)
[14:24:54] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Enable password reset and fix wording. [software/bitu] - 10https://gerrit.wikimedia.org/r/929171 (owner: 10Slyngshede)
[14:26:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[14:26:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bullseye
[14:26:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[14:26:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye
[14:28:35] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster
[14:28:38] <wikibugs>	 (03PS2) 10Slyngshede: Wikimedia account link, clear banner on linking. [software/bitu] - 10https://gerrit.wikimedia.org/r/929338
[14:28:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e...
[14:29:11] <zabe>	 !log Deployed updated mitigations for T336027
[14:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:35:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Remove netflow2002 from Kafka config [puppet] - 10https://gerrit.wikimedia.org/r/929340 (https://phabricator.wikimedia.org/T330884) (owner: 10Muehlenhoff)
[14:38:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage
[14:41:35] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[14:41:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage
[14:42:56] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6001.drmrs.wmnet
[14:44:02] <fabfur>	 !log rebooting cp6001.drmrs.wmnet for upgrade
[14:44:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:37] <wikibugs>	 10Puppet, 10Analytics-Radar, 10Data-Engineering-Icebox: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10joanna_borun)
[14:44:43] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ores: enable per wiki deployment of Ores deprecation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170)
[14:45:15] <elukey>	 wow --^
[14:45:39] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059 (10joanna_borun)
[14:47:38] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054 (10jbond) 05Open→03In progress I believe we now have this in puppet; please re-open if I missed something
[14:48:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:50:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "lgtm but let's make sure the pcc thinks this doesn't change anything in eqiad1." [puppet] - 10https://gerrit.wikimedia.org/r/929334 (owner: 10Arturo Borrero Gonzalez)
[14:50:16] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10PostgreSQL, 10User-jbond: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) 05Open→03Resolved a:03jbond unfortunately i forgot what this relates to and general performance is improved now
[14:50:22] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond)
[14:50:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10joanna_borun)
[14:51:20] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6001.drmrs.wmnet
[14:51:59] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Puppet-Core, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10joanna_borun)
[14:52:37] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[14:53:42] <wikibugs>	 (03PS1) 10Effie Mouzeli: shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354
[14:54:10] <wikibugs>	 (03CR) 10Ladsgroup: ores: enable per wiki deployment of Ores deprecation (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[14:55:50] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054 (10jbond) 05In progress→03Resolved a:03jbond
[14:55:56] <wikibugs>	 (03PS1) 10Andrew Bogott: base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356
[14:56:02] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: puppetmaster: frontend: remove IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/929357
[14:56:10] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10Performance-Team (Radar): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10joanna_borun)
[14:56:28] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6009.drmrs.wmnet
[14:56:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott)
[14:56:41] <fabfur>	 !log reboot cp6009.drmrs.wmnet for upgrade
[14:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[14:58:10] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[14:58:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[14:58:42] <wikibugs>	 (03PS2) 10Andrew Bogott: base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356
[14:58:45] <wikibugs>	 10Puppet, 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: PKI server don't reimage cleanly - https://phabricator.wikimedia.org/T270269 (10joanna_borun)
[14:59:27] <wikibugs>	 10Puppet, 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: PKI server don't reimage cleanly - https://phabricator.wikimedia.org/T270269 (10joanna_borun) p:05Medium→03Low
[15:00:00] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170)
[15:00:11] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[15:00:16] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:00:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1011.eqiad.wmnet with OS bullseye
[15:00:29] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10jbond) 05Open→03Resolved a:03jbond
[15:00:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye completed: - back...
[15:01:28] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10joanna_borun)
[15:01:31] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10jbond) @Volans do you know if this is still an issue
[15:01:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott)
[15:01:39] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: use variables in rsyncquickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/928136 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth)
[15:03:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/929334/41669/" [puppet] - 10https://gerrit.wikimedia.org/r/929334 (owner: 10Arturo Borrero Gonzalez)
[15:04:27] <wikibugs>	 (03CR) 10Ladsgroup: ores: override Beta cluster liftwing URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[15:04:44] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6009.drmrs.wmnet
[15:06:09] <wikibugs>	 (03PS1) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529)
[15:06:11] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365
[15:06:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] base-apt-conf: corrected comments slightly [puppet] - 10https://gerrit.wikimedia.org/r/929356 (owner: 10Andrew Bogott)
[15:07:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 (owner: 10Andrew Bogott)
[15:08:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[15:09:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[15:10:21] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10Volans) @jbond, no idea if this is still happening, I guess we could look at a bunch of puppet run logs from the reimages and see if there...
[15:10:32] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365
[15:12:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::clientpackages::vms::common: don't install ebtables [puppet] - 10https://gerrit.wikimedia.org/r/929365 (owner: 10Andrew Bogott)
[15:14:16] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6002.drmrs.wmnet
[15:17:24] <fabfur>	 !log reboot cp6002.drmrs.wmnet for upgrade
[15:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:37] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41671/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[15:18:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10aborrero)
[15:18:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[15:18:38] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) 05Open→03Stalled This is done for codfw1dev DNS servers.  I'll mark this task as stalled unti...
[15:19:36] <icinga-wm>	 PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_cron.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bullseye
[15:21:04] <wikibugs>	 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Gehel) Removing Search Platform, our work here is done.
[15:23:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[15:23:23] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6002.drmrs.wmnet
[15:24:08] <jinxer-wm>	 (ProbeDown) firing: Service vrts2001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts2001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:13] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6010.drmrs.wmnet
[15:25:36] <fabfur>	 !log rebooting cp6010.drmrs.wmnet for upgrade
[15:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/929357/41673/" [puppet] - 10https://gerrit.wikimedia.org/r/929357 (owner: 10Arturo Borrero Gonzalez)
[15:30:04] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1530).
[15:32:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354 (owner: 10Effie Mouzeli)
[15:32:54] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero)
[15:34:20] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6010.drmrs.wmnet
[15:40:14] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546)
[15:41:45] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Remove leftover TODO item [dns] - 10https://gerrit.wikimedia.org/r/928900 (https://phabricator.wikimedia.org/T309074) (owner: 10BCornwall)
[15:42:29] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond)
[15:42:59] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6003.drmrs.wmnet
[15:43:21] <fabfur>	 !log reboot cp6003.drmrs.wmnet for upgrade
[15:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:06] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond)
[15:44:17] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:45:16] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929378 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:51:53] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6003.drmrs.wmnet
[15:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:57:02] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10Vgutierrez) I've replicated a successful mTLS handshake with openssl s_client using the following CMD: ` vgutierrez@cp4037:~$ sudo openssl s_client -connect kafka-jumbo1001.eqi...
[15:57:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10jbond) p:05Triage→03Medium
[15:58:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:58:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: puppetmaster: frontend: remove IPv6 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929357 (owner: 10Arturo Borrero Gonzalez)
[15:59:08] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6011.drmrs.wmnet
[15:59:28] <fabfur>	 !log reboot cp6011.drmrs.wmnet for upgrade
[15:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:18] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:929378| Bumping portals to master (T128546)]] (duration: 14m 21s)
[16:01:22] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:01:54] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete
[16:02:06] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete
[16:07:22] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:929378| Bumping portals to master (T128546)]] (duration: 06m 03s)
[16:07:26] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:08:23] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6011.drmrs.wmnet
[16:10:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: remove keystoneadmin section [puppet] - 10https://gerrit.wikimedia.org/r/923696 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[16:11:35] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] labs_boostrapvz: Remove class [puppet] - 10https://gerrit.wikimedia.org/r/892944 (owner: 10Majavah)
[16:11:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 3 others: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (10jbond)
[16:12:33] <wikibugs>	 10Puppet, 10SRE, 10Traffic, 10User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (10jbond)
[16:12:55] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond)
[16:13:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: Rake tasks: add colours and buffer output - https://phabricator.wikimedia.org/T237508 (10jbond)
[16:13:34] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (10jbond)
[16:14:04] <wikibugs>	 10Puppet, 10SRE, 10Traffic, 10User-jbond: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] - https://phabricator.wikimedia.org/T238198 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez @jbond I think we can close this one
[16:14:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: puppet-merge shouldn't fail if `tput` doesn't grok your terminal - https://phabricator.wikimedia.org/T221985 (10jbond)
[16:14:46] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Why doesn't profile::mediawiki::nutcracker create /var/run/nutcracker/ ? - https://phabricator.wikimedia.org/T204450 (10jbond) 05Open→03Resolved a:03jbond we no longer have this profile
[16:15:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10jbond)
[16:15:32] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ores: override Beta cluster liftwing URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[16:15:42] <wikibugs>	 (03PS1) 10Slyngshede: C:IDM Minor tweak to captcha. [puppet] - 10https://gerrit.wikimedia.org/r/929381
[16:15:46] <wikibugs>	 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Performance-Team (Radar): Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034 (10jbond)
[16:16:32] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:IDM Minor tweak to captcha. [puppet] - 10https://gerrit.wikimedia.org/r/929381 (owner: 10Slyngshede)
[16:17:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet CI should fail over CRLF line endings (sometimes) - https://phabricator.wikimedia.org/T182641 (10jbond)
[16:20:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet wmf-style-guide: array of classes not detected properly - https://phabricator.wikimedia.org/T179230 (10jbond)
[16:21:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692 (10jbond)
[16:22:18] <wikibugs>	 10Puppet: Module uwsgi doesn't allow passing multiple config params of same name - https://phabricator.wikimedia.org/T123809 (10jbond)
[16:23:57] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Mail: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933 (10jbond) @MoritzMuehlenhoff should this be closed
[16:24:57] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246 (10jbond) 05Open→03Resolved a:03jbond I'm going to close this; I'm pretty sure it's fixed now, but please re-open if not
[16:25:20] <wikibugs>	 10Puppet, 10Toolforge, 10Documentation: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (10jbond)
[16:25:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10jbond)
[16:26:34] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10jbond)
[16:27:03] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: filter large factsets - https://phabricator.wikimedia.org/T287674 (10jbond) 05Open→03Resolved a:03jbond We added some filters and puppetdb performance seems to have settled
[16:27:10] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond)
[16:28:16] <wikibugs>	 10Puppet, 10SRE, 10Observability-Alerting, 10Puppet-Infrastructure: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (10jbond)
[16:28:53] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10conftool: confd fails to start after a reimage - https://phabricator.wikimedia.org/T244477 (10jbond)
[16:29:43] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: create puppetboard debian package - https://phabricator.wikimedia.org/T292523 (10jbond) 05In progress→03Resolved
[16:29:53] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond)
[16:30:30] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: Update python3-pypuppetdb package to 2.4.0 - https://phabricator.wikimedia.org/T292525 (10jbond) 05Open→03Resolved
[16:30:34] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10User-jbond: create puppetboard debian package - https://phabricator.wikimedia.org/T292523 (10jbond)
[16:30:45] <wikibugs>	 (03CR) 10Vgutierrez: "looks good overall. I'll get to merge it tomorrow EU morning." [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) (owner: 10Majavah)
[16:33:51] <wikibugs>	 (03PS1) 10Jbond: tlsproxy::localssl: drop class [puppet] - 10https://gerrit.wikimedia.org/r/929383 (https://phabricator.wikimedia.org/T191393)
[16:36:18] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Puppetdb: not refreshed on config change? - https://phabricator.wikimedia.org/T291540 (10jbond) 05Open→03Resolved a:03jbond @volans I'm going to reject this and say it's better to manually disable puppet fleet-wide to roll out these changes, but please re-open if you...
[16:36:21] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond)
[16:36:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED
[16:36:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[16:37:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[16:39:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-2] "we aren't ready for this until https://storyboard.openstack.org/#!/story/2010784 is resolved" [puppet] - 10https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577) (owner: 10Andrew Bogott)
[16:39:50] <wikibugs>	 (03PS13) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759)
[16:39:52] <wikibugs>	 (03PS21) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759)
[16:39:54] <wikibugs>	 (03PS2) 10Andrew Bogott: Set OS_CLOUD in wmcs-openstack.sh [puppet] - 10https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577)
[16:43:22] <wikibugs>	 (03PS1) 10Elukey: profile::cache::kafka::certificate: fix pki cert path [puppet] - 10https://gerrit.wikimedia.org/r/929384
[16:46:25] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41675/console" [puppet] - 10https://gerrit.wikimedia.org/r/929384 (owner: 10Elukey)
[16:46:49] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::cache::kafka::certificate: fix pki cert path [puppet] - 10https://gerrit.wikimedia.org/r/929384 (owner: 10Elukey)
[16:47:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm)
[16:47:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] ldap: inline yamlconfig [puppet] - 10https://gerrit.wikimedia.org/r/924984 (owner: 10Majavah)
[16:48:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003']
[16:48:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003']
[16:49:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003']
[16:50:53] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[16:52:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[16:52:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] ldap::client::sssd: use strongly typed parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah)
[16:54:45] <wikibugs>	 (03PS1) 10BBlack: geo-maps: Move default to the top for visibility [dns] - 10https://gerrit.wikimedia.org/r/929386 (https://phabricator.wikimedia.org/T337535)
[16:55:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003']
[16:55:19] <wikibugs>	 (03CR) 10Ladsgroup: "shouldn't we exclude wikidata and commons explicitly?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[16:55:26] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond)
[16:55:30] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967 (10jbond) 05Open→03Resolved a:03jbond closing this it must be done now
[16:55:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm)
[16:55:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) p:05Medium→03Low
[16:56:05] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (10jbond)
[16:56:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond)
[16:59:01] <wikibugs>	 (03CR) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1700)
[17:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T1700).
[17:00:55] <wikibugs>	 (03PS1) 10Hnowlan: Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649)
[17:03:28] <mutante>	 !log creating ganeti VM people1004 with os==bookworm passed to makevm cookbook to test bookworm and because this is traditionally an early adopter of new distro releases
[17:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet
[17:03:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[17:03:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) a:05Jclark-ctr→03Jhancock.wm @BTullis hi, can you give me more information on what type of hardware raid we are using on these se...
[17:07:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: puppet (systemd::service) attempts to start manually masked units - https://phabricator.wikimedia.org/T211027 (10jbond) > Looks like this is working as intended for systemd provider (/usr/lib/ruby/vendor_ruby/puppet/provider/service/systemd.rb) although if...
[17:07:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[17:08:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[17:08:34] <icinga-wm>	 PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[17:08:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186 (10jbond)
[17:08:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[17:08:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186 (10jbond) update the list in the description
[17:09:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Fix regex.yaml single-regex issue - https://phabricator.wikimedia.org/T183565 (10jbond)
[17:09:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[17:09:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:09:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[17:09:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[17:10:19] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001"
[17:10:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Use multiple puppetdbs on puppet masters - https://phabricator.wikimedia.org/T169318 (10jbond) I'm curious how puppetdb failed. Do you remember? As the Postgres write master is always on the primary puppetdb server, I'm not sure we would get much of a win her...
[17:11:20] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001"
[17:12:23] <wikibugs>	 (03PS2) 10Dzahn: phabricator: "Automate yearly metrics for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper)
[17:12:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "nitpick: in puppet repo, commit message should start with name of module followed by :" [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper)
[17:14:35] <wikibugs>	 (03PS1) 10Hnowlan: poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649)
[17:14:48] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Add check for puppetboard - https://phabricator.wikimedia.org/T296304 (10jbond) 05Open→03Resolved a:03jbond
[17:15:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people1004.eqiad.wmnet with OS bookworm
[17:17:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet does not undo manual "systemctl mask $unit" - https://phabricator.wikimedia.org/T285425 (10jbond)
[17:18:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 2 others: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond)
[17:18:31] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond) 05In progress→03Resolved
[17:18:46] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond)
[17:18:49] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) 05In progress→03Resolved
[17:19:47] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10jbond) for the record, with puppet 7 I plan to explore using SRV records, which may help with this
[17:20:07] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: update hiera order in production environment - https://phabricator.wikimedia.org/T301349 (10jbond) 05Open→03Resolved a:03jbond
[17:21:32] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Maps, 10netbox: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) @hnowlan I did some patches to add support for this with the puppetdb upgrade. It no longer supports password changes but it does all...
[17:21:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Maps, 10Puppet-Infrastructure, and 2 others: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond)
[17:22:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[17:22:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[17:23:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-jbond: puppetmasters: update the puppet masters so they use them self for the puppet run - https://phabricator.wikimedia.org/T238093 (10jbond)
[17:23:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond)
[17:24:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10jbond)
[17:24:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10jbond)
[17:24:37] <wikibugs>	 10Puppet, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin: check if additional groups in systemd::sysuser conflicts with admin.yaml - https://phabricator.wikimedia.org/T308826 (10jbond)
[17:25:09] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10jbond) 05Open→03Resolved a:03jbond
[17:25:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[17:26:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) 05Open→03Resolved a:03jbond
[17:26:56] <wikibugs>	 10SRE, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10jbond)
[17:27:13] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) 05Open→03Resolved a:03jbond
[17:27:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10observability, 10User-jbond: Add monitoring for the puppet-netbox repository - https://phabricator.wikimedia.org/T310639 (10jbond)
[17:28:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[17:28:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond)
[17:28:48] <icinga-wm>	 RECOVERY - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[17:28:54] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Usual git mechanism for aborting commit does not work on the private puppet repo - https://phabricator.wikimedia.org/T211121 (10jbond) 05Open→03Resolved a:03jbond closing, but please re-open if it's still an issue
[17:29:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10observability: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228 (10jbond) 05Open→03Declined closing due to lack of response
[17:30:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10observability, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10jbond)
[17:31:24] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (10jbond)
[17:31:48] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10jbond) 05Open→03Resolved a:03jbond closing this we have hopefully made it past the puppetdb issues
[17:31:53] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond)
[17:32:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10jbond)
[17:33:36] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10User-jbond: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10jbond) 05In progress→03Resolved a:03jbond going to resolve this; i think the original question was answered
[17:33:45] <wikibugs>	 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10jbond)
[17:35:21] <wikibugs>	 10Puppet, 10Puppet-Infrastructure, 10cloud-services-team: Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10jbond)
[17:38:05] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[17:38:15] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[17:41:10] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:42:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:46:21] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] "This is a functional no-op, just moving and commenting on this "default" entry for clarity and visibility." [dns] - 10https://gerrit.wikimedia.org/r/929386 (https://phabricator.wikimedia.org/T337535) (owner: 10BBlack)
[17:50:52] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:51:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:03:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "tested and looks good to me now:" [puppet] - 10https://gerrit.wikimedia.org/r/928993 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper)
[18:04:06] <logmsgbot>	 !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host people1004.eqiad.wmnet with OS bookworm
[18:04:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:06:13] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:09:27] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:09:27] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:09:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[18:09:30] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[18:09:31] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host people1004.eqiad.wmnet
[18:14:17] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet
[18:14:18] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:17:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:18:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bullseye
[18:20:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:21:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:21:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:21:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[18:21:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[18:21:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:24:28] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:24:55] <mutante>	 I have no idea why it first adds records and then removes them again
[18:24:58] <mutante>	 in the same cookbook run
[18:25:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:25:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:25:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[18:25:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[18:25:41] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people1004.eqiad.wmnet
[18:26:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet
[18:26:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:28:21] <volans>	 mutante: on failure it rolls back the newly assigned IP and related DNS records
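The behaviour volans describes (records added, then removed again within the same failed run) is a rollback-on-failure pattern. A minimal sketch follows, assuming hypothetical names throughout: `make_vm`, `create_vm`, and the `log` list are illustrative only and are not the actual sre.ganeti.makevm cookbook code.

```python
from contextlib import ExitStack


def make_vm(hostname, create_vm, log):
    """Add DNS records, then create the VM; undo the records if creation fails.

    `create_vm` is a hypothetical callable standing in for the real VM
    creation step; `log` is a plain list collecting the actions taken.
    """
    with ExitStack() as rollback:
        log.append(f"Add records for VM {hostname}")
        # Register the undo step as soon as the records exist.
        rollback.callback(log.append, f"Remove records for VM {hostname}")

        create_vm(hostname)  # may raise; the callback above then runs

        # Success: drop the registered undo steps so nothing is removed.
        rollback.pop_all()
    return log
```

With a `create_vm` that raises, the log ends up holding an "Add records" entry followed by a "Remove records" entry, matching the people1004 runs above; with one that succeeds, only the "Add records" entry remains, as in the people2003 run.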
[18:31:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "checked compiler output -> full catalog. looks good to me. this will add the rsync on 1003 to push to 2002 and it looks absented on 2002 i" [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney)
[18:32:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) @Papaul Your side is all set. We have some switch overs scheduled for the end of the month to finish up our side of the task too. Thanks fo...
[18:32:46] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:33:45] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:33:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:33:46] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[18:33:49] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[18:33:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:35:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:36:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:36:40] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:37:05] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people1004.eqiad.wmnet - dzahn@cumin1001"
[18:37:05] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:37:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors
[18:37:08] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors
[18:37:14] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people1004.eqiad.wmnet
[18:37:56] <wikibugs>	 (03Abandoned) 10Sohom Datta: Add localhost:* to the beta wiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929000 (https://phabricator.wikimedia.org/T338790) (owner: 10Sohom Datta)
[18:39:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people2003.codfw.wmnet
[18:39:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[18:41:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001"
[18:42:20] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[18:42:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[18:42:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001"
[18:42:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:42:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people2003.codfw.wmnet on all recursors
[18:42:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2003.codfw.wmnet on all recursors
[18:43:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001"
[18:44:15] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001"
[18:48:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:54:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) Replaced Failed Dimm DIMM_B6
[18:54:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) 05Open→03Resolved
[18:55:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) The Engineer is expected to arrive on 06/13/2023 09:00 AM to 06:00 PM
[19:03:54] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:05:38] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people2003.codfw.wmnet with OS bookworm
[19:08:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:11:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[19:11:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[19:11:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye
[19:11:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err...
[19:14:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[19:14:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[19:14:59] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye
[19:15:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err...
[19:15:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:15:44] <wikibugs>	 (03PS1) 10Chad: deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775)
[19:16:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: 10Chad)
[19:18:22] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:18:32] <icinga-wm>	 RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:23:10] <icinga-wm>	 PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:42] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: puppetdb7 cross polonation - https://phabricator.wikimedia.org/T338811 (10MoritzMuehlenhoff)
[19:28:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149']
[19:32:12] <wikibugs>	 (03PS3) 10Samtar: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy)
[19:33:08] <wikibugs>	 (03PS6) 10Samtar: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[19:33:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[19:33:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye
[19:33:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[19:33:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err...
[19:33:42] <wikibugs>	 (03PS2) 10Samtar: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[19:34:13] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 136 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:34:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Papaul)
[19:35:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Papaul) @Jhancock.wm you can proceed with the OS install
[19:35:44] <wikibugs>	 (03PS1) 10Aklapper: phabricator: "Automate yearly metrics for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/929402 (https://phabricator.wikimedia.org/T337388)
[19:35:44] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@fb9dba3]: repoint drafttopic ingestion to model specific stream
[19:35:54] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@fb9dba3]: repoint drafttopic ingestion to model specific stream (duration: 00m 10s)
[19:38:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1149']
[19:38:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye
[19:38:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye
[19:38:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: "Automate yearly metrics for wikitech-l": Fix var typo [puppet] - 10https://gerrit.wikimedia.org/r/929402 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper)
[19:39:19] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:41:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149']
[19:46:22] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41678/console" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez)
[19:47:53] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149']
[19:49:44] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41679/console" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez)
[19:50:46] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez)
[19:51:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[19:54:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[19:57:52] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] fifo_log_demux: Fix systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T2000).
[20:00:05] <jouncebot>	 Daimona and Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:28] <Daimona>	 o/
[20:00:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye
[20:00:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye
[20:01:37] <urbanecm>	 i can deploy today
[20:01:37] * TheresNoTime will assume urbanecm will be doing the deploy window given their patches ^^
[20:01:43] * TheresNoTime assumed correctly
[20:01:51] * taavi was just about to assume that
[20:02:25] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy)
[20:02:36] * Daimona thanks Urbanecm
[20:02:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[20:02:45] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[20:03:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy)
[20:03:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[20:03:12] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[20:03:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[20:03:16] <wikibugs>	 (03Merged) 10jenkins-bot: prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925775 (https://phabricator.wikimedia.org/T336364) (owner: 10Daimona Eaytoy)
[20:03:25] <wikibugs>	 (03Merged) 10jenkins-bot: Remove references to $wgEnableLocalTimedText from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 (owner: 10Daimona Eaytoy)
[20:03:35] <brett>	 !log Roll restarting pybal on lvs2014 then lvs2013 - T863380
[20:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:40] <wikibugs>	 (03PS3) 10Urbanecm: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[20:03:44] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[20:04:31] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused variable wmgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929318 (owner: 10Daimona Eaytoy)
[20:04:48] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]]
[20:04:51] <stashbot>	 T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364
[20:04:52] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul)
[20:05:08] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:05:19] <wikibugs>	 (03PS2) 10Chad: deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775)
[20:06:15] <logmsgbot>	 !log urbanecm@deploy1002 daimona and urbanecm: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:06:26] <urbanecm>	 Daimona: can you test your patches at mwdebug1001?
[20:07:38] <Daimona>	 Hmmmm... First one should be a noop, so I can try and make sure that nothing explodes. No idea for the other two, though...
[20:08:49] <urbanecm>	 the other two seem like no-ops to me too?
[20:08:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:08:55] <urbanecm>	 testing nothing explodes makes sense to me :)
[20:08:56] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:09:47] <Daimona>	 Yeah, they should all be noop actually
[20:10:15] <Daimona>	 And it's looking good to me on mwdebug1001
[20:10:42] <urbanecm>	 good, syncing
[20:11:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:14:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:14:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1010.eqiad.wmnet with OS bullseye
[20:14:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye completed: - back...
[20:15:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr)
[20:15:40] <wikibugs>	 (03PS1) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942
[20:16:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (owner: 10BCornwall)
[20:16:15] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) a:03Jclark-ctr
[20:16:21] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:925775|prod: Stop setting unused $wgCampaignEventsUseNewTrackingToolsSchema (T336364)]], [[gerrit:802894|Remove references to $wgEnableLocalTimedText from CommonSettings]], [[gerrit:929318|Remove unused variable wmgEnableLocalTimedText]] (duration: 11m 33s)
[20:16:25] <stashbot>	 T336364: Update prod to use the new tracking tools schema - https://phabricator.wikimedia.org/T336364
[20:16:28] <urbanecm>	 Daimona: deployed :)
[20:16:30] <urbanecm>	 anything else?
[20:16:42] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203)
[20:16:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm)
[20:16:52] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.25:443]) https://wikitech.wikimedia.org/wiki/PyBal
[20:16:56] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 77 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[20:17:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm)
[20:17:24] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) mw1492 T338566 Server down to failed Mainboard pending replacement
[20:17:42] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm)
[20:17:57] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]]
[20:18:00] <stashbot>	 T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203
[20:18:37] <Daimona>	 Amazing, thank you :)
[20:18:49] <urbanecm>	 any time
[20:19:25] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:19:46] <wikibugs>	 (03PS1) 10Dzahn: Revert "Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943
[20:20:29] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10Jclark-ctr) 05Open→03Resolved Replaced cable on ganeti1031
[20:21:40] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 80 connections established with conf1007.eqiad.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal
[20:22:34] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host people2003.codfw.wmnet with OS bookworm
[20:22:34] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[20:23:45] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:23:58] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) @ayounsi Tomorrow I would like your assistance, if available, to clean the fiber / replace the optic
[20:24:07] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) a:03Jclark-ctr
[20:24:50] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:928824|[Growth] Enable user impact refresh for rowiki (T336203)]] (duration: 06m 53s)
[20:25:00] <stashbot>	 T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203
[20:25:48] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[20:26:42] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.25:443]) https://wikitech.wikimedia.org/wiki/PyBal
[20:28:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people2003.codfw.wmnet - dzahn@cumin1001"
[20:28:10] <wikibugs>	 (03PS1) 10BCornwall: service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410
[20:28:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (owner: 10BCornwall)
[20:28:34] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] Enable new Impact module for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203)
[20:28:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm)
[20:28:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (owner: 10BCornwall)
[20:28:53] <urbanecm>	 !log Run extensions/GrowthExperiments/maintenance/refreshUserImpactData.php for rowiki (T336203)
[20:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM people2003.codfw.wmnet - dzahn@cumin1001"
[20:29:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:29:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people2003.codfw.wmnet on all recursors
[20:29:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2003.codfw.wmnet on all recursors
[20:29:10] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host people2003.codfw.wmnet
[20:29:34] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Enable new Impact module for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm)
[20:29:50] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]]
[20:29:52] <wikibugs>	 (03PS2) 10BCornwall: service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380)
[20:30:15] <wikibugs>	 (03CR) 10BBlack: service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall)
[20:30:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.25:443]) https://wikitech.wikimedia.org/wiki/PyBal
[20:30:38] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 126 connections established with conf1007.eqiad.wmnet:4001 (min=127) https://wikitech.wikimedia.org/wiki/PyBal
[20:31:10] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41680/console" [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall)
[20:31:10] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:31:17] <stashbot>	 T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203
[20:31:27] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] service: Expect 302 response for prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/929410 (https://phabricator.wikimedia.org/T863380) (owner: 10BCornwall)
[20:31:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:36:57] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:928825|[Growth] Enable new Impact module for rowiki (T336203)]] (duration: 07m 06s)
[20:37:01] <stashbot>	 T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203
[20:38:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus-https_443: Servers prometheus2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:41:45] * urbanecm done
[20:44:15] <wikibugs>	 (03PS2) 10Dzahn: Revert "Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387)
[20:46:19] <wikibugs>	 (03CR) 10Dzahn: "probably we should just use a single config file instead of repeating the same mysql metrics user for each script.. but let's do that sepa" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[20:48:20] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Switch VisualEditor to bypass RESTbase on all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[20:50:58] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Enable analysis chain deduplication for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929411 (https://phabricator.wikimedia.org/T334194)
[20:56:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS bullseye
[20:56:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye executed with errors: - srete...
[20:58:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Ladsgroup) Thanks! I'm setting the mysql up and making sure it's getting replicated.
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230612T2100)
[21:03:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter"" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:05:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Failed to parse calendar specification '*-1, 4, 7, 10-1 0:0:00': Invalid argument :/" [puppet] - 10https://gerrit.wikimedia.org/r/928943 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:09:55] <wikibugs>	 (03PS1) 10Dzahn: phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387)
[21:10:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:16:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster
[21:16:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster
[21:20:06] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[21:20:32] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[21:23:25] <wikibugs>	 (03PS1) 10Papaul: Add sretest1003 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/929418 (https://phabricator.wikimedia.org/T334393)
[21:24:24] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add sretest1003 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/929418 (https://phabricator.wikimedia.org/T334393) (owner: 10Papaul)
[21:28:34] <wikibugs>	 (03PS2) 10Dzahn: phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387)
[21:28:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye
[21:29:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye
[21:30:00] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:30:14] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:31:24] <wikibugs>	 (03PS1) 10BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T863380)
[21:32:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: try to fix month range format for quartely timer [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:33:34] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder)
[21:33:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) @ArielGlenn I am still working on those servers. After @MoritzMuehlenhoff showed me the fix for installing Buster on those servers, I tried it o...
[21:34:42] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:36:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[phab1004:~] $ sudo systemctl status phabricator_stats_job_quarterly_metrics.timer" [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:38:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "@Aklapper: got it mostly solved but some numbers are missing:" [puppet] - 10https://gerrit.wikimedia.org/r/929417 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:39:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10MoritzMuehlenhoff) In the busybox shell, what does "uname -a" show as the running kernel version?
[21:42:09] <wikibugs>	 (03PS1) 10Dzahn: phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387)
[21:43:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) @MoritzMuehlenhoff  `  (initramfs) uname -a Linux (none) 4.19.0-24-amd64 #1 SMP Debian 4.19.282-1 (2023-04-29) x86_64 GNU/Linux (initramfs)
[21:44:02] <wikibugs>	 (03PS2) 10BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T301944)
[21:44:59] <Dylsss_>	 Phab admin or acl*userdisable https://phabricator.wikimedia.org/p/Rule34Enjoyer/
[21:48:16] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338904 (10phaultfinder)
[21:50:31] <TheresNoTime>	 Disabled the account
[21:51:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[21:51:14] <wikibugs>	 (03PS2) 10Dzahn: phabricator: replace cut with sed in quarterly_metrics.sh [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387)
[22:05:20] <icinga-wm>	 RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:18] <wikibugs>	 (03CR) 10Dzahn: "works now after this:" [puppet] - 10https://gerrit.wikimedia.org/r/929423 (https://phabricator.wikimedia.org/T337387) (owner: 10Dzahn)
[22:13:00] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:14] <wikibugs>	 (03PS2) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944)
[22:15:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall)
[22:16:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster
[22:16:54] <wikibugs>	 (03PS3) 10BCornwall: Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944)
[22:17:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e...
[22:17:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:18:51] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41681/console" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall)
[22:20:43] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Revert "service::catalog: add prometheus-https" [puppet] - 10https://gerrit.wikimedia.org/r/928942 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall)
[22:21:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[22:22:32] <brett>	 !log Roll restarting pybal on lvs2014 to revert prometheus service rollout - T326657
[22:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:36] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[22:22:40] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[22:23:06] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 80 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal
[22:23:44] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[22:23:48] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 77 connections established with conf2004.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal
[22:24:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[22:26:11] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[22:26:17] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 126 connections established with conf1007.eqiad.wmnet:4001 (min=126) https://wikitech.wikimedia.org/wiki/PyBal
[22:27:17] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.25:443]) https://wikitech.wikimedia.org/wiki/PyBal
[22:31:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:34:38] <wikibugs>	 (03CR) 10EoghanGaffney: admin: reserve gerrit uid/gid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar)
[22:37:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:39:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:40:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:42:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[22:42:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye
[22:42:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest1003.eqiad.wmnet with OS bullseye completed: - sretest1003 (**P...
[22:43:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm)
[22:46:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jhancock.wm) 05Open→03Resolved @jbond or @Volans finished this. all yours.
[22:46:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:48:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:49:47] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:54:11] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:57:15] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) 05Open→03Resolved
[23:03:31] <icinga-wm>	 RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye
[23:05:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye
[23:08:09] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye
[23:14:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye
[23:17:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye
[23:17:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db...
[23:36:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye
[23:36:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye
[23:46:29] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:48:01] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:49:17] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye
[23:49:20] <wikibugs>	 (03Abandoned) 10Chad: WIP: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/915701 (https://phabricator.wikimedia.org/T320390) (owner: 10Chad)
[23:52:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye
[23:52:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye