[00:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332 (owner: TrainBranchBot)
[00:53:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:54:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332 (owner: TrainBranchBot)
[01:08:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:58:23] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:23] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:22] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:05:37] <wikibugs>	 (PS3) Hashar: wm-checks-api: check undefined real_author [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:07:18] <wikibugs>	 (CR) Hashar: [C: +2] "Thank you Paladox, that got noticed by Timo as well on T328484 and only happens on old changes. I have slightly amended the commit messag" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:07:50] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: check undefined real_author [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:08:51] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@cad3002]: wm-checks-api: check undefined real_author - T328484
[04:08:55] <stashbot>	 T328484: [wm-checks-api] error: changeMessage.real_author is undefined - https://phabricator.wikimedia.org/T328484
[04:08:59] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@cad3002]: wm-checks-api: check undefined real_author - T328484 (duration: 00m 08s)
[04:30:42] <wikibugs>	 (PS1) Hashar: wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484)
[04:32:09] <wikibugs>	 (CR) Hashar: [C: +2] wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484) (owner: Hashar)
[04:32:39] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484) (owner: Hashar)
[04:33:08] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@1153a16]: wm-checks-api: check undefined real_author (2) - T328484
[04:33:12] <stashbot>	 T328484: [wm-checks-api] error: changeMessage.real_author is undefined - https://phabricator.wikimedia.org/T328484
[04:33:16] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@1153a16]: wm-checks-api: check undefined real_author (2) - T328484 (duration: 00m 08s)
[04:35:06] <hashar>	 isn't the `!log Started` and `!log Finished` something new?
[04:35:39] <hashar>	 nope
[04:35:44] <hashar>	 they always went in pairs, apparently.
[05:17:33] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (Marostegui)
[05:31:06] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:58:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:06] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:24:05] <wikibugs>	 (CR) Peter Fischer: [C: +2] "Deployed maven artefact and debian package" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[06:33:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T0700).
[07:00:04] <jouncebot>	 sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:32] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage pc1015, pc1016 [puppet] - https://gerrit.wikimedia.org/r/938477 (https://phabricator.wikimedia.org/T341271)
[07:04:08] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Allow reimage pc1015, pc1016 [puppet] - https://gerrit.wikimedia.org/r/938477 (https://phabricator.wikimedia.org/T341271) (owner: Marostegui)
[07:05:06] <sergi0>	 hello, I had a backport scheduled but I'm just seeing the link in the schedule is wrong and didn't cherry-pick. I'm gonna amend it now.
[07:12:42] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage pc2015, pc2016 [puppet] - https://gerrit.wikimedia.org/r/938478 (https://phabricator.wikimedia.org/T341270)
[07:14:53] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Allow reimage pc2015, pc2016 [puppet] - https://gerrit.wikimedia.org/r/938478 (https://phabricator.wikimedia.org/T341270) (owner: Marostegui)
[07:30:39] <wikibugs>	 (PS1) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:30:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:31:16] <wikibugs>	 (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:31:23] <wikibugs>	 (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:48:45] <wikibugs>	 (PS2) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:48:47] <wikibugs>	 (PS2) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:49:17] <wikibugs>	 SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (JMeybohm) AIUI the only thing talking to the ev...
[07:49:23] <wikibugs>	 (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:50:03] <wikibugs>	 (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:52:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] udp2log: run mw-log-cleanup after logrotate [puppet] - https://gerrit.wikimedia.org/r/938228 (https://phabricator.wikimedia.org/T341691) (owner: Filippo Giunchedi)
[07:54:38] <wikibugs>	 (PS3) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:54:40] <wikibugs>	 (PS3) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:55:36] <wikibugs>	 (CR) JMeybohm: Use cert-manager certificates instead of cergen for tls termination (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[07:55:41] <wikibugs>	 (CR) JMeybohm: Testing hack: Update ipoid to certmanager (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[07:55:59] <wikibugs>	 (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:56:02] <wikibugs>	 (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:56:41] <wikibugs>	 (PS4) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:56:43] <wikibugs>	 (PS4) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:57:04] <wikibugs>	 (PS1) Marostegui: report_users: Remove 10.64.0.134 [software] - https://gerrit.wikimedia.org/r/938647
[07:57:38] <wikibugs>	 (PS6) JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033)
[07:57:40] <wikibugs>	 (PS6) JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033)
[07:57:42] <wikibugs>	 (PS6) JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[07:58:38] <wikibugs>	 (PS2) Marostegui: report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647
[07:59:41] <wikibugs>	 (CR) Marostegui: [C: +2] report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647 (owner: Marostegui)
[08:01:03] <wikibugs>	 (Merged) jenkins-bot: report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647 (owner: Marostegui)
[08:24:01] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Last Friday we've done some troubleshooting and tested a lot of different configurations, thanks @SLyngshede-WMF again! In...
[08:27:34] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me." [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[08:28:16] <wikibugs>	 (CR) Btullis: [C: +2] Update the cumin alias for analytics-airflow [puppet] - https://gerrit.wikimedia.org/r/929702 (https://phabricator.wikimedia.org/T333697) (owner: Btullis)
[08:30:12] <fabfur>	 !log disable puppet on all cp* hosts in eqsin to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938002 (T340983)
[08:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:22] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:31:54] <wikibugs>	 (CR) Fabfur: [V: +1 C: +2] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:37:09] <fabfur>	 !log enable puppet on cp5024 and cp5032 to deploy 938002
[08:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:59] <icinga-wm>	 PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 604825 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:51:50] <fabfur>	 !log enable puppet on A:cp-eqsin to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938002 (T340983)
[08:51:52] <elukey>	 running puppet --^
[08:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:54] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:53:10] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:54:06] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:55:21] <wikibugs>	 (PS1) Btullis: Deploy airflow version 2.6.3 to analytics_test [puppet] - https://gerrit.wikimedia.org/r/938803 (https://phabricator.wikimedia.org/T336286)
[08:56:29] <icinga-wm>	 RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:01:10] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet
[09:02:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[09:06:26] <wikibugs>	 (CR) Ayounsi: "recheck" [software/homer] (gnmi) - https://gerrit.wikimedia.org/r/927736 (owner: Volans)
[09:08:25] <wikibugs>	 (CR) CI reject: [V: -1] WIP: first scaffolding fo gNMI support [software/homer] (gnmi) - https://gerrit.wikimedia.org/r/927736 (owner: Volans)
[09:12:44] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: set better resourcequotas for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/938252 (owner: Elukey)
[09:16:55] <wikibugs>	 (CR) Btullis: [C: +2] Deploy airflow version 2.6.3 to analytics_test [puppet] - https://gerrit.wikimedia.org/r/938803 (https://phabricator.wikimedia.org/T336286) (owner: Btullis)
[09:17:24] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[09:17:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:17:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:18:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:18:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[09:18:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:19:30] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet
[09:22:03] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet
[09:22:37] <jinxer-wm>	 (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[09:24:39] <wikibugs>	 (CR) Ayounsi: [C: +2] Ignore LAGs from test_port_block_consistency (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/932400 (owner: Ayounsi)
[09:26:07] <wikibugs>	 (PS1) Fabfur: hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983)
[09:26:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[09:27:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[09:27:37] <jinxer-wm>	 (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[09:29:08] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet
[09:30:55] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
[09:31:41] <wikibugs>	 (PS1) Filippo Giunchedi: base: bump cadvisor rollout to 45% in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027)
[09:32:05] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:45] <godog>	 looking for a signoff on https://gerrit.wikimedia.org/r/c/operations/puppet/+/938810
[09:32:46] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 3 NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42496/console" [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:35:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[09:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:37:51] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:55] <wikibugs>	 (PS2) Fabfur: hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983)
[09:38:15] <wikibugs>	 (CR) Hashar: [C: +1] "I guess that is how I should have written it in the first place :)" [puppet] - https://gerrit.wikimedia.org/r/937978 (owner: Jbond)
[09:38:32] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[09:38:53] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
[09:39:03] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet
[09:39:11] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:40:57] <wikibugs>	 (CR) Vgutierrez: [C: +1] hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:42:26] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/938226 (https://phabricator.wikimedia.org/T341045) (owner: ArielGlenn)
[09:42:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet
[09:42:42] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-stretch1001.eqiad.wmnet
[09:43:19] <wikibugs>	 (CR) ArielGlenn: [C: +2] add jebe and xcollazo to nagios command access [puppet] - https://gerrit.wikimedia.org/r/938226 (https://phabricator.wikimedia.org/T341045) (owner: ArielGlenn)
[09:43:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet
[09:43:51] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (ayounsi)
[09:44:45] <fabfur>	 !log disabled puppet on A:cp hosts in ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938807 (T340983)
[09:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:49] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[09:45:10] <wikibugs>	 (CR) Fabfur: [C: +2] hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:46:00] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet
[09:48:35] <fabfur>	 !log enabled puppet on A:cp hosts in ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938807 (T340983) (hosts will run puppet with the usual schedule)
[09:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1001.eqiad.wmnet
[09:50:20] <wikibugs>	 (CR) Ayounsi: [C: +1] "I don't know the tool itself, but as long as the rollback is easy I'd say +1" [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:50:55] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (ayounsi)
[09:51:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] base: bump cadvisor rollout to 45% in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:51:48] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1002.eqiad.wmnet
[09:56:08] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi)
[09:57:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:15] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:22] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1002.eqiad.wmnet
[09:59:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1000)
[10:04:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet
[10:08:08] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet
[10:09:19] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:56] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:10:30] <godog>	 mmhh cadvisor failures are me
[10:10:39] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[10:10:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet
[10:11:41] <wikibugs>	 (PS1) Elukey: profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479)
[10:11:44] <wikibugs>	 (PS1) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:11:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:06] <wikibugs>	 (CR) CI reject: [V: -1] make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[10:13:12] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] "60s LGTM!" [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[10:13:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:59] <icinga-wm>	 PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:14:35] <icinga-wm>	 RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[10:14:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:42] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet
[10:15:19] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet
[10:17:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:39] <wikibugs>	 (PS1) Giuseppe Lavagetto: noc: stop using script to populate database data URIs [puppet] - https://gerrit.wikimedia.org/r/938818 (https://phabricator.wikimedia.org/T341859)
[10:18:05] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1012 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:03] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:04] <wikibugs>	 (PS2) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:20:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:17] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:36] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet
[10:22:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[10:23:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1065 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:50] <wikibugs>	 (CR) Klausman: [C: +1] profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[10:23:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet
[10:24:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[10:24:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet
[10:30:27] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet
[10:31:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet
[10:31:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
[10:32:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet
[10:33:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:33:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[10:33:53] <wikibugs>	 (PS3) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:33:55] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
[10:34:31] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[10:36:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:27] <wikibugs>	 (CR) Volans: [C: +2] irc: small refactor to cleanup the code [software/pywmflib] - https://gerrit.wikimedia.org/r/937499 (owner: Volans)
[10:38:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet
[10:41:17] <wikibugs>	 (Merged) jenkins-bot: irc: small refactor to cleanup the code [software/pywmflib] - https://gerrit.wikimedia.org/r/937499 (owner: Volans)
[10:41:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[10:44:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1070.eqiad.wmnet with OS bullseye
[10:45:04] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[10:45:18] <wikibugs>	 (CR) ArielGlenn: "ppc output looks good, see https://puppet-compiler.wmflabs.org/output/938816/42498/ and especially the output for dumpsdata1007, the one s" [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[10:45:32] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: CR: cloud-host: allow return traffic for PDNS servers [homer/public] - https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966)
[10:45:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[10:48:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
[10:50:11] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[10:50:47] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1652 days) https://wikitech.wikimedia.org/wiki/Logs
[10:51:50] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
[10:52:36] <wikibugs>	 (PS1) Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - https://gerrit.wikimedia.org/r/938820
[10:52:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[10:53:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:53:59] <wikibugs>	 (CR) Ladsgroup: [C: +1] sre.hosts.decommission: fix call to downtime [cookbooks] - https://gerrit.wikimedia.org/r/937508 (owner: Volans)
[10:54:16] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[10:54:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
[10:55:52] <wikibugs>	 SRE-tools, Spicerack: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (Volans) p:Triage→Medium
[10:58:41] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wikireplicas: relocate some hardcoded data into hiera [puppet] - https://gerrit.wikimedia.org/r/938238 (owner: Arturo Borrero Gonzalez)
[11:01:16] <wikibugs>	 (PS2) Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - https://gerrit.wikimedia.org/r/938820
[11:04:27] <wikibugs>	 (CR) Volans: "This change is ready for review." [software/spicerack] - https://gerrit.wikimedia.org/r/938821 (owner: Volans)
[11:07:31] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1070.eqiad.wmnet with reason: host reimage
[11:07:58] <wikibugs>	 SRE, Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (BTullis) @MoritzMuehlenhoff - I've just bumped into this issue on upgrading the first prod hadoop worker and I found this bug reference, which seems highly releva...
[11:08:19] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet
[11:08:27] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet
[11:08:32] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: eqiad1: decomission cloudcontrol1005.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/938235 (https://phabricator.wikimedia.org/T341495)
[11:08:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[11:10:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[11:10:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1070.eqiad.wmnet with reason: host reimage
[11:11:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] eqiad1: decomission cloudcontrol1005.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/938235 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[11:12:16] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1005.wikimedia.org
[11:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:14] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet
[11:15:35] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet
[11:17:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[11:18:43] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:19:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:03] <wikibugs>	 (CR) Volans: "question inline" [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T332314) (owner: Bking)
[11:22:15] <wikibugs>	 (CR) JMeybohm: [C: +1] profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[11:22:35] <wikibugs>	 SRE: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (Aklapper)
[11:22:44] <wikibugs>	 (CR) Btullis: "I see that the functionality is good, but I don't see why you need to make two new profiles for this task." [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[11:23:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[11:23:24] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: nova fullstack: updated harcoded access to the list of controllers [puppet] - https://gerrit.wikimedia.org/r/938831 (https://phabricator.wikimedia.org/T341495)
[11:26:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: nova fullstack: updated harcoded access to the list of controllers [puppet] - https://gerrit.wikimedia.org/r/938831 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[11:26:46] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[11:28:06] <wikibugs>	 SRE, Platform Engineering, Traffic, Wikimedia Enterprise: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (Aklapper) a:RBrounley_WMF→None Removing inactive task assignee. (Please do so as part of the team's offboardin...
[11:29:42] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet
[11:29:53] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[11:30:16] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[11:30:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet
[11:33:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:35:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[11:35:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1070.eqiad.wmnet with OS bullseye
[11:35:52] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1168 days) https://wikitech.wikimedia.org/wiki/Logs
[11:36:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[11:36:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet
[11:38:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:38:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
[11:39:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
[11:45:42] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet
[11:45:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
[11:46:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[11:46:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet
[11:51:15] <wikibugs>	 (PS1) Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - https://gerrit.wikimedia.org/r/938840
[11:52:44] <wikibugs>	 (PS1) Ladsgroup: realm: Add two new private tables of CheckUser [puppet] - https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076)
[11:53:02] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet
[11:54:26] <wikibugs>	 (CR) Marostegui: "Does this live in x1?" [puppet] - https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: Ladsgroup)
[11:56:24] <wikibugs>	 (CR) Ladsgroup: realm: Add two new private tables of CheckUser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: Ladsgroup)
[11:56:47] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 8 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42499/console" [puppet] - https://gerrit.wikimedia.org/r/938840 (owner: Fabfur)
[12:01:03] <wikibugs>	 (PS2) Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - https://gerrit.wikimedia.org/r/938840
[12:01:20] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[12:01:20] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:01:21] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1005.wikimedia.org
[12:03:52] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 3 NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42500/console" [puppet] - https://gerrit.wikimedia.org/r/938840 (owner: Fabfur)
[12:23:27] <wikibugs>	 (CR) Marostegui: "Ok then this needs a mariadb restart" [puppet] - https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: Ladsgroup)
[12:24:40] <wikibugs>	 (CR) Marostegui: "Considering what happened last time we restarted the hosts, I would suggest to:" [puppet] - https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: Ladsgroup)
[12:26:51] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[12:27:01] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (JMeybohm)
[12:30:05] <icinga-wm>	 PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100%
[12:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:30:37] <icinga-wm>	 RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms
[12:34:04] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (Jclark-ctr)
[12:34:08] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (Jclark-ctr) Open→Resolved
[12:34:39] <icinga-wm>	 PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100%
[12:35:03] <icinga-wm>	 RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms
[12:35:13] <wikibugs>	 SRE, Data-Persistence-Backup, Infrastructure-Foundations, Patch-For-Review, Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (jbond) [[ https://docs.google.com/document/d/1L2s9QqJRhKpngmWHyoCJdr6eHK5z3tm6i4zroJKJt-g/edit | Notes from in pe...
[12:37:09] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (Jclark-ctr)
[12:37:16] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (Jclark-ctr) Open→Resolved
[12:37:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [cookbooks] - https://gerrit.wikimedia.org/r/937509 (owner: Volans)
[12:38:29] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (Jclark-ctr) Open→Resolved
[12:39:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Uninstall Diamond everywhere [puppet] - https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) (owner: Majavah)
[12:40:35] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (Jclark-ctr)
[12:41:59] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (Jclark-ctr) Open→Resolved
[12:42:15] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[12:43:46] <wikibugs>	 SRE, Data-Persistence-Backup, Infrastructure-Foundations, Patch-For-Review, Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (jcrespo) @jbond To answer the question #1:  This is the configuration from a client, where encryption and decrypt...
[12:44:25] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (Jclark-ctr) Open→Resolved
[12:46:49] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (Jclark-ctr) Open→Resolved
[12:47:57] <wikibugs>	 (CR) Elukey: [C: +2] profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[12:49:27] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/938840 (owner: Fabfur)
[12:50:26] <wikibugs>	 SRE, Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (Vgutierrez) @Isaac @Htriedman @Jcross could you confirm that this is working as expected and can be closed?
[12:50:31] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (Jclark-ctr) Open→Resolved
[12:50:52] <wikibugs>	 (PS1) LSobanski: Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927)
[12:52:35] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (Jclark-ctr) Open→Resolved
[12:53:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:54:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:54:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:55:29] <wikibugs>	 (CR) EoghanGaffney: [C: +1] Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: LSobanski)
[12:55:31] <wikibugs>	 (PS3) Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - https://gerrit.wikimedia.org/r/938840
[12:55:57] <wikibugs>	 (CR) Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/938840 (owner: Fabfur)
[12:56:01] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (Jclark-ctr) Open→Resolved
[12:56:38] <wikibugs>	 (CR) Jbond: "some minor nits inline (some pre-existing), the main im concerned about is:" [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: Arturo Borrero Gonzalez)
[12:56:47] <wikibugs>	 (CR) Fabfur: [C: +2] hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - https://gerrit.wikimedia.org/r/938840 (owner: Fabfur)
[12:56:53] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:41] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) As requested, @jbond your two users on the replica and production dumped via API (`curl "https://gitlab-replica.wikimedia.o...
[12:57:54] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (Jclark-ctr)
[12:58:05] <fabfur>	 !log disabled puppet on A:cp-codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983)
[12:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:08] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[12:58:26] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (Jclark-ctr) Open→Resolved
[12:59:16] <wikibugs>	 (PS1) Ssingh: depool esams: router migration [dns] - https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997)
[12:59:40] <wikibugs>	 (PS3) Elukey: ml-services: set knative concurrency values for ml pods [deployment-charts] - https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1300).
[13:00:04] <jouncebot>	 sergi0 and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <sergi0>	 hi
[13:00:15] <aanzx>	 o/
[13:00:27] <Lucas_WMDE>	 I’ll be in a meeting for a bit longer, anyone else around to deploy?
[13:00:32] <taavi>	 I can deploy
[13:00:33] <Lucas_WMDE>	 (otherwise I can probably do it in 30 minutes or so)
[13:00:34] <sukhe>	 heads-up: esams depooling shortly
[13:00:34] <Lucas_WMDE>	 thanks!
[13:00:50] <taavi>	 sukhe: just to confirm, can we continue with the deployment as normal?
[13:00:59] <wikibugs>	 SRE, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1014.eqiad.wmnet - https://phabricator.wikimedia.org/T341782 (Jclark-ctr) Open→Resolved
[13:01:00] <sukhe>	 sorry, and yes, you can continue, but just a heads-up
[13:01:06] <taavi>	 ack
[13:01:09] <sukhe>	 for the channel mostly
[13:01:13] <Lucas_WMDE>	 it’s just depooled from traffic, but deployments still go to it?
[13:01:27] <zabe>	 it has no appservers
[13:01:27] <wikibugs>	 (CR) Majavah: [C: +2] "deploying" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938306 (https://phabricator.wikimedia.org/T341865) (owner: Urbanecm)
[13:01:32] <sukhe>	 Lucas_WMDE: edge site
[13:01:34] <taavi>	 Lucas_WMDE: no appservers in esams
[13:01:34] <Lucas_WMDE>	 oh right
[13:01:47] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997) (owner: Ssingh)
[13:02:28] <wikibugs>	 (CR) Ssingh: [C: +2] depool esams: router migration [dns] - https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997) (owner: Ssingh)
[13:02:40] <taavi>	 aanzx: going to deploy your config changes while we wait for CI for sergi0's backport
[13:02:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938677 (https://phabricator.wikimedia.org/T341940) (owner: Anzx)
[13:02:47] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: Anzx)
[13:02:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: Anzx)
[13:02:56] <aanzx>	 ok
[13:02:57] <wikibugs>	 SRE, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (Jclark-ctr)
[13:02:57] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:25] <wikibugs>	 SRE, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (Jclark-ctr) Open→Resolved
[13:03:28] <sukhe>	 !log run authdns-update to depool esams
[13:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:13] <wikibugs>	 (Merged) jenkins-bot: change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/938677 (https://phabricator.wikimedia.org/T341940) (owner: Anzx)
[13:04:28] <wikibugs>	 (PS4) Majavah: Add appendix namespace aliases on huwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: Anzx)
[13:04:33] <wikibugs>	 (PS4) Majavah: robots.txt: Disable indexing draft-related pages on knwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: Anzx)
[13:04:38] <wikibugs>	 (CR) TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: Anzx)
[13:04:40] <wikibugs>	 (CR) TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: Anzx)
[13:04:40] <fabfur>	 !log run puppet on cp2027 to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983)
[13:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:43] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[13:04:44] <taavi>	 right. should have seen that coming
[13:06:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add appendix namespace aliases on huwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: 10Anzx)
[13:06:18] <wikibugs>	 (03Merged) 10jenkins-bot: robots.txt: Disable indexing draft-related pages on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: 10Anzx)
[13:06:30] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frpig1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340128 (10Jclark-ctr)
[13:06:50] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]]
[13:06:53] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frpig1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340128 (10Jclark-ctr) 05Open→03Resolved
[13:06:56] <stashbot>	 T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940
[13:06:57] <stashbot>	 T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958
[13:06:57] <stashbot>	 T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926
[13:07:16] <fabfur>	 !log enabled puppet on A:cp-codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983) (hosts will run puppet with the usual schedule)
[13:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:20] <wikibugs>	 (03PS1) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850
[13:08:57] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[13:09:01] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet
[13:10:26] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) I did some more testing today and can confirm that the required config is `cas.authn.oidc.id-token.include-id-token-claims=...
[13:12:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it
[13:13:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10Jclark-ctr) a:03Jclark-ctr
[13:13:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it
[13:13:22] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10Jclark-ctr) 05Open→03Resolved
[13:15:06] <wikibugs>	 (03CR) 10Ladsgroup: realm: Add two new private tables of CheckUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup)
[13:15:56] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet
[13:15:57] <wikibugs>	 (03PS2) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850
[13:16:25] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) >>! In T297314#9018519, @JMeyb...
[13:16:37] <logmsgbot>	 !log taavi@deploy1002 taavi and anzx: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:16:43] <stashbot>	 T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940
[13:16:43] <stashbot>	 T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958
[13:16:44] <stashbot>	 T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926
[13:16:52] <taavi>	 aanzx: please test
[13:17:06] <aanzx>	 ok
[13:17:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet
[13:18:51] <aanzx>	 taavi huwiktionary and mnwwiktionary good , nothing to test on knwiki
[13:19:50] <wikibugs>	 (03Merged) 10jenkins-bot: NewImpact: fix undefined log function [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938306 (https://phabricator.wikimedia.org/T341865) (owner: 10Urbanecm)
[13:20:02] <taavi>	 ok, syncing
[13:20:08] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr)
[13:20:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) 05Open→03Resolved
[13:20:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging per promise to effie and hugh." [deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[13:20:54] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Jclark-ctr) 05Open→03Resolved
[13:21:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet
[13:21:15] <aanzx>	 after sync please run namespaceDupes.php for both hu , mnw wiktionary , @taavi
[13:21:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr)
[13:21:23] <taavi>	 ack
[13:21:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) 05Open→03Resolved
[13:22:03] <wikibugs>	 (03PS1) 10Ssingh: Revert "depool esams: router migration" [dns] - 10https://gerrit.wikimedia.org/r/938678
[13:22:25] <wikibugs>	 (03CR) 10Ssingh: "DO NOT MERGE. Emergency patch." [dns] - 10https://gerrit.wikimedia.org/r/938678 (owner: 10Ssingh)
[13:23:06] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: Bye bye nutcracker! [deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli)
[13:23:27] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:24:48] <wikibugs>	 (03CR) 10Marostegui: realm: Add two new private tables of CheckUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup)
[13:24:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup)
[13:24:57] <wikibugs>	 (03PS1) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028)
[13:25:44] <taavi>	 !log taavi@deploy1002 ~ $ mwscript namespaceDupes.php --wiki mnwwiktionary --fix # T341940
[13:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:47] <stashbot>	 T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940
[13:26:04] <taavi>	 1660 links to fix, 1660 were resolvable, 0 were deleted.
[13:26:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:26:39] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]] (duration: 19m 48s)
[13:26:44] <stashbot>	 T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958
[13:26:45] <stashbot>	 T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926
[13:26:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Papaul) @Dwisehaupt hello is this ok now?
[13:27:12] <wikibugs>	 (03CR) 10Kaleem Bhatti: [C: 03+1] "anyone please submit review for this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[13:27:14] <wikibugs>	 (03PS25) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[13:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:27:28] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]]
[13:27:31] <stashbot>	 T341865: log is not a function. - https://phabricator.wikimedia.org/T341865
[13:27:52] <taavi>	 !log taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php --wiki huwiktionary --fix # T341926
[13:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm I will check and let you know
[13:28:52] <logmsgbot>	 !log taavi@deploy1002 taavi and urbanecm: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:29:01] <taavi>	 sergi0: please test
[13:29:08] <sergi0>	 testing now
[13:29:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[13:29:55] <taavi>	 Amir1: can you quickly tell what's wrong with namespaceDupes.php? https://phabricator.wikimedia.org/P49565
[13:30:31] <Amir1>	 have to go to meeting but that thing breaks constantly
[13:30:59] <Amir1>	 sigh, it's linkmigration piece again
[13:31:05] <Lucas_WMDE>	 oh, that again
[13:31:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:31:50] <sergi0>	 taavi: looking good from my side, the error is not present in the analytics requests anymore
[13:31:52] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > from our side we will need to check if cas.authn.oidc.id-token.include-id-token-claims=true is ok to enable globally or i...
[13:31:56] <taavi>	 sergi0: ok, syncing
[13:32:09] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
[13:32:11] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Aklapper) > Feel free to try ssh to these hosts now. phab1004.eqiad.wmnet is prod phab, phab-test1001.eqiad.wmnet is the test machine, phab200...
[13:32:22] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi)
[13:32:29] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10SecTeam-Processed, 10Security, 10Vuln-Infoleak: Etherpad deletion 9NXnJ9N1vJP8YuBOyY6V - https://phabricator.wikimedia.org/T341903 (10sbassett)
[13:32:34] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10SecTeam-Processed, 10Security, 10Vuln-Infoleak: Etherpad deletion 9NXnJ9N1vJP8YuBOyY6V - https://phabricator.wikimedia.org/T341903 (10sbassett) p:05Triage→03Low
[13:33:15] <icinga-wm>	 PROBLEM - Host ms-be2057 is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:31] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1072.eqiad.wmnet with OS bullseye
[13:34:04] <wikibugs>	 (03PS26) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[13:35:20] <wikibugs>	 (03CR) 10Cory Massaro: [C: 03+1] wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:35:23] <icinga-wm>	 RECOVERY - Host ms-be2057 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[13:35:45] <taavi>	 Amir1: Lucas_WMDE: I think the issue is that LinksMigration::getLinksConditions() won't create a new LinkTarget if none exists (instead it'll just return a query that never matches), but namespaceDupes.php expects it would
[13:36:41] <Amir1>	 honestly, it shouldn't do anything for those cases, it should just reparse the page and let the logic handle it instead of redoing the logic
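(editor's note) The mismatch taavi describes at 13:35:45 can be pictured with a toy model: a condition builder that only looks up a link target, next to a caller that assumes a row for the new target exists. This is a minimal sketch under stated assumptions, not MediaWiki code. Every name below (LinkTargetStore, lookup_id, acquire_id, conditions_for, move_links_lookup_only, move_links_get_or_create) is hypothetical, and the exact failure seen in the paste is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LinkTargetStore:
    """Toy stand-in for a table mapping (namespace, title) to a numeric id."""

    rows: dict[tuple[int, str], int] = field(default_factory=dict)
    next_id: int = 1

    def lookup_id(self, namespace: int, title: str) -> int | None:
        """Read-only lookup: returns None when no row exists."""
        return self.rows.get((namespace, title))

    def acquire_id(self, namespace: int, title: str) -> int:
        """Get-or-create: inserts a row when none exists yet."""
        key = (namespace, title)
        if key not in self.rows:
            self.rows[key] = self.next_id
            self.next_id += 1
        return self.rows[key]


def conditions_for(store: LinkTargetStore, namespace: int, title: str):
    """Build a filter for links pointing at a target, using only a lookup.

    When the target has no row, the returned filter can never match.
    """
    target_id = store.lookup_id(namespace, title)
    if target_id is None:
        return lambda link: False
    return lambda link: link["target_id"] == target_id


def move_links_lookup_only(store, links, old, new) -> int:
    """Hypothetical fixer that assumes the new target already has an id."""
    matches_old = conditions_for(store, *old)
    new_id = store.lookup_id(*new)
    moved = 0
    for link in links:
        if matches_old(link):
            if new_id is None:
                # The assumption breaks here: nothing created the new row.
                raise LookupError(f"no link target row for {new}")
            link["target_id"] = new_id
            moved += 1
    return moved


def move_links_get_or_create(store, links, old, new) -> int:
    """Same fixer, but it creates the new target row when it is missing."""
    matches_old = conditions_for(store, *old)
    new_id = store.acquire_id(*new)
    moved = 0
    for link in links:
        if matches_old(link):
            link["target_id"] = new_id
            moved += 1
    return moved
```

In this model the lookup-only variant raises as soon as a link has to move to a target that has no row, while the get-or-create variant moves it. Amir1's suggestion in the log is a different route: reparse the page and let the normal link update logic write the rows.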
[13:37:23] <wikibugs>	 (03PS2) 10Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:37:47] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]] (duration: 10m 19s)
[13:37:51] <stashbot>	 T341865: log is not a function. - https://phabricator.wikimedia.org/T341865
[13:37:55] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:03] <icinga-wm>	 RECOVERY - Host ores2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[13:38:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:38:14] <wikibugs>	 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez)
[13:38:23] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet
[13:38:51] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet
[13:39:24] <wikibugs>	 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) @ayounsi @cmooney could you let DCops know which racks would be better for these boxes? Thanks!
[13:39:51] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:40:59] <fabfur>	 !log reimaging cp4037 as preparatory test for knams migration
[13:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:09] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170)
[13:42:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[13:42:34] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[13:42:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[13:42:54] <taavi>	 manually worked around that by purging the affected page by hand
[13:43:31] <akosiaris>	 !log deploy removal of nutcracker from thumbor. T318695
[13:43:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:35] <stashbot>	 T318695: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695
[13:43:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[13:43:47] <taavi>	 T341993
[13:43:47] <stashbot>	 T341993: namespaceDupes.php can fail if new target does not have a linktarget entry - https://phabricator.wikimedia.org/T341993
[13:43:56] <Lucas_WMDE>	 was about to say, purging the pages worked last time https://phabricator.wikimedia.org/T334277#8775922
[13:44:04] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019527, @Jdforrester-...
[13:44:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond)
[13:44:24] <Lucas_WMDE>	 taavi: are you done deploying now?
[13:44:32] <taavi>	 yes
[13:44:35] <Lucas_WMDE>	 ok thanks
[13:44:38] <Lucas_WMDE>	 then I’ll do a security fix
[13:45:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[13:46:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[13:46:32] <wikibugs>	 (03PS3) 10Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[13:46:41] <wikibugs>	 (03PS4) 10Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[13:47:00] <sergi0>	 taavi: thank you for the assistance
[13:47:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:47:30] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet
[13:47:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[13:47:43] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet
[13:48:23] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice, it seems more tidy this way!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey)
[13:48:36] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye
[13:50:30] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1072.eqiad.wmnet with OS bullseye
[13:51:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:52:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[13:52:23] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300
[13:53:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[13:53:41] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1072.eqiad.wmnet with OS bullseye
[13:54:24] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
[13:54:49] <logmsgbot>	 !log lucaswerkmeister-wmde Deployed security patch for T340217
[13:54:55] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks for troubleshooting this more! I can confirm existing users have `cas3` in the `identities` section. This leads to a...
[13:55:12] <wikibugs>	 (03PS1) 10Jbond: tox: drop the minor version requierment on admin checks [puppet] - 10https://gerrit.wikimedia.org/r/938858
[13:55:47] * Lucas_WMDE done
[13:56:00] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170)
[13:56:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:56:57] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:59:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:02:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
[14:03:24] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
[14:04:09] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet
[14:07:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:08:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet
[14:08:23] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:27] <wikibugs>	 (03PS2) 10DCausse: Link to new repo to build docker dev image [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/937949
[14:08:34] <wikibugs>	 (03CR) 10DCausse: [V: 03+2 C: 03+2] Link to new repo to build docker dev image [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/937949 (owner: 10DCausse)
[14:08:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:11] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:09:35] <wikibugs>	 10ops-codfw, 10Machine-Learning-Team: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10Jhancock.wm) 05Open→03Resolved replaced idrac card and coms battery. updated idrac IP info. BAT0002 alert has cleared and the server is reachable by ssh
[14:10:13] <wikibugs>	 (03PS4) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300
[14:10:15] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - 10https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314)
[14:10:16] <elukey>	 !log start kafka partitions rebalance for main-codfw (long running maintenance, see https://phabricator.wikimedia.org/T341558)
[14:10:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1072.eqiad.wmnet with reason: host reimage
[14:11:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[14:11:34] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet
[14:11:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
[14:12:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
[14:12:23] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet
[14:12:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[14:13:54] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1072.eqiad.wmnet with reason: host reimage
[14:13:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Jclark-ctr sorry missed one. edited my previous comment with the additional to keep all the info together.
[14:14:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:14:22] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:14:29] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10akosiaris)
[14:15:33] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:16:12] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10akosiaris) @hnowlan @jijiki. nutcracker removal merged and deployed. I am gonna let you have the pleasure of resolving this task :-)
[14:16:22] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye
[14:17:59] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye
[14:19:31] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi)
[14:20:19] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
[14:20:31] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet
[14:20:33] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet
[14:20:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:20:40] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet
[14:21:58] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=no; selector: name=ores2003.codfw.wmnet
[14:22:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999)
[14:22:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999)
[14:22:18] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868
[14:22:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[14:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:24:44] <wikibugs>	 (03PS2) 10Filippo Giunchedi: New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999)
[14:24:46] <logmsgbot>	 !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ores2003.codfw.wmnet
[14:24:46] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999)
[14:24:49] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868
[14:26:00] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) Regarding the last question, one important thing is that sometimes a recovery may need multiple backup s...
[14:26:34] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[14:26:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10User-jbond: wmf-styleguide checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (10jbond)
[14:27:03] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[14:27:53] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Nuyaml_backend does not allow binary Hiera data - https://phabricator.wikimedia.org/T113328 (10jbond) 05Open→03Resolved a:03jbond no update
[14:28:30] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server, 10User-jbond: Ensure puppet sends the correct ircd signals to update config and motd - https://phabricator.wikimedia.org/T284052 (10jbond) 05Open→03Resolved a:03jbond fixed with last patch
[14:29:34] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet
[14:30:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868 (owner: 10Filippo Giunchedi)
[14:30:46] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[14:30:55] <wikibugs>	 (03PS3) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868
[14:31:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Tested on pontoon-titan-01.monitoring.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[14:32:47] <wikibugs>	 (03CR) 10JMeybohm: noc: add script to dump etcd db config (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[14:33:27] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: 10LSobanski)
[14:33:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10netbox: Netbox missing physical device in PuppetDB when Puppet disabled for too long - https://phabricator.wikimedia.org/T254986 (10joanna_borun)
[14:33:51] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Patch-For-Review: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105 (10jbond) 05Open→03Declined going to close this as declined.  [[ https://ger...
[14:33:58] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey)
[14:34:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:38] <wikibugs>	 (03Merged) 10jenkins-bot: Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: 10LSobanski)
[14:35:15] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[14:36:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10jbond)
[14:36:19] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1652 days) https://wikitech.wikimedia.org/wiki/Logs
[14:36:55] <elukey>	 !log restart rsyslog on centrallog1002 ("peer did not provide a certificate, not permitted to talk to it")
[14:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:18] <elukey>	 godog: --^
[14:37:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10jbond)
[14:37:31] <elukey>	 errors seem related to some tcp conns to prometheus1006
[14:38:16] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10netops, 10good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10jbond)
[14:38:21] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage
[14:39:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:39:08] <godog>	 elukey: ack, thank you! yeah we've seen the problem from time to time with the gtls listener, haven't had a chance to dig deep yet though (and it recovers)
[14:39:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet
[14:40:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond)
[14:41:48] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage
[14:41:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond)
[14:42:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) I think we are ready for this cloudweb2002-dev move today, assuming no IP change, just a poweroff-poweron oper...
[14:50:28] <wikibugs>	 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Htriedman) @Vgutierrez this feature has been working as expected, and this ticket can be closed!
[14:50:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond)
[14:51:35] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10joanna_borun) 05Open→03Declined There are no specific actions we can take regarding this ticket. If additional discussion is needed, we can schedule a dedicated meeting.
[14:52:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:52:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1072.eqiad.wmnet with OS bullseye
[14:52:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet-Core, and 3 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond)
[14:52:42] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey)
[14:53:20] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey)
[14:54:08] <wikibugs>	 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez @Htriedman awesome, Thanks for the prompt response.   DP has been deployed and running happily since February 6th, 2023.
[14:54:14] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[14:57:40] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:57:44] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:59:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond)
[14:59:57] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet
[15:02:18] <sukhe>	 !log dns5003 upgrade to pdns-rec 4.8.4: T341611
[15:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:22] <stashbot>	 T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611
[15:02:46] <icinga-wm>	 PROBLEM - Host ms-be1064 is DOWN: PING CRITICAL - Packet loss = 100%
[15:03:10] <icinga-wm>	 RECOVERY - Host ms-be1064 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[15:04:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:04:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: in puppet 6 some core types have been moved to external modules.  check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) 05Open→03Resolved a:03jbond This has been handled as part of the puppet7 migration
[15:04:46] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye
[15:04:49] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond)
[15:07:53] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM🫰" [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[15:08:13] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet
[15:09:02] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:09:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[15:09:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet
[15:10:02] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Performance Issue: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812 (10jbond) 05Open→03Resolved a:03jbond
[15:10:03] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[15:10:14] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores: use envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170)
[15:10:17] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Technical-Debt: "Setting templatedir is deprecated" warning issued on self-hosted puppetmaster - https://phabricator.wikimedia.org/T95158 (10jbond) 05Open→03Resolved a:03jbond templatedir setting is now removed
[15:12:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:13:36] <dancy>	 jouncebot now
[15:13:36] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[15:14:01] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.54.0" for 605 hosts
[15:18:16] <wikibugs>	 (03CR) 10EllenR: [C: 03+1] "LGTM + has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx)
[15:19:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet
[15:22:09] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet
[15:23:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: set knative concurrency values for ml pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey)
[15:24:46] <wikibugs>	 (03PS3) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850
[15:25:14] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 (10jbond)
[15:25:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:25:59] <elukey>	 this is me --^
[15:26:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:26:15] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10jbond)
[15:29:26] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet
[15:29:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet
[15:29:55] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[15:30:04] <jouncebot>	 jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1530).
[15:30:14] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel)
[15:31:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:32:13] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: puppetlabs: create puppet 7 environment in WMCS to test code - https://phabricator.wikimedia.org/T294841 (10jbond)
[15:33:51] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[15:34:04] <wikibugs>	 10SRE, 10Observability-Alerting: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101 (10jbond) I'm not sure if this is still valid; however, I have removed the puppet tag as this would be better done in the alertmanager repo now
[15:35:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:36:30] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[15:37:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet
[15:37:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) server move complete
[15:39:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10jbond) >  Edit taskgen to support finding tox.ini files in each module instead of a single universal one with conditional changes filters. This is an idea...
[15:39:48] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
[15:40:14] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479)
[15:40:18] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team (Radar): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond)
[15:40:51] <wikibugs>	 10Puppet, 10SRE, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548 (10jbond)
[15:41:16] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (10Jclark-ctr) 05Open→03Resolved psu alerts have not returned closing ticket
[15:41:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team (Radar): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) 05Open→03Resolved a:03jbond im going to close this, I think with the `auto` keyword this is c...
[15:42:16] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos)
[15:42:23] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: disable alerts for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/937572 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper)
[15:43:11] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos)
[15:45:11] <wikibugs>	 (03PS4) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850
[15:45:13] <wikibugs>	 (03PS1) 10Elukey: ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882
[15:45:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey)
[15:45:34] <wikibugs>	 (03CR) 10Cory Massaro: wikifunctions: Add AppArmor profile usage (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[15:45:39] <wikibugs>	 (03PS2) 10Elukey: ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882
[15:46:39] <wikibugs>	 (03PS2) 10Cory Massaro: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[15:47:35] <wikibugs>	 (03CR) 10Cory Massaro: Add AppArmor configuration for the deployed function-evaluator service. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: 10Cory Massaro)
[15:47:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882 (owner: 10Elukey)
[15:48:38] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel)
[15:48:57] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:49:30] <logmsgbot>	 !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[15:49:42] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel)
[15:49:50] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:49:51] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[15:49:59] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet
[15:50:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[15:50:13] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:53:25] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10jbond) @fgiunchedi I'm tempted to close this as invalid as i don't see any issue with having spaces in resource titles and in some cases (e.g. notify, exec) it can be desirable....
[15:53:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10jbond)
[15:54:55] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond)
[15:54:58] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond)
[15:55:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Documentation, 10Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095 (10jbond)
[15:55:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
[15:56:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10jbond)
[15:56:08] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[15:56:30] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[15:56:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configure SRV records for new puppet infrastructure - https://phabricator.wikimedia.org/T341053 (10jbond) 05In progress→03Resolved a:03jbond
[15:56:40] <sukhe>	 hmm
[15:56:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[15:56:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374 (10jbond) p:05Triage→03High
[15:57:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374 (10jbond) p:05High→03Medium
[15:57:11] <vgutierrez>	 sukhe: drmrs is munching traffic compared to its usual p95 :)
[15:57:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[15:57:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) p:05Triage→03Medium
[15:57:51] <sukhe>	 yep
[15:57:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond)
[15:57:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[15:58:00] <sukhe>	 we are still fine here https://librenms.wikimedia.org/graphs/to=1689606600/id=23134/type=port_bits/from=1689520200/
[15:58:32] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet
[16:00:29] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jbond) p:05Triage→03Medium
[16:00:39] <jbond>	 !oncall
[16:02:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:04:19] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet
[16:04:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[16:05:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[16:05:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[16:06:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall great! And I appreciated you only bumped a patch level given we retain full backwards compatibility 😄 But maybe in this case at le" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[16:07:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:08:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:08:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:08:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:09:12] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42504/console" [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[16:09:46] <wikibugs>	 (CR) Herron: [C: +1] "Thanks for this -- LGTM although still learning the details of cfssl.  Let's try with a controlled rollout to centrallog2002 first" [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[16:10:55] <wikibugs>	 (PS1) Urbanecm: Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994)
[16:12:12] <urbanecm>	 jouncebot: nowandnext
[16:12:13] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[16:12:13] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700)
[16:12:13] <jouncebot>	 In 0 hour(s) and 47 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700)
[16:12:31] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (fgiunchedi) Open→Invalid Totally fair to mark invalid (done) @jbond, tbh I don't remember what the issue was!
[16:12:43] <wikibugs>	 (CR) Urbanecm: [C: +2] Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: Urbanecm)
[16:12:48] <elukey>	 !log stop kafka-main codfw maintenance - T341558
[16:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:52] <stashbot>	 T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558
[16:14:21] <wikibugs>	 (CR) Elukey: "Looks great, do you mind to add a use case in the .fixtures? So we can see a diff etc.." [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[16:15:41] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (jhathaway) Yeah I agree we should allow those, as you mention they are sometimes useful:  `     exec { '/usr/bin/cat /etc/os-release': logoutput => true }     notify { "Fact ${...
[16:17:20] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: fix ores-legacy app [deployment-charts] - https://gerrit.wikimedia.org/r/938888
[16:18:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:20:03] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170)
[16:22:35] <wikibugs>	 (PS1) EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435)
[16:23:44] <wikibugs>	 (PS2) EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435)
[16:24:10] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix ores-legacy app [deployment-charts] - https://gerrit.wikimedia.org/r/938888 (owner: Ilias Sarantopoulos)
[16:25:13] <wikibugs>	 (Merged) jenkins-bot: ml-services: fix ores-legacy app [deployment-charts] - https://gerrit.wikimedia.org/r/938888 (owner: Ilias Sarantopoulos)
[16:28:12] <wikibugs>	 (CR) Volans: [C: +1] "No blocker for me, just a suggestion for the status filter and a couple of inline question." [cookbooks] - https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: Ayounsi)
[16:28:50] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[16:29:18] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[16:29:37] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[16:29:48] <wikibugs>	 SRE, API Platform, Anti-Harassment, Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (TBurmeister)
[16:30:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: Urbanecm)
[16:30:52] <wikibugs>	 (CR) Giuseppe Lavagetto: confd: allow running multiple instances (7 comments) [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[16:31:13] <wikibugs>	 (PS10) Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669)
[16:31:47] <wikibugs>	 (CR) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[16:32:57] <wikibugs>	 (Merged) jenkins-bot: Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: Urbanecm)
[16:33:13] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]]
[16:33:17] <stashbot>	 T341994: New version of Special:Impact returns "0 edits so far" at some wikis even when edits have been made - https://phabricator.wikimedia.org/T341994
[16:33:20] <wikibugs>	 (CR) Herron: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[16:34:40] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[16:34:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42506/console" [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[16:35:07] <wikibugs>	 (PS1) Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705)
[16:35:10] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: Ayounsi)
[16:35:30] <wikibugs>	 (CR) CI reject: [V: -1] search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[16:35:53] <wikibugs>	 (CR) Herron: [C: +1] "sounds good -- I'll try merging this tomorrow morning eastern, time permitting" [puppet] - https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: Jbond)
[16:35:56] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[16:36:02] <wikibugs>	 (CR) Volans: [C: +1] "tested locally" [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[16:36:24] <wikibugs>	 (CR) Herron: [C: +1] Logstash: implement availability SLO [grafana-grizzly] - https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: Cwhite)
[16:38:16] <wikibugs>	 (CR) DCausse: search-zk: Provision hostnames for new ZK cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[16:41:56] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]] (duration: 08m 43s)
[16:42:02] * urbanecm done
[16:42:04] <stashbot>	 T341994: New version of Special:Impact returns "0 edits so far" at some wikis even when edits have been made - https://phabricator.wikimedia.org/T341994
[16:44:44] <wikibugs>	 (CR) Volans: [C: -1] "I'm ok with the approach, left a comment for a couple of errors inline, I didn't review it in all details yet, so not sure it does exactly" [software/homer] - https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: Ayounsi)
[16:48:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Jclark-ctr) @Andrew  usually we use the raid controller to configure os drives.  I do not know if our Os install would recognize the correct drives...
[16:50:35] <wikibugs>	 (PS2) Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705)
[16:50:58] <wikibugs>	 (CR) CI reject: [V: -1] search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[16:51:16] <wikibugs>	 (PS3) Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705)
[16:51:41] <wikibugs>	 (CR) CI reject: [V: -1] search-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[16:55:56] <wikibugs>	 (CR) Bking: search-zk: Provision hostnames for new ZK cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[16:56:27] <wikibugs>	 (PS4) Bking: flink-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705)
[16:58:01] <wikibugs>	 (PS1) Jbond: vrts: drop bashisms and fix other CI issues [puppet] - https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064)
[16:58:03] <wikibugs>	 (PS1) Jbond: kerberos: fix bashisms [puppet] - https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064)
[16:58:05] <wikibugs>	 (PS1) Jbond: kerberos: Fix bashisms [puppet] - https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064)
[16:58:07] <wikibugs>	 (PS1) Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064)
[16:58:09] <wikibugs>	 (PS1) Jbond: install_server: updaate to use bash [puppet] - https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064)
[16:58:11] <wikibugs>	 (PS1) Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064)
[16:59:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Papaul) @Jclark-ctr @Andrew even with the SW raid you still need the controller to be able to see the drives.
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700)
[17:00:04] <jouncebot>	 ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700).
[17:02:55] <wikibugs>	 (CR) Bking: flink-zk: Provision hostnames for new ZK cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[17:18:08] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[17:19:58] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 50s)
[17:23:57] <wikibugs>	 (PS1) Fabfur: hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - https://gerrit.wikimedia.org/r/938902
[17:30:02] <wikibugs>	 (PS19) Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497)
[17:30:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] nftables: spec: introduce service tests (7 comments) [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: Arturo Borrero Gonzalez)
[17:31:25] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[17:31:36] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 2 NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42507/console" [puppet] - https://gerrit.wikimedia.org/r/938902 (owner: Fabfur)
[17:32:24] <wikibugs>	 (CR) Ebernhardson: [C: +1] flink-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[17:33:57] <wikibugs>	 (CR) Bking: [C: +2] flink-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[17:34:06] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 02m 41s)
[17:52:43] <wikibugs>	 (CR) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares (1 comment) [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[17:52:52] <wikibugs>	 (PS4) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:03:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:19:59] <wikibugs>	 (PS1) Ssingh: dnsrecursor: remove redundant parameter install_from_component [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611)
[18:20:24] <wikibugs>	 (PS5) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:20:32] <wikibugs>	 (CR) Jbond: [C: +1] sre.hosts.decommission: fix call to downtime [cookbooks] - https://gerrit.wikimedia.org/r/937508 (owner: Volans)
[18:20:57] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42509/console" [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[18:22:36] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[18:23:07] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/937509 (owner: Volans)
[18:23:44] <wikibugs>	 (CR) Andrea Denisse: [C: +1] prometheus: create /etc/prometheus when needed [puppet] - https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[18:25:11] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (Dwisehaupt) @MoritzMuehlenhoff I was granted the ability to do the authdns-update in T244901. We are part of the `fr-tech-admins` group that I believ...
[18:25:28] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] dnsrecursor: remove redundant parameter install_from_component [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[18:27:09] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[18:27:21] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[18:29:55] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (Dwisehaupt) >>! In T341440#9020951, @Dwisehaupt wrote: > @MoritzMuehlenhoff I was granted the ability to do the authdns-update in T244901. We are par...
[18:30:13] <wikibugs>	 (PS6) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:30:52] <wikibugs>	 (PS2) Jbond: tox: drop the minor version requierment on admin checks [puppet] - https://gerrit.wikimedia.org/r/938858
[18:39:28] <wikibugs>	 (CR) Jbond: "lgtm but we should remove the debug of cfssl_cmd.stdout" [cookbooks] - https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: Ayounsi)
[18:41:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt) @Papaul Sorry for the delay, I was out last week. This appears to have fixed it up and the host is starting to build. Thanks!
[18:44:14] <wikibugs>	 (PS1) Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[18:44:29] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: Arturo Borrero Gonzalez)
[18:45:25] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[18:47:19] <wikibugs>	 (CR) Jbond: [C: +2] rsyslog::receiver: update docs and add types [puppet] - https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:48:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:50:50] <wikibugs>	 (PS2) Jbond: profile::cassandra: Add spec test [puppet] - https://gerrit.wikimedia.org/r/937979
[18:50:58] <wikibugs>	 (CR) Jbond: [C: +2] monkey_patch: fix up monkey patch [puppet] - https://gerrit.wikimedia.org/r/937978 (owner: Jbond)
[18:52:42] <wikibugs>	 (CR) Jbond: [C: +2] profile::cassandra: Add spec test [puppet] - https://gerrit.wikimedia.org/r/937979 (owner: Jbond)
[18:57:06] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/937952 (owner: Hashar)
[18:57:10] <wikibugs>	 (CR) Jbond: [C: +2] Rakefile: add tasks to run a global shellcheck [puppet] - https://gerrit.wikimedia.org/r/937952 (owner: Hashar)
[18:58:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Papaul) @Dwisehaupt you welcome
[18:58:26] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[18:58:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Papaul)
[18:58:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:59:51] <wikibugs>	 (CR) Jbond: [C: +2] tox: drop the minor version requierment on admin checks [puppet] - https://gerrit.wikimedia.org/r/938858 (owner: Jbond)
[19:05:58] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (Jdforrester-WMF)
[19:06:32] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (Jdforrester-WMF)
[19:16:12] <wikibugs>	 (PS5) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[19:16:51] <wikibugs>	 (CR) Jdlrobson: [C: +1] bnwikiquote: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[19:33:07] <wikibugs>	 (PS1) Eevans: cassandra: transition 3.11.14 from 'dev' to '3.x' [puppet] - https://gerrit.wikimedia.org/r/938917
[19:36:17] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938917 (owner: Eevans)
[19:37:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:38:43] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint
[19:42:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:42:11] <icinga-wm>	 PROBLEM - AQS root url on aqs1010 is CRITICAL: connect to address 10.64.0.40 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:49:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:52:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:54:02] <wikibugs>	 (CR) Eevans: [C: +2] cassandra: uninstall cassandra-twcs deployment repository [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[19:57:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:58:54] <wikibugs>	 (PS1) Eevans: Revert "cassandra: uninstall cassandra-twcs deployment repository" [puppet] - https://gerrit.wikimedia.org/r/938681
[19:59:37] <wikibugs>	 (CR) Eevans: [C: +2] Revert "cassandra: uninstall cassandra-twcs deployment repository" [puppet] - https://gerrit.wikimedia.org/r/938681 (owner: Eevans)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2000).
[20:00:05] <jouncebot>	 koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:38] <koi>	 o/
[20:02:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:03:26] <taavi>	 hey. I can deploy
[20:03:35] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1001.eqiad.wmnet
[20:03:44] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:04:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[20:05:06] <wikibugs>	 (Merged) jenkins-bot: bnwikiquote: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[20:05:22] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]]
[20:05:27] <stashbot>	 T341910: Update Bengali wikiquote wordmark - https://phabricator.wikimedia.org/T341910
[20:06:45] <logmsgbot>	 !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:06:49] <taavi>	 koi: please test
[20:06:54] <koi>	 looking
[20:07:51] <koi>	 taavi, tested on vector-2022 and LGTM
[20:07:54] <taavi>	 syncing
[20:12:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:13:57] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]] (duration: 08m 34s)
[20:14:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:14:01] <stashbot>	 T341910: Update Bengali wikiquote wordmark - https://phabricator.wikimedia.org/T341910
[20:17:29] <koi>	 hi taavi, could you please purge "static/images/mobile/copyright/wikiquote-wordmark-bn.svg"? thx
[20:17:44] <wikibugs>	 (PS2) Cory Massaro: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[20:17:46] <wikibugs>	 (PS5) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:18:13] <taavi>	 koi: I think {{done}}, could you double-check?
[20:18:26] <wikibugs>	 (CR) CI reject: [V: -1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:18:41] <koi>	 it's done now :)
[20:19:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:19:05] <taavi>	 !log taavi@mwmaint1002 ~ $ echo "https://en.wikipedia.org/static/images/mobile/copyright/wikiquote-wordmark-bn.svg" | mwscript purgeList.php --wiki enwiki
[20:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:07] <taavi>	 awesome
[20:21:38] <wikibugs>	 (PS6) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:26:32] <wikibugs>	 (PS7) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:26:40] <wikibugs>	 (CR) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:34:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:37:57] <wikibugs>	 (CR) Jforrester: wikifunctions: Attempt to write out our main config as JSON (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:43:30] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:51:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[20:59:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[20:59:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:59:44] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors
[20:59:47] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors
[21:00:06] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2100).
[21:00:12] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[21:00:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[21:01:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:01:39] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:05:56] <wikibugs>	 (PS1) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:07:01] <wikibugs>	 (PS2) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:07:18] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:07:55] <wikibugs>	 (CR) Cory Massaro: [C: +2] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[21:08:42] <wikibugs>	 (CR) Cory Massaro: [C: +2] wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[21:09:30] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[21:09:33] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[21:09:55] <dancy>	 jouncebot nowandnext
[21:09:55] <jouncebot>	 For the next 1 hour(s) and 50 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2100)
[21:09:55] <jouncebot>	 In 4 hour(s) and 50 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0200)
[21:15:12] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, any reason for the inconsistency in using brackets around variables in string interpolation? I would probably just always use " [puppet] - https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:15:47] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[21:16:35] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 47s)
[21:17:06] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:21:00] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good other than one minor issue" [puppet] - https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:32:05] <wikibugs>	 SRE, Traffic: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (BCornwall)
[21:39:51] <wikibugs>	 (PS3) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:40:11] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:42:05] <wikibugs>	 (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/938931/2062/" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:43:04] <wikibugs>	 (PS1) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:43:28] <wikibugs>	 (CR) CI reject: [V: -1] Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:44:51] <wikibugs>	 (PS2) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:47:11] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:50:33] <wikibugs>	 (PS3) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:52:58] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:52:58] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1001.eqiad.wmnet
[21:53:04] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm executed w...
[21:53:29] <icinga-wm>	 RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[21:53:30] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:54:11] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:29] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint (duration: 136m 46s)
[21:55:31] <wikibugs>	 (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/938939/2064/" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:55:32] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint
[21:57:42] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint (duration: 02m 10s)
[22:00:31] <wikibugs>	 (PS1) Jdlrobson: Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945
[22:04:22] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:04:32] <wikibugs>	 (CR) JHathaway: [C: -1] install_server: updaate to use bash (3 comments) [puppet] - https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[22:18:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:35:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt)
[22:46:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt) Open→Resolved a:Dwisehaupt Host is installed and has a base config. Further work will be tracked in T342064.
[23:05:12] <wikibugs>	 (CR) Cwhite: [C: +1] "Looks ok per PCC." [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[23:05:26] <wikibugs>	 (PS4) Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219)
[23:05:28] <wikibugs>	 (PS6) Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945)
[23:05:30] <wikibugs>	 (PS6) Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[23:05:32] <wikibugs>	 (PS5) Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[23:05:48] <wikibugs>	 (CR) Cwhite: [C: +1] New role: titan [puppet] - https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[23:06:25] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: create /etc/prometheus when needed [puppet] - https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[23:10:01] <wikibugs>	 (PS2) Cwhite: Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945 (owner: Jdlrobson)
[23:19:52] <wikibugs>	 (CR) Cwhite: [C: +2] Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945 (owner: Jdlrobson)
[23:24:48] <wikibugs>	 (CR) Gergő Tisza: [C: +1] IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[23:25:32] <wikibugs>	 (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:25:54] <wikibugs>	 (CR) Gergő Tisza: [C: +1] IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[23:26:01] <wikibugs>	 (PS2) Cwhite: logstash: remove thanos log cloning [puppet] - https://gerrit.wikimedia.org/r/937604 (https://phabricator.wikimedia.org/T234565)
[23:27:06] <wikibugs>	 (PS2) Cwhite: logstash: remove grafana log cloning [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565)
[23:27:14] <wikibugs>	 (CR) Krinkle: [C: +1] mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[23:27:18] <wikibugs>	 (PS2) Cwhite: logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565)
[23:27:34] <wikibugs>	 (PS2) Cwhite: logstash: remove haproxy log cloning [puppet] - https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565)
[23:35:58] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:37:56] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM!!" [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:38:33] <wikibugs>	 (CR) Andrea Denisse: [C: +1] logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:38:53] <wikibugs>	 (CR) Andrea Denisse: [C: +1] logstash: remove pybal log cloning [puppet] - https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)