[00:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332
[00:38:40] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332 (owner: TrainBranchBot)
[00:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:54:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938332 (owner: TrainBranchBot)
[01:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:58:23] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:23] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:22] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:05:37] (PS3) Hashar: wm-checks-api: check undefined real_author [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:07:18] (CR) Hashar: [C: +2] "Thank you Paladox, that got noticed by Timo as well on T328484 and only happens on old changes. I have slightly amended the commit messag" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:07:50] (Merged) jenkins-bot: wm-checks-api: check undefined real_author [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938318 (https://phabricator.wikimedia.org/T328484) (owner: Paladox)
[04:08:51] !log hashar@deploy1002 Started deploy [gerrit/gerrit@cad3002]: wm-checks-api: check undefined real_author - T328484
[04:08:55] T328484: [wm-checks-api] error: changeMessage.real_author is undefined - https://phabricator.wikimedia.org/T328484
[04:08:59] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@cad3002]: wm-checks-api: check undefined real_author - T328484 (duration: 00m 08s)
[04:30:42] (PS1) Hashar: wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484)
[04:32:09] (CR) Hashar: [C: +2] wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484) (owner: Hashar)
[04:32:39] (Merged) jenkins-bot: wm-checks-api: check undefined real_author (2) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/938472 (https://phabricator.wikimedia.org/T328484) (owner: Hashar)
[04:33:08] !log hashar@deploy1002 Started deploy [gerrit/gerrit@1153a16]: wm-checks-api: check undefined real_author (2) - T328484
[04:33:12] T328484: [wm-checks-api] error: changeMessage.real_author is undefined - https://phabricator.wikimedia.org/T328484
[04:33:16] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@1153a16]: wm-checks-api: check undefined real_author (2) - T328484 (duration: 00m 08s)
[04:35:06] isn't the `!log Started` and `!log Finished` something new?
[04:35:39] nop
[04:35:44] always went in pair apparently.
[05:17:33] SRE, ops-eqiad, decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (Marostegui)
[05:31:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:58:25] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:06] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:24:05] (CR) Peter Fischer: [C: +2] "Deployed maven artefact and debian package" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[06:33:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Amir1, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T0700).
[07:00:04] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:32] (PS1) Marostegui: install_server: Allow reimage pc1015, pc1016 [puppet] - https://gerrit.wikimedia.org/r/938477 (https://phabricator.wikimedia.org/T341271)
[07:04:08] (CR) Marostegui: [C: +2] install_server: Allow reimage pc1015, pc1016 [puppet] - https://gerrit.wikimedia.org/r/938477 (https://phabricator.wikimedia.org/T341271) (owner: Marostegui)
[07:05:06] hello, I had a backport scheduled but I'm just seeing the link in the schedule is wrong and didn't cherry-pick. I'm gonna amend it now.
[07:12:42] (PS1) Marostegui: install_server: Allow reimage pc2015, pc2016 [puppet] - https://gerrit.wikimedia.org/r/938478 (https://phabricator.wikimedia.org/T341270)
[07:14:53] (CR) Marostegui: [C: +2] install_server: Allow reimage pc2015, pc2016 [puppet] - https://gerrit.wikimedia.org/r/938478 (https://phabricator.wikimedia.org/T341270) (owner: Marostegui)
[07:30:39] (PS1) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:30:42] (PS1) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:31:16] (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:31:23] (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:48:45] (PS2) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:48:47] (PS2) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:49:17] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (JMeybohm) AIUI the only thing talking to the ev...
[07:49:23] (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:50:03] (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:52:13] (CR) Filippo Giunchedi: [C: +2] udp2log: run mw-log-cleanup after logrotate [puppet] - https://gerrit.wikimedia.org/r/938228 (https://phabricator.wikimedia.org/T341691) (owner: Filippo Giunchedi)
[07:54:38] (PS3) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:54:40] (PS3) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:55:36] (CR) JMeybohm: Use cert-manager certificates instead of cergen for tls termination (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[07:55:41] (CR) JMeybohm: Testing hack: Update ipoid to certmanager (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[07:55:59] (CR) CI reject: [V: -1] noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:56:02] (CR) CI reject: [V: -1] noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[07:56:41] (PS4) Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[07:56:43] (PS4) Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[07:57:04] (PS1) Marostegui: report_users: Remove 10.64.0.134 [software] - https://gerrit.wikimedia.org/r/938647
[07:57:38] (PS6) JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033)
[07:57:40] (PS6) JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033)
[07:57:42] (PS6) JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[07:58:38] (PS2) Marostegui: report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647
[07:59:41] (CR) Marostegui: [C: +2] report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647 (owner: Marostegui)
[08:01:03] (Merged) jenkins-bot: report_users: Remove 10.64.0.13[45] [software] - https://gerrit.wikimedia.org/r/938647 (owner: Marostegui)
[08:24:01] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Last Friday we've done some troubleshooting and tested a lot of different configurations, thanks @SLyngshede-WMF again! In...
[08:27:34] (CR) Btullis: [C: +1] "Looks good to me." [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[08:28:16] (CR) Btullis: [C: +2] Update the cumin alias for analytics-airflow [puppet] - https://gerrit.wikimedia.org/r/929702 (https://phabricator.wikimedia.org/T333697) (owner: Btullis)
[08:30:12] !log disable puppet on all cp* hosts in eqsin to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938002 (T340983)
[08:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:22] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:31:54] (CR) Fabfur: [V: +1 C: +2] hiera: apply silent-drop on port 80 to all eqsin cp hosts [puppet] - https://gerrit.wikimedia.org/r/938002 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:37:09] !log enable puppet on cp5024 and cp5032 to deploy 938002
[08:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:59] PROBLEM - puppet last run on kafka-test1006 is CRITICAL: CRITICAL: Puppet has been disabled for 604825 seconds, message: Elukey - elukey, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:51:50] !log enable puppet on A:cp-eqsin to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938002 (T340983)
[08:51:52] running puppet --^
[08:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:54] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:53:10] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:54:06] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:55:21] (PS1) Btullis: Deploy airflow version 2.6.3 to analytics_test [puppet] - https://gerrit.wikimedia.org/r/938803 (https://phabricator.wikimedia.org/T336286)
[08:56:29] RECOVERY - puppet last run on kafka-test1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:01:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet
[09:02:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[09:06:26] (CR) Ayounsi: "recheck" [software/homer] (gnmi) - https://gerrit.wikimedia.org/r/927736 (owner: Volans)
[09:08:25] (CR) CI reject: [V: -1] WIP: first scaffolding fo gNMI support [software/homer] (gnmi) - https://gerrit.wikimedia.org/r/927736 (owner: Volans)
[09:12:44] (CR) Elukey: [C: +2] admin_ng: set better resourcequotas for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/938252 (owner: Elukey)
[09:16:55] (CR) Btullis: [C: +2] Deploy airflow version 2.6.3 to analytics_test [puppet] - https://gerrit.wikimedia.org/r/938803 (https://phabricator.wikimedia.org/T336286) (owner: Btullis)
[09:17:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[09:17:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:17:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:18:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:18:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[09:18:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:19:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet
[09:22:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet
[09:22:37] (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[09:24:39] (CR) Ayounsi: [C: +2] Ignore LAGs from test_port_block_consistency (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/932400 (owner: Ayounsi)
[09:26:07] (PS1) Fabfur: hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983)
[09:26:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[09:27:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[09:27:37] (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[09:29:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet
[09:30:55] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
[09:31:41] (PS1) Filippo Giunchedi: base: bump cadvisor rollout to 45% in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027)
[09:32:05] PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:45] looking for a signoff on https://gerrit.wikimedia.org/r/c/operations/puppet/+/938810
[09:32:46] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 3 NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42496/console" [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:35:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[09:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:37:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:55] (PS2) Fabfur: hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983)
[09:38:15] (CR) Hashar: [C: +1] "I guess that is how I should have written it in the first place :)" [puppet] - https://gerrit.wikimedia.org/r/937978 (owner: Jbond)
[09:38:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[09:38:53] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
[09:39:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet
[09:39:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:40:57] (CR) Vgutierrez: [C: +1] hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:42:26] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/938226 (https://phabricator.wikimedia.org/T341045) (owner: ArielGlenn)
[09:42:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet
[09:42:42] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-stretch1001.eqiad.wmnet
[09:43:19] (CR) ArielGlenn: [C: +2] add jebe and xcollazo to nagios command access [puppet] - https://gerrit.wikimedia.org/r/938226 (https://phabricator.wikimedia.org/T341045) (owner: ArielGlenn)
[09:43:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet
[09:43:51] SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (ayounsi)
[09:44:45] !log disabled puppet on A:cp hosts in ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938807 (T340983)
[09:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:49] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[09:45:10] (CR) Fabfur: [C: +2] hiera: apply silent-drop on port 80 to ulsfo cp hosts [puppet] - https://gerrit.wikimedia.org/r/938807 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[09:46:00] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet
[09:48:35] !log enabled puppet on A:cp hosts in ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938807 (T340983) (hosts will run puppet with the usual schedule)
[09:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1001.eqiad.wmnet
[09:50:20] (CR) Ayounsi: [C: +1] "I don't know the tool itself, but as long as the rollback is easy I'd say +1" [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:50:55] SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (ayounsi)
[09:51:26] (CR) Filippo Giunchedi: [C: +2] base: bump cadvisor rollout to 45% in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/938810 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:51:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1002.eqiad.wmnet
[09:56:08] SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi)
[09:57:17] PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:15] PROBLEM - Check systemd state on ms-fe1011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:22] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1002.eqiad.wmnet
[09:59:45] PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1000)
[10:04:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet
[10:08:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet
[10:09:19] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:56] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:10:30] mmhh cadvisor failures are me
[10:10:39] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[10:10:59] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet
[10:11:41] (PS1) Elukey: profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479)
[10:11:44] (PS1) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:11:45] RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:06] (CR) CI reject: [V: -1] make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[10:13:12] (CR) Ilias Sarantopoulos: [C: +1] "60s LGTM!" [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[10:13:57] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:59] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:14:35] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[10:14:39] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet
[10:15:19] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet
[10:17:07] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:39] (PS1) Giuseppe Lavagetto: noc: stop using script to populate database data URIs [puppet] - https://gerrit.wikimedia.org/r/938818 (https://phabricator.wikimedia.org/T341859)
[10:18:05] PROBLEM - Check systemd state on ms-fe1012 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:04] (PS2) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:20:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:57] RECOVERY - Check systemd state on ms-fe1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:17] RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet
[10:22:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[10:23:45] PROBLEM - Check systemd state on ms-be1065 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:50] (CR) Klausman: [C: +1] profile::services_proxy::envoy: increase timeout for inference [puppet] - https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[10:23:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet
[10:24:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[10:24:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet
[10:30:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet
[10:31:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet
[10:31:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
[10:32:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet
[10:33:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:33:37] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[10:33:53] (PS3) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[10:33:55] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
[10:34:31] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[10:36:21] RECOVERY - Check systemd state on ms-fe1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:27] (CR) Volans: [C: +2] irc: small refactor to cleanup the code [software/pywmflib] - https://gerrit.wikimedia.org/r/937499 (owner: Volans)
[10:38:21] RECOVERY - Check systemd state on ms-be1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet
[10:41:17] (Merged) jenkins-bot: irc: small refactor to cleanup the code [software/pywmflib] - https://gerrit.wikimedia.org/r/937499 (owner: Volans)
[10:41:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[10:44:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1070.eqiad.wmnet with OS bullseye
[10:45:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[10:45:18] (CR) ArielGlenn: "ppc output looks good, see https://puppet-compiler.wmflabs.org/output/938816/42498/ and especially the output for dumpsdata1007, the one s" [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[10:45:32] (PS1) Arturo Borrero Gonzalez: CR: cloud-host: allow return traffic for PDNS servers [homer/public] - https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966)
[10:45:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[10:48:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
[10:50:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[10:50:47] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1652 days) https://wikitech.wikimedia.org/wiki/Logs
[10:51:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
[10:52:36] (PS1) Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - https://gerrit.wikimedia.org/r/938820
[10:52:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[10:53:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:53:59] (CR) Ladsgroup: [C: +1] sre.hosts.decommission: fix call to downtime [cookbooks] - https://gerrit.wikimedia.org/r/937508 (owner: Volans)
[10:54:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[10:54:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
[10:55:52] SRE-tools, Spicerack: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (Volans) p:Triage→Medium
[10:58:41] (CR) Arturo Borrero Gonzalez: [C: +2] wikireplicas: relocate some hardcoded data into hiera [puppet] - https://gerrit.wikimedia.org/r/938238 (owner: Arturo Borrero Gonzalez)
[11:01:16] (PS2) Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - https://gerrit.wikimedia.org/r/938820
[11:04:27] (CR) Volans: "This change is ready for review."
RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1652 days) https://wikitech.wikimedia.org/wiki/Logs [10:51:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [10:52:36] (03PS1) 10Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 [10:52:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet [10:53:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:53:59] (03CR) 10Ladsgroup: [C: 03+1] sre.hosts.decommission: fix call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937508 (owner: 10Volans) [10:54:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [10:54:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [10:55:52] 10SRE-tools, 10Spicerack: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10Volans) p:05Triage→03Medium [10:58:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikireplicas: relocate some hardcoded data into hiera [puppet] - 10https://gerrit.wikimedia.org/r/938238 (owner: 10Arturo Borrero Gonzalez) [11:01:16] (03PS2) 10Elukey: WIP: ml-services: set knative concurrency values for ml pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 [11:04:27] (03CR) 10Volans: "This change is ready for review." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/938821 (owner: 10Volans) [11:07:31] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1070.eqiad.wmnet with reason: host reimage [11:07:58] 10SRE, 10Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (10BTullis) @MoritzMuehlenhoff - I've just bumped into this issue on upgrading the first prod hadoop worker and I found this bug reference, which seems highly releva... [11:08:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [11:08:27] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [11:08:32] (03PS2) 10Arturo Borrero Gonzalez: eqiad1: decomission cloudcontrol1005.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/938235 (https://phabricator.wikimedia.org/T341495) [11:08:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet [11:10:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [11:10:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1070.eqiad.wmnet with reason: host reimage [11:11:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] eqiad1: decomission cloudcontrol1005.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/938235 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [11:12:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1005.wikimedia.org [11:14:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [11:15:35] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [11:17:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [11:18:43] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:19:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:03] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T332314) (owner: 10Bking) [11:22:15] (03CR) 10JMeybohm: [C: 03+1] profile::services_proxy::envoy: increase timeout for inference [puppet] - 10https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [11:22:35] 10SRE: Cannot download large (3GB) PDF files from commons - https://phabricator.wikimedia.org/T341755 (10Aklapper) [11:22:44] (03CR) 10Btullis: "I see that the functionality is good, but I don't see why you need to make two new profiles for this task." 
[puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [11:23:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [11:23:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova fullstack: updated harcoded access to the list of controllers [puppet] - 10https://gerrit.wikimedia.org/r/938831 (https://phabricator.wikimedia.org/T341495) [11:26:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova fullstack: updated harcoded access to the list of controllers [puppet] - 10https://gerrit.wikimedia.org/r/938831 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [11:26:46] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [11:28:06] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia Enterprise: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10Aklapper) a:05RBrounley_WMF→03None Removing inactive task assignee. (Please do so as part of the team's offboardin... 
[11:29:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [11:29:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [11:30:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [11:30:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [11:33:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [11:35:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1070.eqiad.wmnet with OS bullseye [11:35:52] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1168 days) https://wikitech.wikimedia.org/wiki/Logs [11:36:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [11:36:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [11:38:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:56] !log mvernon@cumin1001 START 
- Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet [11:39:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet [11:45:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet [11:45:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet [11:46:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [11:46:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [11:51:15] (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938840 [11:52:44] (03PS1) 10Ladsgroup: realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) [11:53:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [11:54:26] (03CR) 10Marostegui: "Does this live in x1?" 
[puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [11:56:24] (03CR) 10Ladsgroup: realm: Add two new private tables of CheckUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [11:56:47] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42499/console" [puppet] - 10https://gerrit.wikimedia.org/r/938840 (owner: 10Fabfur) [12:01:03] (03PS2) 10Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938840 [12:01:20] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [12:01:20] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:01:21] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1005.wikimedia.org [12:03:52] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42500/console" [puppet] - 10https://gerrit.wikimedia.org/r/938840 (owner: 10Fabfur) [12:23:27] (03CR) 10Marostegui: "Ok then this needs a mariadb restart" [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [12:24:40] (03CR) 10Marostegui: "Considering what happened last time we restarted the hosts, I would suggest to:" [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [12:26:51] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a 
k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) [12:27:01] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) [12:30:05] PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100% [12:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:30:37] RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [12:34:04] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10Jclark-ctr) [12:34:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission dbproxy1013.eqiad.wmnet - https://phabricator.wikimedia.org/T341711 (10Jclark-ctr) 05Open→03Resolved [12:34:39] PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:03] RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [12:35:13] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jbond) [[ https://docs.google.com/document/d/1L2s9QqJRhKpngmWHyoCJdr6eHK5z3tm6i4zroJKJt-g/edit | Notes from in pe... 
[12:37:09] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr) [12:37:16] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1067.eqiad.wmnet - https://phabricator.wikimedia.org/T341207 (10Jclark-ctr) 05Open→03Resolved [12:37:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/937509 (owner: 10Volans) [12:38:29] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1068.eqiad.wmnet - https://phabricator.wikimedia.org/T341208 (10Jclark-ctr) 05Open→03Resolved [12:39:52] (03CR) 10Filippo Giunchedi: [C: 03+2] Uninstall Diamond everywhere [puppet] - 10https://gerrit.wikimedia.org/r/935103 (https://phabricator.wikimedia.org/T317032) (owner: 10Majavah) [12:40:35] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr) [12:41:59] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1058.eqiad.wmnet - https://phabricator.wikimedia.org/T338227 (10Jclark-ctr) 05Open→03Resolved [12:42:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [12:43:46] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) @jbond To answer the question #1: This is the configuration from a client, where encryption and decrypt... 
[12:44:25] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1065.eqiad.wmnet - https://phabricator.wikimedia.org/T341205 (10Jclark-ctr) 05Open→03Resolved [12:46:49] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1059.eqiad.wmnet - https://phabricator.wikimedia.org/T338408 (10Jclark-ctr) 05Open→03Resolved [12:47:57] (03CR) 10Elukey: [C: 03+2] profile::services_proxy::envoy: increase timeout for inference [puppet] - 10https://gerrit.wikimedia.org/r/938815 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [12:49:27] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/938840 (owner: 10Fabfur) [12:50:26] 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) @Isaac @Htriedman @Jcross could you confirm that this is working as expected and can be closed? [12:50:31] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1060.eqiad.wmnet - https://phabricator.wikimedia.org/T338409 (10Jclark-ctr) 05Open→03Resolved [12:50:52] (03PS1) 10LSobanski: Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) [12:52:35] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1064.eqiad.wmnet - https://phabricator.wikimedia.org/T341204 (10Jclark-ctr) 05Open→03Resolved [12:53:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:54:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:54:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[12:55:29] (03CR) 10EoghanGaffney: [C: 03+1] Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: 10LSobanski) [12:55:31] (03PS3) 10Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938840 [12:55:57] (03CR) 10Fabfur: hiera: apply silent-drop on port 80 to codfw cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938840 (owner: 10Fabfur) [12:56:01] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1069.eqiad.wmnet - https://phabricator.wikimedia.org/T341209 (10Jclark-ctr) 05Open→03Resolved [12:56:38] (03CR) 10Jbond: "some minor nits inline (some pre-existing), the main im concerned about is:" [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [12:56:47] (03CR) 10Fabfur: [C: 03+2] hiera: apply silent-drop on port 80 to codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938840 (owner: 10Fabfur) [12:56:53] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:41] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) As requested, @jbond your two users on the replica and production dumped via API (`curl "https://gitlab-replica.wikimedia.o... 
[12:57:54] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr) [12:58:05] !log disabled puppet on A:cp-codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983) [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:08] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [12:58:26] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1066.eqiad.wmnet - https://phabricator.wikimedia.org/T341206 (10Jclark-ctr) 05Open→03Resolved [12:59:16] (03PS1) 10Ssingh: depool esams: router migration [dns] - 10https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997) [12:59:40] (03PS3) 10Elukey: ml-services: set knative concurrency values for ml pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1300). [13:00:04] sergi0 and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] hi [13:00:15] o/ [13:00:27] I’ll be in a meeting for a bit longer, anyone else around to deploy? [13:00:32] I can deploy [13:00:33] (otherwise I can probably do it in 30 minutes or so) [13:00:34] heads-up: esams depooling shortly [13:00:34] thanks! [13:00:50] sukhe: just to confirm, can we continue with the deployment as normal? 
[13:00:59] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1014.eqiad.wmnet - https://phabricator.wikimedia.org/T341782 (10Jclark-ctr) 05Open→03Resolved [13:01:00] sorry, and yes, you can continue, but just a heads-up [13:01:06] ack [13:01:09] for the channel mostly [13:01:13] it’s just depooled from traffic, but deployments still go to it? [13:01:27] it has no appservers [13:01:27] (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938306 (https://phabricator.wikimedia.org/T341865) (owner: 10Urbanecm) [13:01:32] Lucas_WMDE: edge site [13:01:34] Lucas_WMDE: no appservers in esams [13:01:34] oh right [13:01:47] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997) (owner: 10Ssingh) [13:02:28] (03CR) 10Ssingh: [C: 03+2] depool esams: router migration [dns] - 10https://gerrit.wikimedia.org/r/938847 (https://phabricator.wikimedia.org/T337997) (owner: 10Ssingh) [13:02:40] aanzx: going to deploy your config changes while we wait for CI for sergi0's backport [13:02:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938677 (https://phabricator.wikimedia.org/T341940) (owner: 10Anzx) [13:02:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: 10Anzx) [13:02:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: 10Anzx) [13:02:56] ok [13:02:57] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (10Jclark-ctr) [13:02:57] RECOVERY - Check 
systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:25] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (10Jclark-ctr) 05Open→03Resolved [13:03:28] !log run authdns-update to depool esams [13:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:13] (03Merged) 10jenkins-bot: change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938677 (https://phabricator.wikimedia.org/T341940) (owner: 10Anzx) [13:04:28] (03PS4) 10Majavah: Add appendix namespace aliases on huwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: 10Anzx) [13:04:33] (03PS4) 10Majavah: robots.txt: Disable indexing draft-related pages on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: 10Anzx) [13:04:38] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: 10Anzx) [13:04:40] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: 10Anzx) [13:04:40] !log run puppet on cp2027 to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983) [13:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 [13:04:44] right. 
should have seen that coming [13:06:14] (03Merged) 10jenkins-bot: Add appendix namespace aliases on huwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938315 (https://phabricator.wikimedia.org/T341926) (owner: 10Anzx) [13:06:18] (03Merged) 10jenkins-bot: robots.txt: Disable indexing draft-related pages on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938324 (https://phabricator.wikimedia.org/T341958) (owner: 10Anzx) [13:06:30] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frpig1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340128 (10Jclark-ctr) [13:06:50] !log taavi@deploy1002 Started scap: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]] [13:06:53] 10SRE, 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frpig1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340128 (10Jclark-ctr) 05Open→03Resolved [13:06:56] T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940 [13:06:57] T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958 [13:06:57] T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926 [13:07:16] !log enabled puppet on A:cp-codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938840 (T340983) (hosts will run puppet with the usual schedule) [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:20] (03PS1) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 [13:08:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet 
[13:09:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet [13:10:26] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) I did some more testing today and can confirm that the required config is `cas.authn.oidc.id-token.include-id-token-claims=... [13:12:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it [13:13:09] 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10Jclark-ctr) a:03Jclark-ctr [13:13:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ores2003.codfw.wmnet with reason: DCops working on it [13:13:22] 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10Jclark-ctr) 05Open→03Resolved [13:15:06] (03CR) 10Ladsgroup: realm: Add two new private tables of CheckUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [13:15:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [13:15:57] (03PS2) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 [13:16:25] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) >>! In T297314#9018519, @JMeyb... 
[13:16:37] !log taavi@deploy1002 taavi and anzx: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:16:43] T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940 [13:16:43] T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958 [13:16:44] T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926 [13:16:52] aanzx: please test [13:17:06] ok [13:17:16] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [13:18:51] taavi huwiktionary and mnwwiktionary good , nothing to test on knwiki [13:19:50] (03Merged) 10jenkins-bot: NewImpact: fix undefined log function [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938306 (https://phabricator.wikimedia.org/T341865) (owner: 10Urbanecm) [13:20:02] ok, syncing [13:20:08] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) [13:20:17] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1061.eqiad.wmnet - https://phabricator.wikimedia.org/T339199 (10Jclark-ctr) 05Open→03Resolved [13:20:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging per promise to effie and hugh." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:20:54] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1062.eqiad.wmnet - https://phabricator.wikimedia.org/T339200 (10Jclark-ctr) 05Open→03Resolved [13:21:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [13:21:15] after sync please run namespaceDupes.php for both hu , mnw wiktionary , @taavi [13:21:17] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) [13:21:23] ack [13:21:30] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: decommission analytics1063.eqiad.wmnet - https://phabricator.wikimedia.org/T339201 (10Jclark-ctr) 05Open→03Resolved [13:22:03] (03PS1) 10Ssingh: Revert "depool esams: router migration" [dns] - 10https://gerrit.wikimedia.org/r/938678 [13:22:25] (03CR) 10Ssingh: "DO NOT MERGE. Emergency patch." [dns] - 10https://gerrit.wikimedia.org/r/938678 (owner: 10Ssingh) [13:23:06] (03Merged) 10jenkins-bot: thumbor: Bye bye nutcracker! 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938001 (https://phabricator.wikimedia.org/T318695) (owner: 10Effie Mouzeli) [13:23:27] (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:24:48] (03CR) 10Marostegui: realm: Add two new private tables of CheckUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [13:24:51] (03CR) 10Marostegui: [C: 03+1] realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [13:24:57] (03PS1) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) [13:25:44] !log taavi@deploy1002 ~ $ mwscript namespaceDupes.php --wiki mnwwiktionary --fix # T341940 [13:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:47] T341940: Remains to be translated into Mon - https://phabricator.wikimedia.org/T341940 [13:26:04] 1660 links to fix, 1660 were resolvable, 0 were deleted. 
[13:26:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:39] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938677|change wgExtraNamespaces , wgNamespaceAliases for mnwwiktionary (T341940)]], [[gerrit:938315|Add appendix namespace aliases on huwiktionary (T341926)]], [[gerrit:938324|robots.txt: Disable indexing draft-related pages on knwiki (T341958)]] (duration: 19m 48s) [13:26:44] T341958: robots.txt: Disable indexing draft-related pages on knwiki - https://phabricator.wikimedia.org/T341958 [13:26:45] T341926: Add Appendix as a namespace alias on huwiktionary - https://phabricator.wikimedia.org/T341926 [13:26:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Papaul) @Dwisehaupt hello is this ok now? 
[13:27:12] (03CR) 10Kaleem Bhatti: [C: 03+1] "anyone please submit review for this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [13:27:14] (03PS25) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [13:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:27:28] !log taavi@deploy1002 Started scap: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]] [13:27:31] T341865: log is not a function. - https://phabricator.wikimedia.org/T341865 [13:27:52] !log taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php --wiki huwiktionary --fix # T341926 [13:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm I will check and let you know [13:28:52] !log taavi@deploy1002 taavi and urbanecm: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:29:01] sergi0: please test [13:29:08] testing now [13:29:30] (03CR) 10CI reject: [V: 04-1] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:29:55] Amir1: can you quickly tell what's wrong with 
namespaceDupes.php? https://phabricator.wikimedia.org/P49565 [13:30:31] have to go to meeting but that thing breaks constantly [13:30:59] sigh, it's linkmigration piece again [13:31:05] oh, that again [13:31:34] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:50] taavi: looking good from my side, the error is not present in the analytics requests anymore [13:31:52] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > from our side we will need to check if cas.authn.oidc.id-token.include-id-token-claims=true is ok to enable globally or i... [13:31:56] sergi0: ok, syncing [13:32:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [13:32:11] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting access to release-engineering for aklapper - https://phabricator.wikimedia.org/T341749 (10Aklapper) > Feel free to try ssh to these hosts now. phab1004.eqiad.wmnet is prod phab, phab-test1001.eqiad.wmnet is the test machine, phab200... 
[13:32:22] (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [13:32:29] 10SRE, 10Wikimedia-Etherpad, 10SecTeam-Processed, 10Security, 10Vuln-Infoleak: Etherpad deletion 9NXnJ9N1vJP8YuBOyY6V - https://phabricator.wikimedia.org/T341903 (10sbassett) [13:32:34] 10SRE, 10Wikimedia-Etherpad, 10SecTeam-Processed, 10Security, 10Vuln-Infoleak: Etherpad deletion 9NXnJ9N1vJP8YuBOyY6V - https://phabricator.wikimedia.org/T341903 (10sbassett) p:05Triage→03Low [13:33:15] PROBLEM - Host ms-be2057 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1072.eqiad.wmnet with OS bullseye [13:34:04] (03PS26) 10Ayounsi: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) [13:35:20] (03CR) 10Cory Massaro: [C: 03+1] wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:35:23] RECOVERY - Host ms-be2057 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [13:35:45] Amir1: Lucas_WMDE: I think the issue is that LinksMigration::getLinksConditions() won't create a new LinkTarget if none exists (instead it'll just return a query that never matches), but namespaceDupes.php expects it would [13:36:41] honestly, it shouldn't do anything for those cases, it should just reparse the page and let the logic handle it instead of redoing the logic [13:37:23] (03PS2) 10Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:37:47] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938306|NewImpact: fix undefined log function (T341865)]] (duration: 10m 19s) [13:37:51] T341865: log is not a function. 
- https://phabricator.wikimedia.org/T341865 [13:37:55] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:03] RECOVERY - Host ores2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [13:38:11] (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:38:14] 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) [13:38:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet [13:38:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [13:39:24] 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) @ayounsi @cmooney could you let DCops know which racks would be better for these boxes? Thanks! 
[13:39:51] (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:40:59] !log reimaging cp4037 as preparatory test for knams migration [13:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:09] (03PS1) 10Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) [13:42:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:29] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:42:34] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:42:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [13:42:54] manually worked around that by purging the affected page by hand [13:43:31] !log deploy removal of nutcracker from thumbor. 
T318695 [13:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:35] T318695: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 [13:43:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [13:43:47] T341993 [13:43:47] T341993: namespaceDupes.php can fail if new target does not have a linktarget entry - https://phabricator.wikimedia.org/T341993 [13:43:56] was about to say, purging the pages worked last time https://phabricator.wikimedia.org/T334277#8775922 [13:44:04] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) >>! In T297314#9019527, @Jdforrester-... [13:44:07] (03CR) 10Jbond: [C: 03+2] pki: add network devices CA [puppet] - 10https://gerrit.wikimedia.org/r/938218 (https://phabricator.wikimedia.org/T334594) (owner: 10Jbond) [13:44:24] taavi: are you done deploying now? 
[13:44:32] yes [13:44:35] ok thanks [13:44:38] then I’ll do a security fix [13:45:20] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:46:02] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:46:32] (03PS3) 10Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [13:46:41] (03PS4) 10Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [13:47:00] taavi: thank you for the assistance [13:47:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [13:47:37] (03CR) 10CI reject: [V: 04-1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [13:47:43] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [13:48:23] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice, it seems more tidy this way!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey) [13:48:36] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [13:50:30] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1072.eqiad.wmnet with OS bullseye [13:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [13:52:23] (03PS3) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 [13:53:18] (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [13:53:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1072.eqiad.wmnet with OS bullseye [13:54:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [13:54:49] !log lucaswerkmeister-wmde Deployed security patch for T340217 [13:54:55] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks for troubleshooting this more! I can confirm existing users have `cas3` in the `identities` section. This leads to a... 
[13:55:12] (03PS1) 10Jbond: tox: drop the minor version requierment on admin checks [puppet] - 10https://gerrit.wikimedia.org/r/938858 [13:55:47] * Lucas_WMDE done [13:56:00] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) [13:56:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:57] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [14:03:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [14:04:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [14:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:08:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet [14:08:23] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:27] (03PS2) 10DCausse: Link to new repo to build docker dev image [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/937949 [14:08:34] (03CR) 10DCausse: [V: 03+2 C: 03+2] Link to new repo to build docker dev image [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/937949 (owner: 10DCausse) [14:08:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:09:35] 10ops-codfw, 10Machine-Learning-Team: ManagementSSHDown - https://phabricator.wikimedia.org/T341648 (10Jhancock.wm) 05Open→03Resolved replaced idrac card and coms battery. updated idrac IP info. 
BAT0002 alert has cleared and the server is reachable by ssh [14:10:13] (03PS4) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 [14:10:15] (03PS1) 10Jforrester: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - 10https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) [14:10:16] !log start kafka partitions rebalance for main-codfw (long running maintenance, see https://phabricator.wikimedia.org/T341558) [14:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:27] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1072.eqiad.wmnet with reason: host reimage [14:11:20] (03CR) 10CI reject: [V: 04-1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [14:11:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [14:11:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [14:12:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [14:12:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [14:12:35] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:13:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1072.eqiad.wmnet with reason: host reimage [14:13:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Jclark-ctr sorry missed one. edited my previous comment with the additional to keep all the info together. 
[14:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:14:22] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:29] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10akosiaris) [14:15:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:16:12] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10akosiaris) @hnowlan @jijiki. nutcracker removal merged and deployed. 
I am gonna let you have the pleasure of resolving this task :-) [14:16:22] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [14:17:59] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [14:19:31] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) [14:20:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [14:20:31] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [14:20:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [14:20:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:20:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [14:21:58] !log klausman@puppetmaster1001 conftool action : set/pooled=no; selector: name=ores2003.codfw.wmnet [14:22:14] (03PS1) 10Filippo Giunchedi: New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) [14:22:16] (03PS1) 10Filippo Giunchedi: prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) [14:22:18] (03PS1) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868 [14:22:38] (03CR) 10CI reject: [V: 04-1] New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 
10Filippo Giunchedi) [14:24:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:24:44] (03PS2) 10Filippo Giunchedi: New role: titan [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) [14:24:46] !log klausman@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ores2003.codfw.wmnet [14:24:46] (03PS2) 10Filippo Giunchedi: prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) [14:24:49] (03PS2) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868 [14:26:00] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jcrespo) Regarding the last question, one important thing is that sometimes a recovery may need multiple backup s... 
[14:26:34] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [14:26:55] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10User-jbond: wmf-styleguide checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (10jbond) [14:27:03] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [14:27:53] 10Puppet, 10Infrastructure-Foundations: Nuyaml_backend does not allow binary Hiera data - https://phabricator.wikimedia.org/T113328 (10jbond) 05Open→03Resolved a:03jbond no update [14:28:30] 10Puppet, 10Infrastructure-Foundations, 10Wikimedia-IRC-RC-Server, 10User-jbond: Ensure puppet sends the correct ircd signals to update config and motd - https://phabricator.wikimedia.org/T284052 (10jbond) 05Open→03Resolved a:03jbond fixed with last patch [14:29:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet [14:30:43] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868 (owner: 10Filippo Giunchedi) [14:30:46] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [14:30:55] (03PS3) 10Filippo Giunchedi: hieradata: disable ferm rules from etcd in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/938868 [14:31:22] (03CR) 10Filippo Giunchedi: "Tested on pontoon-titan-01.monitoring.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [14:32:47] (03CR) 10JMeybohm: noc: add script to dump etcd db config (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 
(https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:33:27] (03CR) 10Jelto: [C: 03+2] Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. [alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: 10LSobanski) [14:33:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10netbox: Netbox missing physical device in PuppetDB when Puppet disabled for too long - https://phabricator.wikimedia.org/T254986 (10joanna_borun) [14:33:51] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Patch-For-Review: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105 (10jbond) 05Open→03Declined going to close this as declined. [[ https://ger... [14:33:58] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:34:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:38] (03Merged) 10jenkins-bot: Updated GitLabCIPipelineErrors description to match the updated threshold of 0.7. 
[alerts] - 10https://gerrit.wikimedia.org/r/938846 (https://phabricator.wikimedia.org/T341927) (owner: 10LSobanski) [14:35:15] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [14:36:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10jbond) [14:36:19] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1652 days) https://wikitech.wikimedia.org/wiki/Logs [14:36:55] !log restart rsyslog on centrallog1002 ("peer did not provide a certificate, not permitted to talk to it") [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:18] godog: --^ [14:37:22] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10jbond) [14:37:31] errors seem related to some tcp conns to prometheus1006 [14:38:16] 10Puppet, 10Infrastructure-Foundations, 10netops, 10good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (10jbond) [14:38:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [14:39:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:08] elukey: ack, thank you! 
yeah we've seen the problem from time to time with the gtls listener, haven't had a chance to dig deep yet though (and it recovers) [14:39:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [14:40:45] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) [14:41:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage [14:41:58] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond) [14:42:41] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) I think we are ready for this cloudweb2002-dev move today, assuming no IP change, just a poweroff-poweron oper... [14:50:28] 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Htriedman) @Vgutierrez this feature has been working as expected, and this ticket can be closed! [14:50:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond) [14:51:35] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10joanna_borun) 05Open→03Declined There are no specific actions we can take regarding this ticket. If additional discussion is needed, we can schedule a dedicated meeting. 
[14:52:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:52:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1072.eqiad.wmnet with OS bullseye [14:52:32] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Puppet-Core, and 3 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond) [14:52:42] (03CR) 10Klausman: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [14:53:20] (03CR) 10Klausman: [C: 03+1] ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey) [14:54:08] 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez @Htriedman awesome, Thanks for the prompt response. DP has been deployed and running happily since February 6th, 2023. 
[14:54:14] (03CR) 10Klausman: [C: 03+1] ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [14:57:40] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:57:44] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:59:23] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond) [14:59:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [15:02:18] !log dns5003 upgrade to pdns-rec 4.8.4: T341611 [15:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:22] T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 [15:02:46] PROBLEM - Host ms-be1064 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:10] RECOVERY - Host ms-be1064 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:04:08] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:04:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: in puppet 6 some core types have been moved to external modules. 
check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) 05Open→03Resolved a:03jbond This has been handled as part of the puppet7 migration [15:04:46] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye [15:04:49] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (10jbond) [15:07:53] (03CR) 10Herron: [C: 03+1] "LGTM🫰" [puppet] - 10https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [15:08:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [15:09:02] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [15:09:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [15:10:02] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Performance Issue: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812 (10jbond) 05Open→03Resolved a:03jbond [15:10:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [15:10:14] (03PS2) 10Ilias Sarantopoulos: ores: use envoy proxy for Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) [15:10:17] 10Puppet, 10Infrastructure-Foundations, 10Technical-Debt: "Setting templatedir is deprecated" warning issued on self-hosted puppetmaster - https://phabricator.wikimedia.org/T95158 (10jbond) 05Open→03Resolved a:03jbond templatedir setting is now removed 
[15:12:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:13:36] jouncebot now [15:13:36] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [15:14:01] !log dancy@deploy1002 Installing scap version "4.54.0" for 605 hosts [15:18:16] (03CR) 10EllenR: [C: 03+1] "LGTM + has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx) [15:19:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [15:22:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [15:23:33] (03CR) 10Elukey: [C: 03+2] ml-services: set knative concurrency values for ml pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/938820 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [15:24:46] (03PS3) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 [15:25:14] 10Puppet, 10Infrastructure-Foundations: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 (10jbond) [15:25:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:25:59] this is me --^ [15:26:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, and 4 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10jbond) [15:29:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [15:29:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [15:29:55] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [15:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1530). [15:30:14] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:31:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:13] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Core, 10User-jbond: puppetlabs: create puppet 7 environment in WMCS to test code - https://phabricator.wikimedia.org/T294841 (10jbond) [15:33:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [15:34:04] 10SRE, 10Observability-Alerting: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101 (10jbond) Im not sure if this is still valid however i 
have removed the puppet tag as this would be better done in the alertmanager repo now [15:35:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:36:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:37:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [15:37:46] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) server move complete [15:39:16] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10jbond) > Edit taskgen to support finding tox.ini files in each module instead of a single universal one with conditional changes filters. This is an idea... 
[15:39:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [15:40:14] (03PS1) 10Ilias Sarantopoulos: ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479) [15:40:18] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team (Radar): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) [15:40:51] 10Puppet, 10SRE, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548 (10jbond) [15:41:16] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (10Jclark-ctr) 05Open→03Resolved psu alerts have not returned closing ticket [15:41:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team (Radar): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) 05Open→03Resolved a:03jbond im going to close this, I think with the `auto` keyword this is c... 
[15:42:16] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos) [15:42:23] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: disable alerts for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/937572 (https://phabricator.wikimedia.org/T332314) (owner: 10Ryan Kemper) [15:43:11] (03Merged) 10jenkins-bot: ml-services: update ores-legacy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/938881 (https://phabricator.wikimedia.org/T341479) (owner: 10Ilias Sarantopoulos) [15:45:11] (03PS4) 10Elukey: ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 [15:45:13] (03PS1) 10Elukey: ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882 [15:45:28] (03CR) 10Elukey: [C: 03+2] ml-services: add more scaling options to model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/938850 (owner: 10Elukey) [15:45:34] (03CR) 10Cory Massaro: wikifunctions: Add AppArmor profile usage (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [15:45:39] (03PS2) 10Elukey: ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882 [15:46:39] (03PS2) 10Cory Massaro: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [15:47:35] (03CR) 10Cory Massaro: Add AppArmor configuration for the deployed function-evaluator service. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: 10Cory Massaro) [15:47:40] (03CR) 10Elukey: [C: 03+2] ml-services: fix the container concurrency setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/938882 (owner: 10Elukey) [15:48:38] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:48:57] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:49:30] !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:49:42] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) [15:49:50] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:49:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [15:49:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [15:50:05] (03CR) 10Herron: [C: 03+1] prometheus: create /etc/prometheus when needed [puppet] - 10https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [15:50:13] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:53:25] 10Puppet, 10SRE, 10Infrastructure-Foundations: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10jbond) @fgiunchedi I'm tempted to close this as invalid as i don't see any issue with having spaces in resource titles and in some cases (e.g. notify, exec) it can be desirable.... 
[15:53:41] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10jbond) [15:54:55] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond) [15:54:58] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): puppet7: drop instances of :undef in erb files - https://phabricator.wikimedia.org/T341071 (10jbond) [15:55:38] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Documentation, 10Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095 (10jbond) [15:55:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [15:56:03] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10jbond) [15:56:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [15:56:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [15:56:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Configure SRV records for new puppet infrastructure - https://phabricator.wikimedia.org/T341053 (10jbond) 05In progress→03Resolved a:03jbond [15:56:40] hmm [15:56:48] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:56:51] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - 
https://phabricator.wikimedia.org/T341374 (10jbond) p:05Triage→03High [15:57:04] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): remove puppet::expose_agent_certs from puppetdb classes - https://phabricator.wikimedia.org/T341374 (10jbond) p:05High→03Medium [15:57:11] sukhe: drmrs is munching traffic compared to its usual p95 :) [15:57:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:57:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) p:05Triage→03Medium [15:57:51] yep [15:57:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) [15:57:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:58:00] we are still fine here https://librenms.wikimedia.org/graphs/to=1689606600/id=23134/type=port_bits/from=1689520200/ [15:58:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [16:00:29] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Migrate bacula to pki.discovery.wmnet - https://phabricator.wikimedia.org/T341664 (10jbond) p:05Triage→03Medium [16:00:39] !oncall [16:02:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:04:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [16:04:45] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[16:05:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:05:58] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [16:06:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall great! And I appreciated you only bumped a patch level given we retain full backwards compatibility 😄 But maybe in this case at le" [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [16:07:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:08:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:08:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:09:12] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42504/console" [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:09:46] (03CR) 10Herron: [C: 03+1] "Thanks for this -- LGTM although still learning the details of cfssl. 
Let's try with a controlled rollout to centrallog2002 first" [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:10:55] (03PS1) 10Urbanecm: Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) [16:12:12] jouncebot: nowandnext [16:12:13] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [16:12:13] In 0 hour(s) and 47 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700) [16:12:13] In 0 hour(s) and 47 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700) [16:12:31] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10fgiunchedi) 05Open→03Invalid Totally fair to mark invalid (done) @jbond, tbh I don't remember what the issue was! [16:12:43] (03CR) 10Urbanecm: [C: 03+2] Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: 10Urbanecm) [16:12:48] !log stop kafka-main codfw maintenance - T341558 [16:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [16:14:21] (03CR) 10Elukey: "Looks great, do you mind to add a use case in the .fixtures? So we can see a diff etc.." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [16:15:41] 10SRE, 10Infrastructure-Foundations, 10Puppet CI: puppet lint check for resource names - https://phabricator.wikimedia.org/T93231 (10jhathaway) Yeah I agree we should allow those, as you mention they are sometimes useful: ` exec { '/usr/bin/cat /etc/os-release': logoutput => true } notify { "Fact ${... [16:17:20] (03PS1) 10Ilias Sarantopoulos: ml-services: fix ores-legacy app [deployment-charts] - 10https://gerrit.wikimedia.org/r/938888 [16:18:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:03] (03PS2) 10Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - 10https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) [16:22:35] (03PS1) 10EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) [16:23:44] (03PS2) 10EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) [16:24:10] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix ores-legacy app [deployment-charts] - 10https://gerrit.wikimedia.org/r/938888 (owner: 10Ilias Sarantopoulos) [16:25:13] (03Merged) 10jenkins-bot: ml-services: fix ores-legacy app [deployment-charts] - 10https://gerrit.wikimedia.org/r/938888 (owner: 10Ilias Sarantopoulos) [16:28:12] (03CR) 10Volans: [C: 03+1] "No blocker for me, just a suggestion for the status filter and a couple of inline question." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [16:28:50] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:29:18] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:29:37] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [16:29:48] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10TBurmeister) [16:30:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: 10Urbanecm) [16:30:52] (03CR) 10Giuseppe Lavagetto: confd: allow running multiple instances (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [16:31:13] (03PS10) 10Giuseppe Lavagetto: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) [16:31:47] (03CR) 10Cory Massaro: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [16:32:57] (03Merged) 10jenkins-bot: Fix UserDatabaseHelper::hasMainspaceEdits [extensions/GrowthExperiments] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938680 (https://phabricator.wikimedia.org/T341994) (owner: 10Urbanecm) [16:33:13] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]] [16:33:17] T341994: New version of Special:Impact returns "0 edits so far" at some wikis even when 
edits have been made - https://phabricator.wikimedia.org/T341994 [16:33:20] (03CR) 10Herron: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:34:40] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:34:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42506/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [16:35:07] (03PS1) 10Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) [16:35:10] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [16:35:30] (03CR) 10CI reject: [V: 04-1] search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [16:35:53] (03CR) 10Herron: [C: 03+1] "sounds good -- I'll try merging this tomorrow morning eastern, time permitting" [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [16:35:56] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [16:36:02] (03CR) 10Volans: [C: 03+1] "tested locally" [software/homer] - 10https://gerrit.wikimedia.org/r/922485 (owner: 10Ayounsi) [16:36:24] (03CR) 10Herron: [C: 03+1] Logstash: implement availability SLO [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: 10Cwhite) [16:38:16] (03CR) 10DCausse: search-zk: Provision hostnames for new ZK cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [16:41:56] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:938680|Fix UserDatabaseHelper::hasMainspaceEdits (T341994)]] (duration: 08m 43s) [16:42:02] * urbanecm done [16:42:04] T341994: New version of Special:Impact returns "0 edits so far" at some wikis even when edits have been made - https://phabricator.wikimedia.org/T341994 [16:44:44] (03CR) 10Volans: [C: 04-1] "I'm ok with the approach, left a comment for a couple of errors inline, I didn't review it in all details yet, so not sure it does exactly" [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [16:48:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr) @Andrew usually we use the raid controller to configure os drives. I do not know if our Os install would recognize the correct drives... 
[16:50:35] (03PS2) 10Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) [16:50:58] (03CR) 10CI reject: [V: 04-1] search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [16:51:16] (03PS3) 10Bking: search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) [16:51:41] (03CR) 10CI reject: [V: 04-1] search-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [16:55:56] (03CR) 10Bking: search-zk: Provision hostnames for new ZK cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [16:56:27] (03PS4) 10Bking: flink-zk: Provision hostnames for new ZK cluster [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) [16:58:01] (03PS1) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) [16:58:03] (03PS1) 10Jbond: kerberos: fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) [16:58:05] (03PS1) 10Jbond: kerberos: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) [16:58:07] (03PS1) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) [16:58:09] (03PS1) 10Jbond: install_server: updaate to use bash [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [16:58:11] (03PS1) 10Jbond: kubeadm: the use of read -p suggest this should be 
using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) [16:59:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Papaul) @Jclark-ctr @Andrew even with the SW raid you still need the controller to be able to see the drives. [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700) [17:00:04] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T1700). [17:02:55] (03CR) 10Bking: flink-zk: Provision hostnames for new ZK cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [17:18:08] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [17:19:58] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 50s) [17:23:57] (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/938902 [17:30:02] (03PS19) 10Arturo Borrero Gonzalez: nftables: spec: introduce service tests [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) [17:30:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nftables: spec: introduce service tests (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: 10Arturo Borrero Gonzalez) [17:31:25] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [17:31:36] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42507/console" [puppet] - 10https://gerrit.wikimedia.org/r/938902 (owner: 10Fabfur) 
[17:32:24] (CR) Ebernhardson: [C: +1] flink-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[17:33:57] (CR) Bking: [C: +2] flink-zk: Provision hostnames for new ZK cluster [puppet] - https://gerrit.wikimedia.org/r/938890 (https://phabricator.wikimedia.org/T341705) (owner: Bking)
[17:34:06] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 02m 41s)
[17:52:43] (CR) ArielGlenn: make sure job watcher and exception checker do not run on spare NFS dumps shares (1 comment) [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[17:52:52] (PS4) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:03:25] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:19:59] (PS1) Ssingh: dnsrecursor: remove redundant parameter install_from_component [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611)
[18:20:24] (PS5) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:20:32] (CR) Jbond: [C: +1] sre.hosts.decommission: fix call to downtime [cookbooks] - https://gerrit.wikimedia.org/r/937508 (owner: Volans)
[18:20:57] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42509/console" [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[18:22:36] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[18:23:07] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/937509 (owner: Volans)
[18:23:44] (CR) Andrea Denisse: [C: +1] prometheus: create /etc/prometheus when needed [puppet] - https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[18:25:11] SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (Dwisehaupt) @MoritzMuehlenhoff I was granted the ability to do the authdns-update in T244901. We are part of the `fr-tech-admins` group that I believ...
[18:25:28] (CR) Ssingh: [V: +1 C: +2] dnsrecursor: remove redundant parameter install_from_component [puppet] - https://gerrit.wikimedia.org/r/938913 (https://phabricator.wikimedia.org/T341611) (owner: Ssingh)
[18:27:09] (CR) Ryan Kemper: [C: +2] Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[18:27:21] (CR) Ryan Kemper: [V: +2 C: +2] Dashboard for wdqs update lag [grafana-grizzly] - https://gerrit.wikimedia.org/r/933172 (https://phabricator.wikimedia.org/T324811) (owner: Ryan Kemper)
[18:29:55] SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (Dwisehaupt) >>! In T341440#9020951, @Dwisehaupt wrote: > @MoritzMuehlenhoff I was granted the ability to do the authdns-update in T244901. We are par...
[18:30:13] (PS6) ArielGlenn: make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232)
[18:30:52] (PS2) Jbond: tox: drop the minor version requierment on admin checks [puppet] - https://gerrit.wikimedia.org/r/938858
[18:39:28] (CR) Jbond: "lgtm but we should remove the debug of cfssl_cmd.stdout" [cookbooks] - https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: Ayounsi)
[18:41:07] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt) @Papaul Sorry for the delay, I was out last week. This appears to have fixed it up and the host is starting to build. Thanks!
[18:44:14] (PS1) Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[18:44:29] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: Arturo Borrero Gonzalez)
[18:45:25] (CR) Jbond: [C: +1] "lgtm" [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[18:47:19] (CR) Jbond: [C: +2] rsyslog::receiver: update docs and add types [puppet] - https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:48:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:50:50] (PS2) Jbond: profile::cassandra: Add spec test [puppet] - https://gerrit.wikimedia.org/r/937979
[18:50:58] (CR) Jbond: [C: +2] monkey_patch: fix up monkey patch [puppet] - https://gerrit.wikimedia.org/r/937978 (owner: Jbond)
[18:52:42] (CR) Jbond: [C: +2] profile::cassandra: Add spec test [puppet] - https://gerrit.wikimedia.org/r/937979 (owner: Jbond)
[18:57:06] (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/937952 (owner: Hashar)
[18:57:10] (CR) Jbond: [C: +2] Rakefile: add tasks to run a global shellcheck [puppet] - https://gerrit.wikimedia.org/r/937952 (owner: Hashar)
[18:58:00] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Papaul) @Dwisehaupt you welcome
[18:58:26] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[18:58:28] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Papaul)
[18:58:56] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:59:51] (CR) Jbond: [C: +2] tox: drop the minor version requierment on admin checks [puppet] - https://gerrit.wikimedia.org/r/938858 (owner: Jbond)
[19:05:58] Puppet, Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (Jdforrester-WMF)
[19:06:32] Puppet, Beta-Cluster-Infrastructure: Puppet failure on Beta Cluster role::beta::docker_services boxes - https://phabricator.wikimedia.org/T342038 (Jdforrester-WMF)
[19:16:12] (PS5) Jdlrobson: Deploy new logos [mediawiki-config] - https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[19:16:51] (CR) Jdlrobson: [C: +1] bnwikiquote: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[19:33:07] (PS1) Eevans: cassandra: transition 3.11.14 from 'dev' to '3.x' [puppet] - https://gerrit.wikimedia.org/r/938917
[19:36:17] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938917 (owner: Eevans)
[19:37:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:38:43] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint
[19:42:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:42:11] PROBLEM - AQS root url on aqs1010 is CRITICAL: connect to address 10.64.0.40 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:49:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:54:02] (CR) Eevans: [C: +2] cassandra: uninstall cassandra-twcs deployment repository [puppet] - https://gerrit.wikimedia.org/r/937528 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[19:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:58:54] (PS1) Eevans: Revert "cassandra: uninstall cassandra-twcs deployment repository" [puppet] - https://gerrit.wikimedia.org/r/938681
[19:59:37] (CR) Eevans: [C: +2] Revert "cassandra: uninstall cassandra-twcs deployment repository" [puppet] - https://gerrit.wikimedia.org/r/938681 (owner: Eevans)
[20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2000).
[20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:38] o/
[20:02:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:03:26] hey. I can deploy
[20:03:35] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:43] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1001.eqiad.wmnet
[20:03:44] !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:04:24] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[20:05:06] (Merged) jenkins-bot: bnwikiquote: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/938351 (https://phabricator.wikimedia.org/T341910) (owner: Stang)
[20:05:22] !log taavi@deploy1002 Started scap: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]]
[20:05:27] T341910: Update Bengali wikiquote wordmark - https://phabricator.wikimedia.org/T341910
[20:06:45] !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:06:49] koi: please test
[20:06:54] looking
[20:07:51] taavi, tested on vector-2022 and LGTM
[20:07:54] syncing
[20:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:13:57] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:938351|bnwikiquote: Update wordmark (T341910)]] (duration: 08m 34s)
[20:14:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:14:01] T341910: Update Bengali wikiquote wordmark - https://phabricator.wikimedia.org/T341910
[20:17:29] hi taavi, could you please purge "static/images/mobile/copyright/wikiquote-wordmark-bn.svg"? thx
[20:17:44] (PS2) Cory Massaro: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[20:17:46] (PS5) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:18:13] koi: I think {{done}}, could you double-check?
[20:18:26] (CR) CI reject: [V: -1] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:18:41] it's done now :)
[20:19:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:19:05] !log taavi@mwmaint1002 ~ $ echo "https://en.wikipedia.org/static/images/mobile/copyright/wikiquote-wordmark-bn.svg" | mwscript purgeList.php --wiki enwiki
[20:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:07] awesome
[20:21:38] (PS6) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:26:32] (PS7) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:26:40] (CR) Cory Massaro: wikifunctions: Attempt to write out our main config as JSON (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:34:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:37:57] (CR) Jforrester: wikifunctions: Attempt to write out our main config as JSON (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[20:43:30] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:51:06] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[20:59:43] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[20:59:43] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:59:44] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors
[20:59:47] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors
[21:00:06] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2100).
[21:00:12] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[21:00:58] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[21:01:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:01:39] SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:05:56] (PS1) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:07:01] (PS2) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:07:18] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:07:55] (CR) Cory Massaro: [C: +2] wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[21:08:42] (CR) Cory Massaro: [C: +2] wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[21:09:30] (Merged) jenkins-bot: wikifunctions: Set evaluator local URLs per T297314#9019664 [deployment-charts] - https://gerrit.wikimedia.org/r/938861 (https://phabricator.wikimedia.org/T297314) (owner: Jforrester)
[21:09:33] (Merged) jenkins-bot: wikifunctions: Attempt to write out our main config as JSON [deployment-charts] - https://gerrit.wikimedia.org/r/938300 (owner: Jforrester)
[21:09:55] jouncebot nowandnext
[21:09:55] For the next 1 hour(s) and 50 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230717T2100)
[21:09:55] In 4 hour(s) and 50 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0200)
[21:15:12] (CR) JHathaway: [C: +1] "looks good, any reason for the inconsistency in using brackets around variables in string interpolation? I would probably just always use " [puppet] - https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:15:47] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[21:16:35] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 47s)
[21:17:06] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:21:00] (CR) JHathaway: [C: +1] "looks good other than one minor issue" [puppet] - https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[21:32:05] SRE, Traffic: Investigate why Traffic SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (BCornwall)
[21:39:51] (PS3) Ahmon Dancy: Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220)
[21:40:11] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:42:05] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/938931/2062/" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:43:04] (PS1) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:43:28] (CR) CI reject: [V: -1] Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:44:51] (PS2) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:47:11] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:50:33] (PS3) Ahmon Dancy: Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220)
[21:52:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1001.eqiad.wmnet with OS bookworm
[21:52:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1001.eqiad.wmnet
[21:53:04] SRE, Data-Platform-SRE, vm-requests, Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm executed w...
[21:53:29] RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[21:53:30] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:54:11] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:29] !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint (duration: 136m 46s)
[21:55:31] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/938939/2064/" [puppet] - https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[21:55:32] !log btullis@deploy1002 Started deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint
[21:57:42] !log btullis@deploy1002 Finished deploy [analytics/aqs/deploy@91f8d92] (aqs-next): Deploying new AQS endpoint (duration: 02m 10s)
[22:00:31] (PS1) Jdlrobson: Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945
[22:04:22] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:04:32] (CR) JHathaway: [C: -1] install_server: updaate to use bash (3 comments) [puppet] - https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[22:18:23] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:35:29] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt)
[22:46:37] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Dwisehaupt) Open→Resolved a:Dwisehaupt Host is installed and has a base config. Further work will be tracked in T342064.
[23:05:12] (CR) Cwhite: [C: +1] "Looks ok per PCC." [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[23:05:26] (PS4) Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219)
[23:05:28] (PS6) Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945)
[23:05:30] (PS6) Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[23:05:32] (PS5) Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[23:05:48] (CR) Cwhite: [C: +1] New role: titan [puppet] - https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[23:06:25] (CR) Cwhite: [C: +1] prometheus: create /etc/prometheus when needed [puppet] - https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[23:10:01] (PS2) Cwhite: Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945 (owner: Jdlrobson)
[23:19:52] (CR) Cwhite: [C: +2] Limit client error alerts to "unknown" channel [puppet] - https://gerrit.wikimedia.org/r/938945 (owner: Jdlrobson)
[23:24:48] (CR) Gergő Tisza: [C: +1] IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[23:25:32] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:25:54] (CR) Gergő Tisza: [C: +1] IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[23:26:01] (PS2) Cwhite: logstash: remove thanos log cloning [puppet] - https://gerrit.wikimedia.org/r/937604 (https://phabricator.wikimedia.org/T234565)
[23:27:06] (PS2) Cwhite: logstash: remove grafana log cloning [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565)
[23:27:14] (CR) Krinkle: [C: +1] mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[23:27:18] (PS2) Cwhite: logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565)
[23:27:34] (PS2) Cwhite: logstash: remove haproxy log cloning [puppet] - https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565)
[23:35:58] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:37:56] (CR) Andrea Denisse: [C: +1] "LGTM!!" [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:38:33] (CR) Andrea Denisse: [C: +1] logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[23:38:53] (CR) Andrea Denisse: [C: +1] logstash: remove pybal log cloning [puppet] - https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)