[00:04:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [00:04:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [00:10:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [00:10:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [00:39:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928915 [00:39:29] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928915 (owner: 10TrainBranchBot) [00:59:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/928915 (owner: 10TrainBranchBot) [01:03:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T338915 (10phaultfinder) [01:17:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:18:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0200) [02:00:29] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:39] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:51] RECOVERY - Check systemd state on analytics1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.13 [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/928916 (https://phabricator.wikimedia.org/T337527) [02:07:49] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.13 [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/928916 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [02:26:47] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.13 [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/928916 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [02:27:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0300) [03:01:20] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929434 (https://phabricator.wikimedia.org/T337527) [03:01:22] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929434 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [03:02:11] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929434 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [03:02:42] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.13 refs T337527 [03:02:46] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [03:16:25] RECOVERY - Check systemd state on db1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:09] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.13 refs T337527 (duration: 49m 27s) [03:52:13] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [03:54:24] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.11 (duration: 02m 13s) [04:13:34] 10ops-eqiad: Inbound interface errors - 
https://phabricator.wikimedia.org/T338333 (10phaultfinder) [04:27:13] RECOVERY - dump of db_inventory in eqiad on backupmon1001 is OK: Last dump for db_inventory at eqiad (db1215) taken on 2023-06-13 04:05:37 (90 KiB, -4.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:01:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:28:48] (03PS1) 10KartikMistry: Update MinT to 2023-06-12-125157-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929439 (https://phabricator.wikimedia.org/T337656) [05:31:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:33:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:35:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:48:40] !log dbmaint Deploy schema change on x1 eqiad with replication T337940 [05:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:44] T337940: Drop campaign_events.event_tracking_tool_id and campaign_events.event_tracking_tool_event_id - https://phabricator.wikimedia.org/T337940 [05:50:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:52:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:53:51] (03PS1) 10Marostegui: filtered_tables.txt: Remove event_tracking_tool_* [puppet] - 10https://gerrit.wikimedia.org/r/929604 (https://phabricator.wikimedia.org/T337940) [05:54:32] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Remove event_tracking_tool_* [puppet] - 10https://gerrit.wikimedia.org/r/929604 (https://phabricator.wikimedia.org/T337940) (owner: 10Marostegui) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0600) [06:00:05] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0600) [06:08:48] (03PS2) 10KartikMistry: Update cxserver to 2023-06-13-054849-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929036 (https://phabricator.wikimedia.org/T338123) [06:15:00] marostegui: safe to deploy cxserver/MinT? [06:15:08] kart_: yep [06:15:30] cool. Thanks! 
[06:15:59] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-06-13-054849-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929036 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [06:16:58] (03Merged) 10jenkins-bot: Update cxserver to 2023-06-13-054849-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929036 (https://phabricator.wikimedia.org/T338123) (owner: 10KartikMistry) [06:17:41] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:18:02] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:26:20] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:26:55] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:27:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:38:27] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:38:44] (03PS2) 10Alexandros Kosiaris: shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354 (https://phabricator.wikimedia.org/T292633) (owner: 10Effie Mouzeli) [06:39:11] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:40:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/929354 (https://phabricator.wikimedia.org/T292633) (owner: 10Effie Mouzeli) [06:40:08] (03PS2) 10KartikMistry: Update MinT to 2023-06-13-061519-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929439 (https://phabricator.wikimedia.org/T337656) [06:41:15] !log Updated cxserver to 2023-06-13-054849-production (T338123, T338146, T337834) [06:41:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:21] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834 [06:41:21] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [06:41:21] T338146: Enable MinT for some languages with potential issues with current translation services - https://phabricator.wikimedia.org/T338146 [06:46:48] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-06-13-061519-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929439 (https://phabricator.wikimedia.org/T337656) (owner: 10KartikMistry) [06:47:48] (03Merged) 10jenkins-bot: Update MinT to 2023-06-13-061519-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/929439 (https://phabricator.wikimedia.org/T337656) (owner: 10KartikMistry) [06:48:34] (HelmReleaseBadStatus) firing: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:48:37] (03PS1) 10Elukey: Move varnishkafka instances on cp4037 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929619 (https://phabricator.wikimedia.org/T337825) [06:48:52] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:50:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41682/console" [puppet] - 10https://gerrit.wikimedia.org/r/929619 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [06:51:14] !log kartik@deploy1002 helmfile [staging] 
DONE helmfile.d/services/machinetranslation: apply [06:53:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6004.drmrs.wmnet [06:53:34] (HelmReleaseBadStatus) resolved: Helm release machinetranslation/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=machinetranslation - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:54:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6012.drmrs.wmnet [06:55:05] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:55:09] !log rebooting cp6004.drmrs.wmnet and cp6012.drmrs.wmnet for upgrade [06:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:42] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [06:55:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [06:56:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [06:57:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:57:48] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move varnishkafka instances on cp4037 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929619 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [06:58:24] (03CR) 10Hashar: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [06:58:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:17] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:59:53] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:00:05] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:45] (03CR) 10Hashar: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [07:02:18] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6004.drmrs.wmnet [07:03:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6012.drmrs.wmnet [07:03:49] (03CR) 10Hashar: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [07:04:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow3002.esams.wmnet with OS bookworm [07:05:01] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow3002.esams.wmnet with OS bookworm [07:06:23] (03CR) 10Hashar: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [07:08:09] My MinT deployment still on.. 
[07:08:34] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:08:43] Now done :) [07:08:44] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [07:09:24] !log Updated MinT to 2023-06-13-061519-production (T337656, T334465) [07:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:28] T337656: Explore using IndicTrans2 - better model supporting 22 Indic languages - https://phabricator.wikimedia.org/T337656 [07:09:29] T334465: MinT: Detect language of source content automatically - https://phabricator.wikimedia.org/T334465 [07:10:08] !log move varnishkafka instances on cp4037 to PKI TLS certs - T337825 [07:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:12] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 [07:11:56] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) All vk instances running on cp4037, next steps: 1) Monitor cp4037 to verify that nothing explodes. 2) Extend the change to ulsfo and monitor. 3) Extend the change to a... [07:12:03] (03PS1) 10KartikMistry: testwiki: Enable Section Translation for 3 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929621 (https://phabricator.wikimedia.org/T338123) [07:12:27] (03CR) 10Hashar: "Sorry for the spam. All changes to homer were broken in CI cause of an overlap in the repository name with operations/software/homer/publi" [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [07:13:30] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10MoritzMuehlenhoff) >>! In T338468#8917166, @SLyngshede-WMF wrote: > User have been added to the LDAP NDA group, we're holding off processing the rest until after... 
[07:14:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/929338 (owner: 10Slyngshede) [07:21:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [07:23:11] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6013.drmrs.wmnet [07:23:11] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6005.drmrs.wmnet [07:23:43] !log rebooting cp6005.drmrs.wmnet and cp6013.drmrs.wmnet for upgrade [07:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:55] (03PS1) 10Slyngshede: data.yaml: Add user superpes to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/929623 (https://phabricator.wikimedia.org/T338468) [07:26:38] (03CR) 10CI reject: [V: 04-1] data.yaml: Add user superpes to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/929623 (https://phabricator.wikimedia.org/T338468) (owner: 10Slyngshede) [07:26:46] (03PS2) 10Slyngshede: data.yaml: Add user superpes to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/929623 (https://phabricator.wikimedia.org/T338468) [07:27:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow3002.esams.wmnet with reason: host reimage [07:32:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6013.drmrs.wmnet [07:32:05] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6005.drmrs.wmnet [07:32:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow3002.esams.wmnet with reason: host reimage [07:34:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Wikimedia account link, clear banner on linking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/929338 (owner: 10Slyngshede) [07:35:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10SLyngshede-WMF) [07:37:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove netflow2002 from Kafka config [puppet] - 10https://gerrit.wikimedia.org/r/929340 (https://phabricator.wikimedia.org/T330884) (owner: 10Muehlenhoff) [07:38:37] (03PS2) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [07:40:53] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:42:08] (03PS3) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [07:43:46] (03PS1) 10Fabfur: admin: Add second ssh key for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/929625 [07:44:25] (03PS2) 10Fabfur: admin: Add second ssh key for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/929625 [07:48:33] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [07:51:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [07:52:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6006.drmrs.wmnet [07:52:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6014.drmrs.wmnet [07:53:00] !log reboot cp6006.drmrs.wmnet and cp6014.drmrs.wmnet for kernel upgrade (T335835) [07:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:21] (03CR) 10Jcrespo: [C: 
03+1] "Wikibooks now fixed." [puppet] - 10https://gerrit.wikimedia.org/r/928595 (owner: 10Clément Goubert) [07:55:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (syntax- and procedure-wise)" [puppet] - 10https://gerrit.wikimedia.org/r/929625 (owner: 10Fabfur) [07:55:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41683/console" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [07:55:38] (03CR) 10Jcrespo: [C: 03+2] check_legal_html: Update for cc-by-sa 4.0 [puppet] - 10https://gerrit.wikimedia.org/r/928595 (owner: 10Clément Goubert) [07:56:03] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Provide a custom error message for plaintext requests [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) (owner: 10Majavah) [07:56:17] Am I too late for a backport to wmf.13? [07:56:43] (nevermind, I'll leave it for another time.) [07:58:17] (03CR) 10Fabfur: [C: 03+2] admin: Add second ssh key for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/929625 (owner: 10Fabfur) [08:00:05] jnuche and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T0800). 
[08:00:52] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6014.drmrs.wmnet [08:00:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6006.drmrs.wmnet [08:02:39] PROBLEM - Host asw1-b12-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:02:47] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:02:49] PROBLEM - Host cr1-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:02:49] PROBLEM - Host cr2-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:02:51] PROBLEM - Host asw1-b13-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:03:07] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:03:25] XioNoX: ^ mgmt network related? [08:03:46] 10SRE, 10Traffic, 10Patch-For-Review: Confusing error message when making plaintext HTTP POST requests to Wikimedia sites - https://phabricator.wikimedia.org/T338481 (10Vgutierrez) 05Open→03Resolved a:03taavi [08:03:53] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 32, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:04:57] PROBLEM - Host scs-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:05:59] RECOVERY - Ensure legal html en.wp on en.wikipedia.org is OK: All legal html excerpts are present for https://en.wikipedia.org/wiki/Main_Page (desktop site): copyright, terms, privacy, trademark https://wikitech.wikimedia.org/wiki/Check_legal_html [08:05:59] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: All legal html excerpts are present for https://en.m.wikipedia.org/wiki/Main_Page (mobile site): copyright, terms, privacy https://wikitech.wikimedia.org/wiki/Check_legal_html [08:06:55] marostegui: yeah looks like it [08:07:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:08:02] (03PS4) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:10:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow3002.esams.wmnet with OS bookworm [08:10:57] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow3002.esams.wmnet with OS bookworm completed: - netflow3002 (**WARN**) -... [08:11:40] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41684/console" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:12:31] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929629 (https://phabricator.wikimedia.org/T337527) [08:12:33] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929629 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [08:12:54] (03CR) 10Urbanecm: shellbox: Add service mesh envoy retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929354 (https://phabricator.wikimedia.org/T292633) (owner: 10Effie Mouzeli) [08:13:20] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929629 (https://phabricator.wikimedia.org/T337527) (owner: 10TrainBranchBot) [08:14:12] marostegui: more exactly we lost 1 power feed [08:14:19] in both racks [08:16:29] XioNoX: ouch [08:16:40] it's a 
planned maintenance [08:17:00] Time Start: 13 June 2023 09:00 Local time [08:17:00] Time End: 13 June 2023 18:00 Local time [08:17:06] (03PS5) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [08:17:45] but no email sent to the maint-announce list [08:19:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow1002.eqiad.wmnet with OS bookworm [08:19:51] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow1002.eqiad.wmnet with OS bookworm [08:20:17] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.13 refs T337527 [08:20:20] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [08:22:15] (03CR) 10Slyngshede: "We would like scripts not to be templates, as this prevents CI tooling from working correctly on e.g. Python scripts." 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:22:19] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6015.drmrs.wmnet [08:22:19] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6007.drmrs.wmnet [08:22:45] l!og reboot cp6007 and cp6015 for kernel upgrade (T335835) [08:22:58] !log reboot cp6007 and cp6015 for kernel upgrade (T335835) [08:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:03] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.67 ms [08:23:05] RECOVERY - Host asw1-b12-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.12 ms [08:23:05] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.53 ms [08:23:23] RECOVERY - Host asw1-b13-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.57 ms [08:24:19] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:24:21] I opened a ticket with interxion so they add maint-announce@wikimedia.org [08:24:35] RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.57 ms [08:24:35] RECOVERY - Host cr2-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.51 ms [08:25:39] !log cleaning up prometheus-https service from IPVS on lvs2014 - T326657 [08:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:43] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [08:26:55] RECOVERY - Host scs-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.64 ms [08:27:04] (03CR) 10Ayounsi: [C: 03+2] Add report-zero-oif-gw-on-discard for netflow [homer/public] - 10https://gerrit.wikimedia.org/r/899504 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi) [08:27:34] 
(JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:27:43] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10Gehel) [08:28:08] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (10Gehel) [08:28:48] (03CR) 10Jbond: idp: add gitlab-replicas to gitlab_oidc config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:28:55] 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) DC-ops: By any chance do you have a compatible spare memory stick to substitute de above failed module? 
[08:29:10] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:30:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6007.drmrs.wmnet [08:30:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6015.drmrs.wmnet [08:30:55] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10jcrespo) [08:31:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow1002.eqiad.wmnet with reason: host reimage [08:35:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1002.eqiad.wmnet with reason: host reimage [08:36:08] (03PS1) 10Kosta Harlan: Section images: Fix image placeholder alignment for RTL content [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929464 (https://phabricator.wikimedia.org/T338837) [08:36:30] (03CR) 10Ayounsi: [C: 03+2] nfacctd: export next-hop IP and outbound interface [puppet] - 10https://gerrit.wikimedia.org/r/899516 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi) [08:36:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:37:09] (03PS1) 10Jcrespo: mariadb: Disable notifications for db1139 after crash [puppet] - 10https://gerrit.wikimedia.org/r/929632 (https://phabricator.wikimedia.org/T338766) [08:38:35] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > After fixing the redirect_uri I'm able to login successfully to the admin interface (https://gitlab.wikimedia.org/admin) using... 
[08:41:38] (03CR) 10Jcrespo: Revert "bacula: Ignore the backup check of contint1001 jobs" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [08:43:56] (03PS1) 10Winston Sung: Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) [08:44:23] (03PS2) 10Winston Sung: Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) [08:44:44] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable notifications for db1139 after crash [puppet] - 10https://gerrit.wikimedia.org/r/929632 (https://phabricator.wikimedia.org/T338766) (owner: 10Jcrespo) [08:45:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [08:49:05] (03PS4) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [08:49:38] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudservices2004-dev [08:51:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:52:06] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:52:47] (03CR) 10Jbond: [C: 03+2] tlsproxy::localssl: drop class [puppet] - 10https://gerrit.wikimedia.org/r/929383 (https://phabricator.wikimedia.org/T191393) (owner: 10Jbond) [08:52:49] (03PS5) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 
(https://phabricator.wikimedia.org/T336497) [08:53:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Tracking-Neverending: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388 (10jbond) [08:53:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393 (10jbond) 05Open→03Resolved a:03jbond This class has now been removed [08:54:31] (03PS2) 10Jelto: idp: add gitlab-replicas and gitlab_replica_oidc config [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) [08:54:33] (03PS6) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [08:54:48] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:00] (03CR) 10Jelto: idp: add gitlab-replicas and gitlab_replica_oidc config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:59:29] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [09:03:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow1002.eqiad.wmnet with OS bookworm [09:03:23] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6016.drmrs.wmnet [09:03:24] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow1002.eqiad.wmnet with OS bookworm completed: - netflow1002 (**PASS**) -... 
[09:03:24] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp6008.drmrs.wmnet [09:03:41] !log reboot cp6008 and cp6016 for kernel upgrade (T335835) [09:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:07:16] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [09:07:49] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) `name=IPv4,lang=json { "event_type": "purge", "tag2": 1, "as_src": 48551, "as_dst": 0, "comms": "", "as_path": ""... [09:08:34] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [09:08:34] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:08:35] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices2004-dev [09:08:45] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: `cloudservices2004-dev`... 
[09:10:11] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:10:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:12:17] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6016.drmrs.wmnet [09:12:20] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) [09:12:24] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp6008.drmrs.wmnet [09:12:30] (03CR) 10Kamila Součková: [C: 03+1] poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:13:22] (03CR) 10Hnowlan: [C: 03+2] Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:14:55] (03Merged) 10jenkins-bot: Thumbor: deploy various poolcounter fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/929392 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:16:35] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) [09:16:38] 10SRE-swift-storage: Bring ms-be107[2-5] into the rings - https://phabricator.wikimedia.org/T335279 (10MatthewVernon) 05Open→03Resolved These nodes are now fully loaded into the rings. 
[09:18:01] (03PS1) 10MVernon: swift: remove ms-be104[0-3] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/929642 (https://phabricator.wikimedia.org/T335281) [09:18:30] 10SRE-swift-storage, 10Patch-For-Review: Drain and then decommission ms-be10[40-43] - https://phabricator.wikimedia.org/T335281 (10MatthewVernon) These are all at 0 weight in the rings, so can now be removed from the rings ready to decommission. [09:19:40] (03CR) 10Muehlenhoff: "All netflow* hosts are on 1.2.4 now" [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [09:20:04] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:20:21] (03CR) 10Marostegui: [C: 03+1] swift: remove ms-be104[0-3] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/929642 (https://phabricator.wikimedia.org/T335281) (owner: 10MVernon) [09:20:35] (03PS1) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:22:09] (03CR) 10MVernon: [C: 03+2] swift: remove ms-be104[0-3] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/929642 (https://phabricator.wikimedia.org/T335281) (owner: 10MVernon) [09:22:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2180 to upgrade to 10.6.14 T338918', diff saved to https://phabricator.wikimedia.org/P49403 and previous config saved to /var/cache/conftool/dbconfig/20230613-092208-root.json [09:22:13] T338918: Compile and package 10.6.14 - https://phabricator.wikimedia.org/T338918 [09:22:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [09:22:52] (03CR) 10CI reject: [V: 04-1] C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:23:00] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2004-dev: cleanup hieradata [puppet] - 10https://gerrit.wikimedia.org/r/929644 (https://phabricator.wikimedia.org/T338778) [09:23:07] (03PS6) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [09:23:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4045.ulsfo.wmnet [09:23:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet [09:23:32] (03CR) 10Ayounsi: Fastnetmon: enable Prometheus exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [09:23:34] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:23:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2004-dev: cleanup hieradata [puppet] - 10https://gerrit.wikimedia.org/r/929644 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [09:24:00] !log reboot cp4037 and cp4045 for kernel upgrade (T335835) [09:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:28] (03CR) 10Slyngshede: P:hive::client move beeline script to files. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:26:01] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: upload.wikimedia.beta.wmflabs.org certificate expired (May 2023) - https://phabricator.wikimedia.org/T337642 (10AlexisJazz) 05Open→03Resolved a:03AlexisJazz Dunno how but it works again. 
[09:26:06] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) [09:26:14] (03PS7) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [09:26:18] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [09:26:41] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) 05Open→03Resolved [09:26:54] (03PS1) 10Muehlenhoff: Remove LDAP access for Bishop Fox contractors [puppet] - 10https://gerrit.wikimedia.org/r/929645 (https://phabricator.wikimedia.org/T336357) [09:29:53] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:53] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [09:30:36] (03CR) 10Ayounsi: [C: 03+2] Fastnetmon: enable Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [09:32:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:33:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4045.ulsfo.wmnet [09:33:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cp4037.ulsfo.wmnet [09:33:44] (03PS1) 10Cathal Mooney: Disable multihop BGP for cloud hosts connected directly to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/929666 (https://phabricator.wikimedia.org/T324992) [09:34:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for Bishop Fox contractors [puppet] - 10https://gerrit.wikimedia.org/r/929645 (https://phabricator.wikimedia.org/T336357) (owner: 10Muehlenhoff) [09:38:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Disable multihop BGP for cloud hosts connected directly to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/929666 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [09:38:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/929666 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [09:38:39] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [09:38:52] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [09:39:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [09:40:11] (03CR) 10Cathal Mooney: [C: 03+2] Disable multihop BGP for cloud hosts connected directly to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/929666 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [09:41:44] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [09:42:51] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [09:45:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [09:45:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49404 and previous 
config saved to /var/cache/conftool/dbconfig/20230613-094538-root.json [09:48:05] (03PS1) 10Jelto: miscweb: add transparencyreport release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929667 (https://phabricator.wikimedia.org/T338781) [09:48:17] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [09:48:57] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms [09:49:58] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:51:17] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.25 ms [09:51:17] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 301.22 ms [09:52:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:16] (03CR) 10Hnowlan: [C: 03+2] poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:55:46] (03PS3) 10Hnowlan: service: move rest-gateway to production [puppet] - 10https://gerrit.wikimedia.org/r/920667 (https://phabricator.wikimedia.org/T329049) [09:58:36] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts netflow2002.codfw.wmnet [09:58:46] (03PS1) 10Ladsgroup: Add 2023/drop_revision_comment_temp_T338284.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1000) [10:00:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 2%: Repooling after maintenance', diff saved to 
https://phabricator.wikimedia.org/P49405 and previous config saved to /var/cache/conftool/dbconfig/20230613-100043-root.json [10:02:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:03:29] (03PS1) 10Ladsgroup: Set medium wikis to read new for externallinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929670 (https://phabricator.wikimedia.org/T335343) [10:03:59] (03Merged) 10jenkins-bot: poolcounter: use per-format throttling key [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/929394 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:04:43] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:06:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:06:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow2002.codfw.wmnet [10:07:01] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `netflow2002.codfw.wmnet` - netflow2002.codfw.wmnet (**PASS**) - Downtimed host... 
[10:07:15] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:07:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4046.ulsfo.wmnet [10:07:43] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4038.ulsfo.wmnet [10:07:48] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) [10:07:55] !log reboot cp4038 and cp4046 for kernel upgrade (T335835) [10:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:45] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All newflow hosts are migrated to Bookworm and thus FNM 1.2.4 [10:12:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10MoritzMuehlenhoff) >>! In T338468#8926447, @MoritzMuehlenhoff wrote: >>>! In T338468#8917166, @SLyngshede-WMF wrote: >> User have been added... 
[10:12:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:13:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49406 and previous config saved to /var/cache/conftool/dbconfig/20230613-101310-ladsgroup.json [10:13:14] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:14:31] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Mail: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This can be closed, the current debconf integration appears to be working fine. [10:15:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49407 and previous config saved to /var/cache/conftool/dbconfig/20230613-101548-root.json [10:15:53] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:15:56] (03CR) 10Effie Mouzeli: shellbox: Add service mesh envoy retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929354 (https://phabricator.wikimedia.org/T292633) (owner: 10Effie Mouzeli) [10:17:28] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4038.ulsfo.wmnet [10:17:56] (03CR) 10Marostegui: Add 2023/drop_revision_comment_temp_T338284.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) (owner: 10Ladsgroup) [10:18:06] 
!killed extensions/MachineVision/maintenance/prioritizeFilesWithTemplate.php it was blocking a depool in s4 [10:18:12] !log killed extensions/MachineVision/maintenance/prioritizeFilesWithTemplate.php it was blocking a depool in s4 [10:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:25] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4046.ulsfo.wmnet [10:18:42] (03PS1) 10Slyngshede: data.yaml: Add superpes as LDAP only user. [puppet] - 10https://gerrit.wikimedia.org/r/929672 (https://phabricator.wikimedia.org/T338468) [10:18:47] (03PS2) 10Ladsgroup: Add 2023/drop_revision_comment_temp_T338284.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) [10:18:49] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) a:05aborrero→03Papaul hey @Papaul or @Jhancock.wm would you please do the following: * disconnect server eno1 from... [10:19:02] (03CR) 10Ladsgroup: Add 2023/drop_revision_comment_temp_T338284.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) (owner: 10Ladsgroup) [10:19:13] (03PS2) 10Slyngshede: data.yaml: Add superpes as LDAP only users. 
[puppet] - 10https://gerrit.wikimedia.org/r/929672 (https://phabricator.wikimedia.org/T338468) [10:19:29] (03CR) 10Marostegui: [C: 03+1] Add 2023/drop_revision_comment_temp_T338284.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) (owner: 10Ladsgroup) [10:20:14] (03CR) 10Ladsgroup: [C: 03+2] Add 2023/drop_revision_comment_temp_T338284.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) (owner: 10Ladsgroup) [10:20:39] (03Merged) 10jenkins-bot: Add 2023/drop_revision_comment_temp_T338284.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/929668 (https://phabricator.wikimedia.org/T338284) (owner: 10Ladsgroup) [10:21:23] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49408 and previous config saved to /var/cache/conftool/dbconfig/20230613-102227-ladsgroup.json [10:22:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:22:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10SLyngshede-WMF) > The interim LDAP access needs a tracking entry I meant, it's currently alerting for a lack of it, see the mail sent root@... [10:23:50] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) @aborrero yep exactly, when this is done let me know I'll change the netbox side. 
[10:25:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:25:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:25:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:25:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2117.codfw.wmnet with reason: Maintenance [10:25:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:26:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:26:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:26:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:26:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance [10:26:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance [10:26:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance [10:26:58] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10MoritzMuehlenhoff) [10:27:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance [10:27:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: 
Maintenance [10:27:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929672 (https://phabricator.wikimedia.org/T338468) (owner: 10Slyngshede) [10:27:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [10:27:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance [10:27:50] (03CR) 10Slyngshede: [C: 03+2] data.yaml: Add superpes as LDAP only users. [puppet] - 10https://gerrit.wikimedia.org/r/929672 (https://phabricator.wikimedia.org/T338468) (owner: 10Slyngshede) [10:28:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance [10:28:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:28:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:28:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance [10:28:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance [10:29:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:29:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10MoritzMuehlenhoff) >>! In T334955#8925229, @Papaul wrote: > @MoritzMuehlenhoff > ` > > (initramfs) uname -a > Linux (none) 4.19.0-24-amd64 #1 SMP D... 
[10:29:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:29:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:30:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:30:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:30:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:30:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1168.eqiad.wmnet with reason: Maintenance [10:30:24] (03PS3) 10Slyngshede: data.yaml: Add user superpes to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/929623 (https://phabricator.wikimedia.org/T338468) [10:30:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1168.eqiad.wmnet with reason: Maintenance [10:30:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:30:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49409 and previous config saved to /var/cache/conftool/dbconfig/20230613-103053-root.json [10:30:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:31:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:31:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:31:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:31:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1201.eqiad.wmnet with reason: Maintenance [10:31:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1201.eqiad.wmnet with reason: Maintenance [10:31:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:32:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:32:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:32:18] jouncebot: nowandnext [10:32:18] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1000) [10:32:18] In 2 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300) [10:32:19] In 2 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300) [10:32:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1224.eqiad.wmnet with reason: Maintenance [10:32:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:32:36] (CR) Vgutierrez: [C: +1] service: move rest-gateway to production [puppet] - https://gerrit.wikimedia.org/r/920667 (https://phabricator.wikimedia.org/T329049) 
(owner: Hnowlan) [10:32:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:34:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:35:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:35:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:35:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:36:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:36:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:37:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:37:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P49410 and previous config saved to /var/cache/conftool/dbconfig/20230613-103734-ladsgroup.json [10:38:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:38:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:38:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:38:41] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:39:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:39:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:40:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1185.eqiad.wmnet with reason: Maintenance [10:40:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1185.eqiad.wmnet with reason: Maintenance [10:41:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:41:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:41:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1210.eqiad.wmnet with reason: Maintenance [10:42:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1210.eqiad.wmnet with reason: Maintenance [10:42:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:43:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:43:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:44:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance [10:44:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance [10:45:29] (PS1) Hnowlan: trafficserver: route proton requests via the API gateway [puppet] - https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) [10:45:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4039.ulsfo.wmnet [10:45:55] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [10:45:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49411 and previous config saved to /var/cache/conftool/dbconfig/20230613-104557-root.json [10:46:10] (CR) CI reject: [V: -1] trafficserver: route proton requests via the API gateway [puppet] - https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan) [10:46:19] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) [10:46:25] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance [10:46:43] !log reboot cp4039 and cp4047 for kernel upgrade (T335835) [10:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance [10:48:12] (PS2) Hnowlan: trafficserver: route proton requests via the API gateway [puppet] - https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) [10:48:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:48:23] (CR) CI reject: [V: -1] trafficserver: route proton requests via the API gateway [puppet] - https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: Hnowlan) [10:48:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [10:49:06] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:50:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:50:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:51:02] jouncebot: nowandnext [10:51:02] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1000) [10:51:02] In 2 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300) [10:51:02] In 2 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300) [10:52:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [10:52:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [10:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P49412 and previous config saved to /var/cache/conftool/dbconfig/20230613-105240-ladsgroup.json [10:53:10] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:53:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:55:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:56:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4039.ulsfo.wmnet [10:56:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4047.ulsfo.wmnet [10:56:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
1:00:00 on db2178.codfw.wmnet with reason: Maintenance [10:57:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance [11:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49413 and previous config saved to /var/cache/conftool/dbconfig/20230613-110102-root.json [11:02:09] (PS1) Stevemunene: Prevent removal of python2 on bullseye stat hosts [puppet] - https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) [11:02:19] (CR) CI reject: [V: -1] Prevent removal of python2 on bullseye stat hosts [puppet] - https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene) [11:04:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1130.eqiad.wmnet with reason: Maintenance [11:04:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1130.eqiad.wmnet with reason: Maintenance [11:05:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:05:18] (PS1) Hnowlan: thumbor: split expensive format poolcounter buckets [deployment-charts] - https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) [11:05:25] (CR) CI reject: [V: -1] thumbor: split expensive format poolcounter buckets [deployment-charts] - https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) (owner: Hnowlan) [11:05:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:06:12] gerrit is unwell it seems [11:06:30] (PS1) Ladsgroup: moveToExternal: Also check for utf8 encoding before trying to convert [core] (wmf/1.41.0-wmf.12) - 
https://gerrit.wikimedia.org/r/929648 [11:06:42] (CR) Ladsgroup: [C: +2] moveToExternal: Also check for utf8 encoding before trying to convert [core] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/929648 (owner: Ladsgroup) [11:07:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:07:28] (CR) Ladsgroup: [C: +2] Set medium wikis to read new for externallinks [mediawiki-config] - https://gerrit.wikimedia.org/r/929670 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup) [11:07:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:07:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:07:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:07:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49414 and previous config saved to /var/cache/conftool/dbconfig/20230613-110746-ladsgroup.json [11:07:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:07:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:07:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:08:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet 
with reason: Maintenance [11:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:08:16] (Merged) jenkins-bot: Set medium wikis to read new for externallinks [mediawiki-config] - https://gerrit.wikimedia.org/r/929670 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup) [11:08:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:08:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:08:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1178.eqiad.wmnet with reason: Maintenance [11:08:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1192.eqiad.wmnet with reason: Maintenance [11:09:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1192.eqiad.wmnet with reason: Maintenance [11:09:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1193.eqiad.wmnet with reason: Maintenance [11:09:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1193.eqiad.wmnet with reason: Maintenance [11:09:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1203.eqiad.wmnet with reason: Maintenance [11:09:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1203.eqiad.wmnet with reason: Maintenance [11:09:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1209.eqiad.wmnet with reason: Maintenance [11:09:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1209.eqiad.wmnet with reason: Maintenance [11:09:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
1:00:00 on db1211.eqiad.wmnet with reason: Maintenance [11:10:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1211.eqiad.wmnet with reason: Maintenance [11:10:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1214.eqiad.wmnet with reason: Maintenance [11:10:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1214.eqiad.wmnet with reason: Maintenance [11:10:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1216.eqiad.wmnet with reason: Maintenance [11:10:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1216.eqiad.wmnet with reason: Maintenance [11:10:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:10:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:10:48] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929670|Set medium wikis to read new for externallinks (T335343)]] [11:10:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:10:51] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [11:11:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:11:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:11:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:11:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2152.codfw.wmnet 
with reason: Maintenance [11:11:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2152.codfw.wmnet with reason: Maintenance [11:11:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:11:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:11:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:12:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:12:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2162.codfw.wmnet with reason: Maintenance [11:12:25] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929670|Set medium wikis to read new for externallinks (T335343)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [11:12:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2162.codfw.wmnet with reason: Maintenance [11:12:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:12:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2163.codfw.wmnet with reason: Maintenance [11:12:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:13:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:13:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:13:13] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:13:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:13:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:13:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2167.codfw.wmnet with reason: Maintenance [11:13:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2167.codfw.wmnet with reason: Maintenance [11:13:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:14:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:14:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:14:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:15:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:15:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:15:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2165.codfw.wmnet with reason: Maintenance [11:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2165.codfw.wmnet with reason: Maintenance [11:15:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:15:38] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:15:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:15:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:15:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P49415 and previous config saved to /var/cache/conftool/dbconfig/20230613-111549-ladsgroup.json [11:15:54] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49416 and previous config saved to /var/cache/conftool/dbconfig/20230613-111607-root.json [11:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:16:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1127.eqiad.wmnet with reason: Maintenance [11:16:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:16:44] (PS7) Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:16:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:16:52] (CR) CI reject: [V: -1] Add a define to declare an nftables set in Puppet [puppet] - https://gerrit.wikimedia.org/r/929315 
(https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [11:17:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:17:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:17:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:17:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:17:32] (PS8) Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [11:17:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:17:42] (CR) CI reject: [V: -1] Add a define to declare an nftables set in Puppet [puppet] - https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [11:17:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:18:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:18:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:18:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:18:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:18:47] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:19:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:19:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:19:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:19:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:19:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:19:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:19:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:20:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:20:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:20:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:20:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:20:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2108.codfw.wmnet with reason: Maintenance [11:20:57] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929670|Set medium wikis to read new for externallinks (T335343)]] (duration: 10m 09s) [11:21:00] T335343: Set externallinks migration stage to read new 
on beta and production - https://phabricator.wikimedia.org/T335343 [11:21:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2108.codfw.wmnet with reason: Maintenance [11:21:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:21:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:21:49] (CR) Hnowlan: [C: +2] service: move rest-gateway to production [puppet] - https://gerrit.wikimedia.org/r/920667 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan) [11:21:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:22:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:22:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:22:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2122.codfw.wmnet with reason: Maintenance [11:22:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:23:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:23:12] (PS2) Daimona Eaytoy: filtered_tables.txt: Update for CampaignEvents schema change [puppet] - https://gerrit.wikimedia.org/r/925921 [11:23:22] (CR) CI reject: [V: -1] filtered_tables.txt: Update for CampaignEvents schema change [puppet] - https://gerrit.wikimedia.org/r/925921 (owner: Daimona Eaytoy) [11:23:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2159.codfw.wmnet with reason: Maintenance 
[11:23:33] (PS1) Clément Goubert: kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) [11:23:43] (CR) CI reject: [V: -1] kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: Clément Goubert) [11:23:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:23:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:23:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:24:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:24:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:24:13] (PS2) Clément Goubert: kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) [11:24:15] (PS3) Daimona Eaytoy: filtered_tables.txt: Update for CampaignEvents schema change [puppet] - https://gerrit.wikimedia.org/r/925921 [11:24:23] (CR) CI reject: [V: -1] kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: Clément Goubert) [11:24:25] (PS2) Stevemunene: Prevent removal of python2 on bullseye stat hosts [puppet] - https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) [11:24:27] (CR) jenkins-bot: filtered_tables.txt: Update for CampaignEvents schema change [puppet] - https://gerrit.wikimedia.org/r/925921 (owner: 
10Daimona Eaytoy) [11:24:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:24:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:24:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:25:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2182.codfw.wmnet with reason: Maintenance [11:25:42] (03Merged) 10jenkins-bot: moveToExternal: Also check for utf8 encoding before trying to convert [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/929648 (owner: 10Ladsgroup) [11:25:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:26:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:26:06] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:26:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2118.codfw.wmnet with reason: Maintenance [11:26:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2118.codfw.wmnet with reason: Maintenance [11:26:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [11:26:44] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929648|moveToExternal: Also check for utf8 encoding before trying to convert]] [11:27:21] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41687/console" [puppet] - 10https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [11:27:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:27:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:27:52] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41686/console" [puppet] - 10https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:28:26] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929648|moveToExternal: Also check for utf8 encoding before trying to convert]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:29:17] hnowlan: did you reach out to anyone about gerrit having rebase confusion? 
[11:31:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49417 and previous config saved to /var/cache/conftool/dbconfig/20230613-113111-root.json [11:31:32] (03PS3) 10Ladsgroup: ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:31:35] (03CR) 10Ladsgroup: [C: 03+2] ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:31:39] (03CR) 10CI reject: [V: 04-1] ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:31:42] claime: no, not yet [11:31:42] (03CR) 10CI reject: [V: 04-1] ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:32:34] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4048.ulsfo.wmnet [11:32:36] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4040.ulsfo.wmnet [11:32:46] !log reboot cp4040 and cp4048 for kernel upgrade (T335835) [11:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:36] (03CR) 10Ladsgroup: [C: 03+2] "ha?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:34:27] (03Merged) 10jenkins-bot: ores: override Beta cluster liftwing URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929352 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:35:15] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T329049) [11:35:18] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 [11:36:44] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929648|moveToExternal: Also check for utf8 encoding before trying to convert]] (duration: 09m 59s) [11:37:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:37:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:37:39] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T329049) [11:39:49] (03PS3) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) [11:39:59] (03CR) 10CI reject: [V: 04-1] Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [11:40:43] (03PS4) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) [11:40:44] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T329049) [11:40:48] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 [11:41:01] (03CR) 10Ayounsi: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [11:41:05] (03CR) 10CI reject: [V: 04-1] Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [11:41:38] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T329049) [11:42:46] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4040.ulsfo.wmnet [11:42:59] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929676 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:43:29] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4048.ulsfo.wmnet [11:43:34] (03PS1) 10Jbond: rake: fix spdx typo [puppet] - 10https://gerrit.wikimedia.org/r/929680 [11:43:36] (03PS1) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 [11:43:45] (03CR) 10CI reject: [V: 04-1] rake: fix spdx typo [puppet] - 10https://gerrit.wikimedia.org/r/929680 (owner: 10Jbond) [11:43:47] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (owner: 10Jbond) [11:44:04] (03PS2) 10Jbond: rake: fix spdx typo [puppet] - 10https://gerrit.wikimedia.org/r/929680 [11:44:14] (03PS2) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 [11:44:27] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (owner: 10Jbond) [11:45:15] !log cat wikis_having_stubs | xargs -I {} bash -c 'echo {}; touch /home/ladsgroup/{}.undo.sql; chmod 777 /home/ladsgroup/{}.undo.sql; mwscript 
maintenance/storage/moveToExternal.php --wiki={} --end 200000000 --undo /home/ladsgroup/{}.undo.sql DB cluster26' (T299387) [11:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] T299387: Database corruption due to compressOld array plus bug, April 2006 - https://phabricator.wikimedia.org/T299387 [11:46:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:46:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:47:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:49:12] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:51:31] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (10MoritzMuehlenhoff) p:05Medium→03Low [11:53:41] (03PS3) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) [11:53:52] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [11:56:06] 10SRE, 10Infrastructure-Foundations: Consider OS level tracking/configuration of performance/powersaving settings - https://phabricator.wikimedia.org/T338944 (10MoritzMuehlenhoff) [11:56:14] 10SRE, 10Infrastructure-Foundations: Consider OS level tracking/configuration of performance/powersaving settings - https://phabricator.wikimedia.org/T338944 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:56:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:56:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:56:54] (03PS4) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) [11:56:56] (03PS1) 10Jbond: DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) [11:57:05] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [11:57:08] (03CR) 10CI reject: [V: 04-1] DO NOt MERGE: testing CI [puppet] - 
10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [11:57:13] (03PS5) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) [11:57:21] (03PS2) 10Jbond: DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) [11:57:25] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10Performance-Team (Radar): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10MoritzMuehlenhoff) 05Open→03Declined The hardware in this task been replaced, closing the task. I've opened T338944 for a more... [11:57:35] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [11:57:42] (03CR) 10Jbond: [C: 03+2] rake: fix spdx typo [puppet] - 10https://gerrit.wikimedia.org/r/929680 (owner: 10Jbond) [11:57:49] (03CR) 10CI reject: [V: 04-1] DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [11:57:56] jbond: CI issues see #wikimedia-releng [11:58:46] (03CR) 10Effie Mouzeli: [C: 04-1] "I think we should break this into 2 commits, one for the modules part, and one with the MediaWiki chart changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [11:59:32] (03CR) 10Hashar: admin: reserve gerrit uid/gid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar) [11:59:41] (03PS3) 10Hashar: admin: reserve gerrit uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) [12:04:01] (03CR) 10Jbond: [C: 03+1] "lgtm will merge" 
[puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar) [12:04:06] (03CR) 10Jbond: [C: 03+2] admin: reserve gerrit uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) (owner: 10Hashar) [12:04:26] RhinosF1: ack cheers [12:04:36] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) [12:04:48] (03CR) 10CI reject: [V: 04-1] openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [12:05:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:06:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:06:24] (03PS6) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) [12:06:32] (03PS3) 10Jbond: DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) [12:06:35] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [12:06:46] (03CR) 10CI reject: [V: 04-1] DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [12:07:00] I am restarting Zuul CI due to T309376 [12:07:01] T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376 [12:09:43] !log Restarted Zuul CI due to T309376 [12:09:45] (03PS1) 10Slyngshede: C:IDM 
Don't show users name. [puppet] - 10https://gerrit.wikimedia.org/r/929688 [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:10] (03CR) 10Jbond: [C: 03+1] "LGTM one optional nit i missed, feel free to ping me and i can expand on the details" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:10:46] (03CR) 10Hashar: "recheck due to CI issue T309376" [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [12:10:53] (03PS1) 10Cathal Mooney: Remove optional var to set COS buffers for QFX/EX switches [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) [12:11:19] (03CR) 10Slyngshede: [C: 03+2] C:IDM Don't show users name. [puppet] - 10https://gerrit.wikimedia.org/r/929688 (owner: 10Slyngshede) [12:11:52] (03CR) 10Cathal Mooney: [C: 03+1] Remove cloudsw-loopback.pol (folded into common-loopback) [homer/public] - 10https://gerrit.wikimedia.org/r/929316 (owner: 10Ayounsi) [12:12:33] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201) (owner: 10Ayounsi) [12:13:28] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The main issue here is resolved since 2018 with the merge o... 
[12:15:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41689/console" [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:15:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:15:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:15:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:15:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:16:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P49418 and previous config saved to /var/cache/conftool/dbconfig/20230613-121611-ladsgroup.json [12:16:15] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:16:40] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [12:16:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41690/console" [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:17:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] "LGTm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:18:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4049.ulsfo.wmnet [12:18:30] !log fabfur@cumin1001 
START - Cookbook sre.hosts.reboot-single for host cp4041.ulsfo.wmnet [12:18:36] !log reboot cp4041 and cp4049 for kernel upgrade (T335835) [12:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:54] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp: add gitlab-replicas and gitlab_replica_oidc config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:22:34] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Prevent removal of python2 on bullseye stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/929675 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene) [12:25:23] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) [12:25:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:26:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:26:03] (03PS8) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [12:26:26] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:27:03] (03PS7) 10Jbond: rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) [12:27:10] (03PS4) 10Jbond: DO NOt MERGE: testing CI [puppet] - 10https://gerrit.wikimedia.org/r/929685 (https://phabricator.wikimedia.org/T182641) [12:27:30] (03PS9) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [12:27:36] (03CR) 10CI reject: [V: 04-1] rake: add rake task to detect and fix CRLF line endings [puppet] - 10https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: 10Jbond) [12:27:46] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:56] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:27:59] 10SRE, 10Infrastructure-Foundations, 10Performance-Team (Radar): Consider OS level tracking/configuration of performance/powersaving settings - https://phabricator.wikimedia.org/T338944 (10Krinkle) [12:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4041.ulsfo.wmnet [12:29:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4049.ulsfo.wmnet [12:30:25] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) [12:30:38] (03CR) 10CI reject: [V: 04-1] openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [12:31:10] (03PS10) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [12:31:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P49419 and previous config saved to /var/cache/conftool/dbconfig/20230613-123117-ladsgroup.json [12:31:34] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:33:50] (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: services: disable slapd mirror mode [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) [12:35:10] (03PS1) 10Alexandros Kosiaris: changeprop-jobqueue: Increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/929691 (https://phabricator.wikimedia.org/T329366) [12:35:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:35:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:38:11] (03CR) 10Ayounsi: [C: 03+2] Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [12:38:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] changeprop-jobqueue: Increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/929691 (https://phabricator.wikimedia.org/T329366) (owner: 10Alexandros Kosiaris) [12:38:31] (03PS9) 10Muehlenhoff: Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) [12:39:32] (03Merged) 10jenkins-bot: changeprop-jobqueue: Increase memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/929691 
(https://phabricator.wikimedia.org/T329366) (owner: 10Alexandros Kosiaris) [12:44:02] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:44:16] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:44:20] (03PS11) 10Slyngshede: P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [12:44:34] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:45:23] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:45:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:45:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:45:51] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:46:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P49420 and previous config saved to /var/cache/conftool/dbconfig/20230613-124623-ladsgroup.json [12:46:34] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:47:54] (03PS1) 10Alexandros Kosiaris: recommendation-api: restbase isn't used for anything [deployment-charts] - 10https://gerrit.wikimedia.org/r/929695 (https://phabricator.wikimedia.org/T338471) [12:48:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41691/console" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:49:47] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/929315 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:50:26] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikidiff2 for TheresNoTime - https://phabricator.wikimedia.org/T338948 (10TheresNoTime) [12:51:30] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet [12:51:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4042.ulsfo.wmnet [12:51:42] !log reboot cp4042 and cp4050 for kernel upgrade (T335835) [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:20] taavi: thanks! [12:52:59] (03CR) 10Jbond: [C: 04-1] "missed one" [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:54:19] (03CR) 10Marostegui: "Oh sorry - I wasn't aware of this patch and merged the one removing the two tables!" [puppet] - 10https://gerrit.wikimedia.org/r/925921 (owner: 10Daimona Eaytoy) [12:55:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [12:55:10] (03PS1) 10Arturo Borrero Gonzalez: pdns_server: make default-soa-content configurable [puppet] - 10https://gerrit.wikimedia.org/r/929697 (https://phabricator.wikimedia.org/T338938) [12:55:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300). Please do the needful. [13:00:04] MichaelG_WMDE and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1300) [13:00:22] i can deploy today [13:00:53] MichaelG_WMDE: tgr_: hi! [13:00:59] o/ [13:01:00] !log installing nbconvert security updates [13:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:15] Hi! [13:01:17] (03CR) 10Urbanecm: [C: 03+2] Section images: Fix image placeholder alignment for RTL content [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929464 (https://phabricator.wikimedia.org/T338837) (owner: 10Kosta Harlan) [13:01:26] sorry, meeting overran [13:01:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T336886)', diff saved to https://phabricator.wikimedia.org/P49421 and previous config saved to /var/cache/conftool/dbconfig/20230613-130129-ladsgroup.json [13:01:33] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:01:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4042.ulsfo.wmnet [13:01:50] MichaelG_WMDE: no worries. just to confirm, is it ok to go ahead with your patches even though the train might rollback later in the week? [13:02:07] (03PS1) 10Slyngshede: P:IDM Switch production server to MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/929699 [13:02:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet [13:02:14] (it is currently on group0 as you noted on the calendar, but it might change) [13:02:38] yes, it is ok [13:02:44] okay [13:02:50] thank you! [13:03:02] mind removing the -1 on the second patch? 
:) [13:03:46] (03PS2) 10Urbanecm: Drop disabling removed Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928600 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:03:50] (03CR) 10Urbanecm: [C: 03+2] Drop disabling removed Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928600 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:04:00] (03CR) 10Michael Große: [C: 03+1] "T332724 is ready enough, so this is good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:04:06] thanks [13:04:33] (03PS2) 10Urbanecm: Testwikidatawiki: Enable new EntitySchema Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:04:37] (03CR) 10Urbanecm: [C: 03+2] Testwikidatawiki: Enable new EntitySchema Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:04:45] (03Merged) 10jenkins-bot: Drop disabling removed Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928600 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:04:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:05:24] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41692/console" [puppet] - 10https://gerrit.wikimedia.org/r/929699 (owner: 10Slyngshede) [13:05:35] (03Merged) 10jenkins-bot: Testwikidatawiki: Enable new EntitySchema Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [13:06:01] !log urbanecm@deploy1002 Started scap: 
Backport for [[gerrit:928600|Drop disabling removed Datatype (T332724)]], [[gerrit:928601|Testwikidatawiki: Enable new EntitySchema Datatype (T332724)]] [13:06:04] T332724: [ES-M2]: Enable new EntitySchema data type on Test Wikidata - https://phabricator.wikimedia.org/T332724 [13:07:38] !log urbanecm@deploy1002 migr and urbanecm: Backport for [[gerrit:928600|Drop disabling removed Datatype (T332724)]], [[gerrit:928601|Testwikidatawiki: Enable new EntitySchema Datatype (T332724)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:07:51] MichaelG_WMDE: your patch's at mwdebug1002. can you test it there please? [13:08:17] will have a look right away! [13:08:52] I see it now on https://test.wikidata.org/wiki/Special:ListDatatypes - it works, thank you :) [13:09:23] great, syncing! [13:10:21] (03PS1) 10Btullis: Update the cumin alias for analytics-airflow [puppet] - 10https://gerrit.wikimedia.org/r/929702 (https://phabricator.wikimedia.org/T333697) [13:11:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928920 [13:11:40] (03PS2) 10Slyngshede: P:IDM Switch production server to MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/929699 [13:12:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41694/console" [puppet] - 10https://gerrit.wikimedia.org/r/929699 (owner: 10Slyngshede) [13:13:33] (03CR) 10Slyngshede: P:IDM Switch production server to MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/929699 (owner: 10Slyngshede) [13:13:45] (03PS12) 10Jbond: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:13:54] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928920 (owner: 10PipelineBot) [13:14:44] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928920 (owner: 10PipelineBot) [13:15:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:928600|Drop disabling removed Datatype (T332724)]], [[gerrit:928601|Testwikidatawiki: Enable new EntitySchema Datatype (T332724)]] (duration: 09m 29s) [13:15:35] T332724: [ES-M2]: Enable new EntitySchema data type on Test Wikidata - https://phabricator.wikimedia.org/T332724 [13:15:38] * urbanecm waiting on CI now [13:16:01] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:16:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:17:00] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41695/console" [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:18:24] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:19:09] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:20:16] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:20:22] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:21:03] !log 
jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:21:10] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:21:44] (03CR) 10Jbond: P:hive::client move beeline script to files. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:21:47] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:23:28] (03CR) 10Elukey: [C: 03+1] recommendation-api: restbase isn't used for anything [deployment-charts] - 10https://gerrit.wikimedia.org/r/929695 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:23:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:23:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:24:47] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4051.ulsfo.wmnet [13:24:49] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4043.ulsfo.wmnet [13:25:00] !log reboot cp4043 and cp4051 for kernel upgrade (T335835) [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:12] (03CR) 10Ladsgroup: "I'm waiting for T338357 to resolve before moving forward (and a double check of how things are done)" [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [13:27:18] (03PS1) 10MVernon: Revert "tlsproxy::localssl: drop class" [puppet] - 10https://gerrit.wikimedia.org/r/929650 [13:27:36] @urbanecm I can see the feature now on test.wikidata.org even without WikimediaDebug. Thank you for deploying it! 
🙏 [13:27:36] (03PS5) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [13:27:38] (03PS1) 10Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) [13:27:44] MichaelG_WMDE: great! any time :) [13:28:37] (03CR) 10Ayounsi: [C: 03+2] Remove cloudsw-loopback.pol (folded into common-loopback) [homer/public] - 10https://gerrit.wikimedia.org/r/929316 (owner: 10Ayounsi) [13:29:16] (03Merged) 10jenkins-bot: Remove cloudsw-loopback.pol (folded into common-loopback) [homer/public] - 10https://gerrit.wikimedia.org/r/929316 (owner: 10Ayounsi) [13:29:22] (03CR) 10Clément Goubert: modules: Add preStop sleep and draining to mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [13:29:24] (03CR) 10CI reject: [V: 04-1] Revert "tlsproxy::localssl: drop class" [puppet] - 10https://gerrit.wikimedia.org/r/929650 (owner: 10MVernon) [13:30:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:30:49] (03Merged) 10jenkins-bot: Section images: Fix image placeholder alignment for RTL content [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929464 (https://phabricator.wikimedia.org/T338837) (owner: 10Kosta Harlan) [13:30:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:30:55] (03Restored) 10Esanders: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902154 (https://phabricator.wikimedia.org/T331502) (owner: 10Samtar) [13:31:05] (03Abandoned) 10Esanders: Revert 
"Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902154 (https://phabricator.wikimedia.org/T331502) (owner: 10Samtar) [13:31:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] "overriding CI" [puppet] - 10https://gerrit.wikimedia.org/r/929650 (owner: 10MVernon) [13:31:12] (03CR) 10Clément Goubert: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [13:31:15] 👀 [13:31:31] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:929464|Section images: Fix image placeholder alignment for RTL content (T338837)]] [13:31:35] T338837: Section-Level Images: Image placeholder should appear on left while in suggestions mode for RTL languages - https://phabricator.wikimedia.org/T338837 [13:31:44] * urbanecm waves to TheresNoTime [13:32:16] * TheresNoTime got pinged by the Restored/Abandoned patch ^ [13:32:23] (03PS13) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [13:32:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Tracking-Neverending: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388 (10jbond) [13:32:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:33:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:33:02] !log urbanecm@deploy1002 kharlan and urbanecm: Backport for [[gerrit:929464|Section images: Fix image placeholder alignment for RTL content (T338837)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:33:20] 10Puppet, 10SRE, 10SRE-swift-storage, 10Maps: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393 (10jbond) 05Resolved→03Open This is still in use by swift and maps [13:33:31] (03PS2) 10Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) [13:33:33] 10Puppet, 10SRE, 10SRE-swift-storage, 10Maps: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393 (10MatthewVernon) [the class has been put back, because it's still in use] [13:33:41] (03CR) 10Ayounsi: Remove optional var to set COS buffers for QFX/EX switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [13:33:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928921 [13:34:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: 
Maintenance [13:34:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:34:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:34:27] tgr_: your patch's at the debug server, if testable there. [13:34:37] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:34:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:35:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:35:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4043.ulsfo.wmnet [13:35:18] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4051.ulsfo.wmnet [13:35:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:35:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) @MoritzMuehlenhoff yes, I did pass -pxe-media installer510 to the reimage cookbook. I can try again. [13:36:27] urbanecm: not testable, the feature is not enabled in production yet. [13:36:36] thought so. proceeding then. 
[13:37:15] (03PS3) 10Clément Goubert: kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) [13:38:41] (03PS1) 10Zabe: Revert "noc: Switch default selection on db.php from eqiad to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929709 [13:38:46] (03CR) 10Ssingh: [C: 03+1] "Let me know if you want someone to merge it. Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [13:39:09] (03CR) 10Majavah: [V: 03+1] bird: prune unmanaged anycast-healthchecker checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [13:39:16] (03PS5) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [13:39:34] (03CR) 10Ssingh: [C: 03+2] bird: prune unmanaged anycast-healthchecker checks [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [13:39:58] (03Abandoned) 10Ayounsi: homer: update tests for graphQL [software/homer] - 10https://gerrit.wikimedia.org/r/929324 (owner: 10Jbond) [13:41:19] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [13:41:38] (03CR) 10Ayounsi: [C: 03+2] Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201) (owner: 10Ayounsi) [13:41:45] !log disable puppet on R:Class bird::anycast_healthchecker to merge CR 928804 [13:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: 
10EoghanGaffney) [13:42:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:929464|Section images: Fix image placeholder alignment for RTL content (T338837)]] (duration: 10m 29s) [13:42:04] T338837: Section-Level Images: Image placeholder should appear on left while in suggestions mode for RTL languages - https://phabricator.wikimedia.org/T338837 [13:42:06] tgr_: should be live [13:42:08] anything else, anyone? [13:42:15] (03Merged) 10jenkins-bot: Prioritize direct peers connected to primary IXP [homer/public] - 10https://gerrit.wikimedia.org/r/929301 (https://phabricator.wikimedia.org/T338201) (owner: 10Ayounsi) [13:42:24] thanks urbanecm! [13:42:27] np [13:42:39] sukhe: did you notice the new script you can use via cumin to detect files that patch would purge? [13:43:38] taavi: the other CR you submitted right? [13:43:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/928857 this one? [13:43:51] yes [13:44:11] yes, I did notice it, very helpful [13:44:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:44:38] great [13:44:45] if you were hinting at the puppet disable above, it's because it touches a few critical stuff so for bird changes, I almost always do it [13:44:53] (recdns being the primary one here) [13:45:11] !re-enable puppet on R:Class bird::anycast_healthchecker [13:45:17] urbanecm: if you are done, I would briefly merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/929709/ [13:45:24] (03PS2) 10Cathal Mooney: Remove optional var to set COS buffers for QFX/EX switches [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) [13:45:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2109.codfw.wmnet with reason: 
Maintenance [13:45:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:45:34] zabe: floor's yours [13:45:54] that was indeed the reason I was asking, that makes sense [13:46:00] (03CR) 10Cathal Mooney: Remove optional var to set COS buffers for QFX/EX switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [13:46:12] (03CR) 10Zabe: [C: 03+2] Revert "noc: Switch default selection on db.php from eqiad to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929709 (owner: 10Zabe) [13:46:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:46:58] (03Merged) 10jenkins-bot: Revert "noc: Switch default selection on db.php from eqiad to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929709 (owner: 10Zabe) [13:47:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:47:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:47:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:48:21] (03PS14) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [13:49:20] taavi: patch deployed, thanks for submitting it [13:49:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:33] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:51:11] (03PS1) 10Zabe: Revert "switch noc.wikimedia.org from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/929652 (https://phabricator.wikimedia.org/T331634) [13:51:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Tracking-Neverending: Puppet: tracking catalogs that changes at every run - https://phabricator.wikimedia.org/T191388 (10jbond) [13:51:40] (03PS2) 10Zabe: Revert "switch noc.wikimedia.org from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/929652 (https://phabricator.wikimedia.org/T331634) [13:51:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:52:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10BTullis) [13:52:07] 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10BTullis) [13:52:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:52:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [13:52:15] 10Puppet, 10SRE, 10SRE-swift-storage, 
10Maps: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393 (10jbond) 05Open→03Resolved This is currently only used by maps and swift and @MatthewVernon has confirmed we don't see this issue on those machines [13:52:23] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10Papaul) @aborrero can we move this to ge-0/0/11 and not ge-0/0/36? [13:52:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [13:52:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:56:08] (03PS1) 10Effie Mouzeli: iPoid: update ENV variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/929711 [13:56:28] (03CR) 10Clément Goubert: [C: 03+1] "Thanks, good catch!" [dns] - 10https://gerrit.wikimedia.org/r/929652 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [13:56:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [13:57:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [14:00:02] claime: could you take care of getting that ^ deployed? 
[14:00:22] zabe: for sure [14:00:36] thanks :) [14:01:19] (03CR) 10Clément Goubert: [C: 03+2] Revert "switch noc.wikimedia.org from eqiad to codfw" [dns] - 10https://gerrit.wikimedia.org/r/929652 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [14:02:14] (03PS1) 10Btullis: Fail over hive services to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/929712 (https://phabricator.wikimedia.org/T303168) [14:03:13] !log Revert noc.wikimedia.org to eqiad, running authdns-update - T331634 [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] T331634: switch noc.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T331634 [14:03:42] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10Papaul) @cmooney is it possible when moving servers from asw to cloudsw try to connect that server on the interface that matches... 
[14:04:45] (03PS1) 10EoghanGaffney: doc: Clean up leftover bits from switch to quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) [14:04:48] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T338915 (10Papaul) 05Open→03Resolved a:03Papaul [14:05:06] zabe: all done [14:05:17] thanks for the patch [14:05:34] (03PS3) 10JHathaway: tshark: drop debconf::seen [puppet] - 10https://gerrit.wikimedia.org/r/928644 [14:05:51] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet [14:05:52] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet [14:05:58] !log reboot cp4044 and cp4052 for kernel upgrade (T335835) [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:18] (03PS4) 10JHathaway: tshark: drop debconf::seen [puppet] - 10https://gerrit.wikimedia.org/r/928644 [14:06:37] (03CR) 10Jbond: P:hive::client move beeline script to files. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [14:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:56] yw [14:07:57] (03PS13) 10Ssingh: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/922514 [14:07:59] (03PS15) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [14:09:34] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41696/console" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:10:06] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) >>! In T338778#8927696, @Papaul wrote: > @aborrero can we move this to ge-0/0/11 and not ge-0/0/36? yes, I'm fine with... [14:10:35] (03PS1) 10Aklapper: phabricator: Add second recipient for quarterly/yearly metrics emails [puppet] - 10https://gerrit.wikimedia.org/r/929714 (https://phabricator.wikimedia.org/T338955) [14:10:37] (03PS6) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [14:10:43] (03CR) 10JHathaway: [C: 03+2] tshark: drop debconf::seen (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928644 (owner: 10JHathaway) [14:10:56] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [14:11:06] (03CR) 10Ssingh: [V: 03+1] "Changes since the last review: might be bike-sheddy but I updated the key to be anycast_services instead of anycast_service (extra s), to " [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:11:15] (03CR) 10Aklapper: "Note that I do not know if this is the correct array syntax" [puppet] - 10https://gerrit.wikimedia.org/r/929714 (https://phabricator.wikimedia.org/T338955) (owner: 10Aklapper) [14:11:51] (03CR) 10Ssingh: [V: 03+1] P:bird::anycast_healthchecker: allow binding to multiple services (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:12:23] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [14:12:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:12:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:13:40] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (10akosiaris) 05Open→03Resolved a:03akosiaris Makes sense. Added in [Phase 9 of the Switchover](h... [14:15:51] (03PS16) 10Slyngshede: P:hive::client move beeline script to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) [14:16:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4044.ulsfo.wmnet [14:16:34] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet [14:17:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:20] (03CR) 10CI reject: [V: 04-1] P:hive::client move beeline script to files. [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [14:24:10] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/929697/41697/" [puppet] - 10https://gerrit.wikimedia.org/r/929697 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez) [14:25:42] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] pdns_server: make default-soa-content configurable [puppet] - 10https://gerrit.wikimedia.org/r/929697 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez) [14:26:12] (03CR) 10Cathal Mooney: [C: 03+2] Improve logic getting switch port when primary IP is on bridge device (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [14:27:47] (03PS1) 10Jelto: gitlab: make oauth client identifier configurable [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) [14:28:16] (03PS1) 10Herron: pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) [14:28:48] (03CR) 10CI reject: [V: 04-1] pyrra: add 
pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [14:29:05] (03Merged) 10jenkins-bot: Improve logic getting switch port when primary IP is on bridge device [cookbooks] - 10https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [14:29:38] (03PS7) 10Jbond: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [14:29:43] (03CR) 10Ayounsi: [C: 03+1] Remove optional var to set COS buffers for QFX/EX switches [homer/public] - 10https://gerrit.wikimedia.org/r/929689 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [14:30:35] (03PS2) 10Herron: pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) [14:31:25] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41698/console" [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [14:31:27] (03CR) 10Andrew Bogott: [C: 03+1] "temporary, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [14:31:36] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [14:32:04] (03CR) 10Hnowlan: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [14:33:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:33:29] (03PS3) 10Herron: pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) [14:33:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:34:56] (03CR) 10Btullis: [C: 03+2] Fail over hive services to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/929712 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [14:35:02] (03PS2) 10Btullis: Fail over hive services to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/929712 (https://phabricator.wikimedia.org/T303168) [14:35:21] (03CR) 10Muehlenhoff: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:36:59] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10lmata) >>! In T337829#8905433, @MatthewVernon wrote: > @nskaggs you're the listed approver for the `wmcs-roots` group, are you OK to approve access to that? > @lmata are you ha... 
[14:37:35] (03PS4) 10Herron: pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) [14:40:11] (03PS1) 10Cathal Mooney: Rename variable to match convention elsewhere in cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/929721 (https://phabricator.wikimedia.org/T296832) [14:40:16] (03CR) 10Btullis: [C: 03+2] wikireplicas: drop views for pagetriage_log [puppet] - 10https://gerrit.wikimedia.org/r/884454 (https://phabricator.wikimedia.org/T325519) (owner: 10Majavah) [14:40:46] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/922514 (owner: 10Ssingh) [14:44:21] (03CR) 10Cathal Mooney: [C: 03+2] Rename variable to match convention elsewhere in cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/929721 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [14:44:35] (03PS2) 10Jelto: gitlab: make oauth client identifier configurable [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) [14:46:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929699 (owner: 10Slyngshede) [14:46:47] (03Merged) 10jenkins-bot: Rename variable to match convention elsewhere in cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/929721 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [14:47:17] moritzm: sorry to bother you again on this, but the bookworm image still doesn't seem to be available (I'm trying to pull docker-registry.wikimedia.org/bookworm:latest). could you check if something went wrong with the build? [14:47:24] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:47:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] modules: Add preStop sleep and draining to mesh (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [14:47:48] (03CR) 10Cwhite: [C: 03+2] lvs: remove lvs::monitor_services [puppet] - 10https://gerrit.wikimedia.org/r/925120 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [14:48:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41699/console" [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [14:48:48] (03PS3) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [14:49:13] (03CR) 10Hashar: "Trivial rebased since parent change had a typo fix" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:49:20] (03PS4) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:49:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] modules: Add preStop sleep and draining to mesh (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [14:49:28] (03CR) 10Hashar: "Trivial rebased since grand parent change had a typo fix" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:49:37] taavi: I ran the job for the base image yesterday and it completed without an error, but maybe something else is needed/broken? 
I'll have a closer look tomorrow [14:50:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] modules: Add preStop sleep and draining to mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [14:51:05] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:51:16] (03CR) 10Jbond: [C: 04-1] Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:51:18] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:52:41] (03PS1) 10Ottomata: eventgate-main - Bump image version to pick up change to mediawiki/page/change [deployment-charts] - 10https://gerrit.wikimedia.org/r/929725 (https://phabricator.wikimedia.org/T337395) [14:52:52] (03CR) 10JHathaway: [C: 03+2] dev env: make container facts structured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:53:11] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) 05Open→03Resolved a:03cmooney Above patch implements the logic from the "Re-image cookbook changes" in... [14:53:16] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928922 [14:53:30] (03CR) 10JHathaway: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/929161 (https://phabricator.wikimedia.org/T337972) (owner: 10Jbond) [14:54:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:54:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:54:47] (03CR) 10Ottomata: [C: 03+2] eventgate-main - Bump image version to pick up change to mediawiki/page/change [deployment-charts] - 10https://gerrit.wikimedia.org/r/929725 (https://phabricator.wikimedia.org/T337395) (owner: 10Ottomata) [14:54:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:55:41] (03Merged) 10jenkins-bot: eventgate-main - Bump image version to pick up change to mediawiki/page/change [deployment-charts] - 10https://gerrit.wikimedia.org/r/929725 (https://phabricator.wikimedia.org/T337395) (owner: 10Ottomata) [14:56:23] (03CR) 10JHathaway: [C: 03+2] dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:56:48] (03PS3) 10Hashar: fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) [14:56:50] (03PS4) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [14:56:52] (03PS5) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:57:02] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927674 
(https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:57:05] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:57:08] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:57:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [14:57:18] (03CR) 10Jelto: [V: 03+1] "I'm struggling a bit to configure the client_options identifier in this puppet change. The correct identifier shows up in PCC (gitlab_oidc" [puppet] - 10https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [14:57:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [14:57:46] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [14:58:05] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [14:58:30] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [14:59:16] (03PS1) 10Cathal Mooney: Add user samtar to shell group wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/929726 (https://phabricator.wikimedia.org/T337829) [14:59:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:59:43] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [15:00:18] PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:00:29] !log run kafka re-assign partitions for eqiad.change-prop.transcludes.resource-change on kafka-main1001 - T338357 [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:34] T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357 [15:00:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:21] (03PS2) 10Cathal Mooney: Add user samtar to shell group wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/929726 (https://phabricator.wikimedia.org/T337829) [15:01:25] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [15:02:04] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [15:02:52] (03CR) 10JHathaway: [C: 03+2] dev env: nrpe listen on all interfaces in a container (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:03:32] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) @Papaul why do we need to change? Easiest thing here is just move the cable from eno2 to eno1 on the server side, then... 
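Note: the `!log user@host START|END (STATUS) - Cookbook name (exit_code=N)` entries in this log follow a regular shape, so failed runs can be pulled out mechanically. The following is a minimal sketch, assuming entries look exactly like the ones quoted here; the function names `parse_cookbook_events` and `failed_runs` are hypothetical and not part of any Wikimedia tooling.

```python
import re

# Matches the cookbook entries as they appear in this log, e.g.
# "[14:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) ..."
# The status and exit code are optional because START entries carry neither.
COOKBOOK_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log (?P<user>\S+)@(?P<host>\S+) "
    r"(?P<phase>START|END)(?: \((?P<status>[A-Z]+)\))? - Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=(?P<exit_code>\d+)\))?"
)


def parse_cookbook_events(text):
    """Return one dict per cookbook START/END entry found in the log text."""
    events = []
    for match in COOKBOOK_RE.finditer(text):
        event = match.groupdict()
        if event["exit_code"] is not None:
            event["exit_code"] = int(event["exit_code"])
        events.append(event)
    return events


def failed_runs(events):
    """Keep only END entries whose status is anything other than PASS."""
    return [e for e in events if e["phase"] == "END" and e["status"] != "PASS"]


# Sample text copied from entries in this log.
sample = (
    "[00:04:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage "
    "(exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye "
    "[14:54:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime "
    "for 1:00:00 on db2156.codfw.wmnet with reason: Maintenance "
    "[14:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime "
    "(exit_code=0) for 1:00:00 on db2156.codfw.wmnet with reason: Maintenance"
)

events = parse_cookbook_events(sample)
failures = failed_runs(events)
for failure in failures:
    print(failure["time"], failure["user"], failure["name"], failure["exit_code"])
```

The helmfile entries (`!log otto@deploy1002 helmfile [staging] START ...`) are deliberately not matched, since they use a different layout.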
[15:05:22] (03PS6) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [15:05:56] (03CR) 10Clément Goubert: modules: Add preStop sleep and draining to mesh (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [15:06:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes: Bump envoy image version to 1.18.3-2-s2 [puppet] - 10https://gerrit.wikimedia.org/r/929678 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [15:06:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: services: disable slapd mirror mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929687 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [15:07:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/929695 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [15:07:51] (03PS3) 10JHathaway: dev env: don't manage resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) [15:07:59] (03Merged) 10jenkins-bot: recommendation-api: restbase isn't used for anything [deployment-charts] - 10https://gerrit.wikimedia.org/r/929695 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [15:08:31] (03CR) 10Hashar: "I am not sure what is happening with the Puppet compiler, it fails even against the production branch:" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:09:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:10:08] (03PS7) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [15:10:47] (03PS2) 10KartikMistry: testwiki: Enable Section Translation for 3 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929621 (https://phabricator.wikimedia.org/T338123) [15:11:39] (03CR) 10JHathaway: [C: 03+2] dev env: don't manage resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:12:10] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10cmooney) >>! In T338778#8927735, @Papaul wrote: > @cmooney is it possible when moving servers from asw to cloudsw try to connect... [15:12:20] (03PS1) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [15:12:59] (03CR) 10Dzahn: "this probably needs a similar adjustment as the $month parameter, to allow an array or multiple recipients some way. I will look at that b" [puppet] - 10https://gerrit.wikimedia.org/r/929714 (https://phabricator.wikimedia.org/T338955) (owner: 10Aklapper) [15:13:09] (03CR) 10JHathaway: [C: 03+2] apt: ensure profile::apt is applied before packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [15:13:10] PROBLEM - cinder-volume process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:13:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:14:04] (03CR) 10Alexandros Kosiaris: modules: Add preStop sleep and draining to mesh (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [15:14:07] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [15:14:15] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [15:14:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:14:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:14:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:14:50] (03CR) 10Dzahn: [C: 03+1] doc: Clean up leftover bits from switch to quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [15:14:52] !log deploying refinery source [15:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:58] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [15:16:05] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [15:16:15] (03CR) 10Herron: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [15:17:28] (03CR) 10Ottomata: [C: 03+1] Declare mediawiki.page_outlink_topic_prediction_change.v1 stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:18:30] (03PS8) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [15:19:04] (03CR) 10Arturo Borrero Gonzalez: Add a define to declare an nftables set in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:19:06] (03PS9) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [15:19:08] (03CR) 10Clément Goubert: modules: Add preStop sleep and draining to mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [15:19:41] (03PS1) 10MVernon: hiera: remove ms-be104[0-3] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/929730 (https://phabricator.wikimedia.org/T335281) [15:19:57] (03CR) 10CI reject: [V: 04-1] modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [15:20:40] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [15:20:45] (03PS1) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [15:20:59] (03PS1) 10Muehlenhoff: Install Linux 5.10 for snapshot1016/1017 [puppet] - 10https://gerrit.wikimedia.org/r/929732 (https://phabricator.wikimedia.org/T334955) [15:21:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1149'] [15:21:22] (03CR) 10Hashar: fix-staging-perms: set group name from Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:21:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [15:23:22] (03PS2) 10Muehlenhoff: Install Linux 5.10 for snapshot1016/1017 [puppet] - 10https://gerrit.wikimedia.org/r/929732 (https://phabricator.wikimedia.org/T334955) [15:23:43] (03PS1) 10Herron: pyrra: deploy to alert/thanos-fe hosts [puppet] - 10https://gerrit.wikimedia.org/r/929734 (https://phabricator.wikimedia.org/T302995) [15:26:29] (03CR) 10Muehlenhoff: [C: 03+2] Install Linux 5.10 for snapshot1016/1017 [puppet] - 10https://gerrit.wikimedia.org/r/929732 (https://phabricator.wikimedia.org/T334955) (owner: 10Muehlenhoff) [15:26:38] PROBLEM - cinder-volume process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:27:06] (03CR) 10Dzahn: [C: 03+1] miscweb: add transparencyreport release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/929667 (https://phabricator.wikimedia.org/T338781) (owner: 10Jelto) [15:27:28] (WidespreadPuppetFailure) firing: Puppet has failed in eqiad - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:27:45] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host snapshot1016.eqiad.wmnet with OS buster [15:27:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations, 10Patch-For-Review: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host snapshot1016.eqiad.wmnet with OS bus... [15:28:01] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [15:28:17] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [15:28:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149'] [15:28:22] (03PS10) 10Clément Goubert: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [15:28:28] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [15:28:35] (03CR) 10Dzahn: [C: 03+2] "yep, contint1001 is for sure not in prod. And doesn't seem like you saw failing backups on contint1002 or contint2001, did you?" 
[puppet] - 10https://gerrit.wikimedia.org/r/867231 (owner: 10Jcrespo) [15:28:54] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [15:32:42] RECOVERY - cinder-volume process on cloudcontrol1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:33:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [15:34:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [15:34:19] RECOVERY - cinder-volume process on cloudcontrol1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:34:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [15:34:44] RECOVERY - cinder-volume process on cloudcontrol1006 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:34:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 
04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1149'] [15:35:44] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:36:23] (03PS2) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [15:36:55] 10SRE, 10Infrastructure-Foundations, 10netops: Peering: prefer primary IXP for direcly connected networks - https://phabricator.wikimedia.org/T338201 (10ayounsi) 05Open→03Resolved a:03ayounsi Tested in eqsin, traffic is now balanced more equally between all 3 IXPs. Same for ulsfo. [15:37:45] (03PS3) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [15:39:51] (03PS4) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [15:42:14] (03CR) 10CI reject: [V: 04-1] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:43:38] (03CR) 10Dzahn: [C: 03+2] zuul: add a gerrit-reporter gerrit connection [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [15:45:12] (03PS5) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [15:45:25] !log Deployed refinery-source using jenkins [15:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:28] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:48:16] (03PS2) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [15:49:02] (03PS1) 10Jbond: profile::base: use hiera instead of the envioronment for flow control [puppet] - 10https://gerrit.wikimedia.org/r/929737 (https://phabricator.wikimedia.org/T337972) [15:51:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [15:52:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:52:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:13] (03CR) 10Jbond: "I added some text to the phab task i think best to talk about it there" [puppet] - 10https://gerrit.wikimedia.org/r/929737 (https://phabricator.wikimedia.org/T337972) (owner: 10Jbond) [15:55:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [15:55:30] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: fix pdns query-local-adress [puppet] - 10https://gerrit.wikimedia.org/r/929739 (https://phabricator.wikimedia.org/T338938) [15:55:34] 
(03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) [15:55:55] (03PS1) 10JHathaway: Revert "apt: ensure profile::apt is applied before packages." [puppet] - 10https://gerrit.wikimedia.org/r/929654 [15:56:47] (03CR) 10Dzahn: [C: 03+1] "(Exec[apt-get update] => Class[Apt] => Class[Profile::Apt] => Package[openjdk-11-jdk] => Java::Package[openjdk-jdk-11] => Class[Java] => J" [puppet] - 10https://gerrit.wikimedia.org/r/929654 (owner: 10JHathaway) [15:56:55] jhathaway: is the revert related to the WidespreadPuppetFailure? I was just about to look at that [15:57:03] (03CR) 10Hashar: [C: 03+1] "That caused a dependency cycle when using Apt::Package_from_component on contint1002 :(" [puppet] - 10https://gerrit.wikimedia.org/r/929654 (owner: 10JHathaway) [15:57:42] jbond: most likely, sorry [15:57:59] jbond: I think it's all hosts using apt::package_from_component [15:58:07] jhathaway: ack, just caught up in m_sec; let me know if you need a hand with anything [15:58:15] * jhathaway goes to look for WidespreadPuppetFailure [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:26] which conflicts with the regular apt class, if that change is made in base [15:58:38] (03CR) 10JHathaway: [C: 03+2] Revert "apt: ensure profile::apt is applied before packages."
[puppet] - 10https://gerrit.wikimedia.org/r/929654 (owner: 10JHathaway) [15:58:44] (03PS1) 10KartikMistry: Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929741 (https://phabricator.wikimedia.org/T337669) [15:58:48] I would say revert and then we can run puppet via cumin on all using that class [15:59:00] I can take contint, already on them [15:59:07] ok, reverted [15:59:28] runs puppet on contint1002, contint2001 [15:59:29] (03CR) 10CI reject: [V: 04-1] Enable Content and Section Translation for a 2nd group of 9 languages previously lacking machine translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929741 (https://phabricator.wikimedia.org/T337669) (owner: 10KartikMistry) [15:59:54] error is gone, confirmed [16:00:05] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
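Note: the revert above was prompted by a dependency cycle in Puppet's resource ordering. The following is a rough sketch of how such a cycle can be found with a depth-first search, assuming each `A => B` relationship is an edge meaning "A is applied before B". The first two edges are taken from the chain quoted in the review comment; `Package[example]` and the edge back to `Class[Apt]` are hypothetical stand-ins, since the full cycle is truncated in the log. This is not how Puppet itself reports cycles, only an illustration of the idea.

```python
def find_cycle(edges):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)
        graph.setdefault(after, [])

    # WHITE = not visited, GREY = on the current DFS path, BLACK = fully explored.
    WHITE, GREY, BLACK = 0, 1, 2
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == GREY:
                # Reaching a node already on the current path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


edges = [
    ("Exec[apt-get update]", "Class[Apt]"),
    ("Class[Apt]", "Class[Profile::Apt]"),
    ("Class[Profile::Apt]", "Package[example]"),  # hypothetical package resource
    ("Package[example]", "Class[Apt]"),           # hypothetical edge closing the cycle
]

cycle = find_cycle(edges)
print(" => ".join(cycle) if cycle else "no cycle")
```

Dropping the last edge makes `find_cycle` return `None`, which mirrors the effect of the revert: removing the added ordering constraint removes the cycle.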
[16:01:04] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: fix pdns query-local-adress [puppet] - 10https://gerrit.wikimedia.org/r/929739 (https://phabricator.wikimedia.org/T338938) [16:01:06] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) [16:02:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:02:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:02:42] (03CR) 10Jbond: "Sorry i just relised i never hit send" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [16:03:23] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: fix pdns query-local-adress [puppet] - 10https://gerrit.wikimedia.org/r/929739 (https://phabricator.wikimedia.org/T338938) [16:03:25] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) [16:06:03] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10akosiaris) @MatthewVernon We probably want to store those models in Swift. Would you like us to file a different... 
[16:06:16] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/929739/41702/" [puppet] - https://gerrit.wikimedia.org/r/929739 (https://phabricator.wikimedia.org/T338938) (owner: Arturo Borrero Gonzalez) [16:07:28] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:07:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [16:07:43] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [16:08:06] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] openstack: codfw1dev: designate: fix pdns query-local-adress [puppet] - https://gerrit.wikimedia.org/r/929739 (https://phabricator.wikimedia.org/T338938) (owner: Arturo Borrero Gonzalez) [16:12:02] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:13:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS buster [16:13:07] SRE, ops-eqiad, DC-Ops, Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1017.eqiad.wmnet with OS buster [16:14:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:14] (PS1) Albertoleoncio: Enable mobile page tabs for everyone in ptwikisource. [mediawiki-config] - https://gerrit.wikimedia.org/r/929742 (https://phabricator.wikimedia.org/T338974) [16:19:07] (CR) Dzahn: [C: +1] admin: add all miscweb domains as extra SANs [deployment-charts] - https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: Jelto) [16:19:11] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:19:34] SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2023-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (Pginer-WMF) >>! In T335491#8928309, @akosiaris wrote: > * Updating frequency: Never has happened up to now, alrea... [16:19:48] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:20:45] (CR) Dzahn: [C: +1] "double checked, the list looks complete to me for this purpose. os_reports and query.wikidata are going to be special cases." 
[deployment-charts] - https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: Jelto) [16:23:58] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (taavi) The current implementation on codfw1dev seems to have forgotten that the recursors need outbound acc... [16:24:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:24:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:24:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:24:31] (PS1) Aklapper: Change type of 'age-factor-decay' from non-existing float to wild [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/929744 (https://phabricator.wikimedia.org/T338970) [16:24:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:24:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:25:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:25:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:25:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:25:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:25:52] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:26:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:26:11] (PS3) Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) [16:26:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:26:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:27:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:27:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:27:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:27:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1222.eqiad.wmnet with reason: Maintenance [16:28:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1222.eqiad.wmnet with reason: Maintenance [16:28:03] (PS1) Jdlrobson: Ignore the JSFiddle article in alerts [puppet] - https://gerrit.wikimedia.org/r/929745 (https://phabricator.wikimedia.org/T338963) [16:28:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 
db1225.eqiad.wmnet with reason: Maintenance [16:28:14] (CR) Dzahn: [C: +1] "oh, wait, maybe I was just too pessimistic and it already works" [puppet] - https://gerrit.wikimedia.org/r/929714 (https://phabricator.wikimedia.org/T338955) (owner: Aklapper) [16:28:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1225.eqiad.wmnet with reason: Maintenance [16:28:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:29:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:29:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:29:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:30:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:30:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:31:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:31:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:31:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:31:37] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:32:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:32:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:12] (CR) Jbond: gitlab: make oauth client identifier configurable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/929718 (https://phabricator.wikimedia.org/T320390) (owner: Jelto) [16:32:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:32:23] (PS4) Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T338210) [16:32:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [16:32:42] SRE, Infrastructure-Foundations, netops, Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (aborrero) [16:32:50] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Stalled→Open [16:32:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [16:33:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:33:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:33:42] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2175.codfw.wmnet with reason: Maintenance [16:34:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2175.codfw.wmnet with reason: Maintenance [16:35:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:35:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:35:18] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/929714/41704/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/929714 (https://phabricator.wikimedia.org/T338955) (owner: Aklapper) [16:35:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2107.codfw.wmnet with reason: Maintenance [16:35:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2107.codfw.wmnet with reason: Maintenance [16:36:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1138.eqiad.wmnet with reason: Maintenance [16:36:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1138.eqiad.wmnet with reason: Maintenance [16:36:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1141.eqiad.wmnet with reason: Maintenance [16:37:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1141.eqiad.wmnet with reason: Maintenance [16:37:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS buster [16:37:25] (CR) Jbond: "recheck" [software/homer] - 
https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: Ayounsi) [16:37:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:37:33] SRE, ops-eqiad, DC-Ops, Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1017.eqiad.wmnet with OS buster [16:37:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:37:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:37:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:37:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:38:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:38:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:38:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:38:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:38:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:38:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1147.eqiad.wmnet with reason: Maintenance [16:38:49] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1147.eqiad.wmnet with reason: Maintenance [16:38:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:39:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:39:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1149.eqiad.wmnet with reason: Maintenance [16:39:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1149.eqiad.wmnet with reason: Maintenance [16:39:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:39:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:39:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1190.eqiad.wmnet with reason: Maintenance [16:40:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1190.eqiad.wmnet with reason: Maintenance [16:40:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:40:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:40:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:40:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:40:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:41:05] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:41:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:41:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:41:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2099.codfw.wmnet with reason: Maintenance [16:41:20] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] [16:41:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2099.codfw.wmnet with reason: Maintenance [16:41:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:41:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:41:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:42:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:42:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2119.codfw.wmnet with reason: Maintenance [16:42:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2119.codfw.wmnet with reason: Maintenance [16:42:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2136.codfw.wmnet with reason: Maintenance [16:42:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2136.codfw.wmnet with reason: Maintenance [16:42:43] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:42:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:42:57] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8928378, @taavi wrote: > The current implementation on codfw1dev seems to have forg... [16:43:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:43:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:43:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:43:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:43:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance [16:43:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance [16:43:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:43:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:43:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: Maintenance 
[16:44:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:44:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:44:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:44:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:45:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:45:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:45:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2140.codfw.wmnet with reason: Maintenance [16:45:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2140.codfw.wmnet with reason: Maintenance [16:45:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:45:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:45:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:46:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance [16:46:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance [16:46:17] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance [16:46:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance [16:46:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:46:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:46:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:47:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:47:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:47:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:47:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:47:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:47:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:47:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:47:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:48:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:48:05] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:48:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:48:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:48:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:48:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1196.eqiad.wmnet with reason: Maintenance [16:48:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1196.eqiad.wmnet with reason: Maintenance [16:48:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:49:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:49:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:49:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:49:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1207.eqiad.wmnet with reason: Maintenance [16:49:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1207.eqiad.wmnet with reason: Maintenance [16:49:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1218.eqiad.wmnet with reason: Maintenance [16:50:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1218.eqiad.wmnet with reason: Maintenance 
[16:50:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:50:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:50:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:50:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:50:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2102.codfw.wmnet with reason: Maintenance [16:50:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2102.codfw.wmnet with reason: Maintenance [16:50:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:51:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:51:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:51:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:51:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:51:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:52:00] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2141.codfw.wmnet with reason: Maintenance [16:52:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2141.codfw.wmnet with reason: Maintenance [16:52:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2145.codfw.wmnet with reason: Maintenance [16:52:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2145.codfw.wmnet with reason: Maintenance [16:52:30] (PS1) Papaul: Add snapshot101[67] [puppet] - https://gerrit.wikimedia.org/r/929767 (https://phabricator.wikimedia.org/T334955) [16:52:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2146.codfw.wmnet with reason: Maintenance [16:52:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2146.codfw.wmnet with reason: Maintenance [16:52:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2153.codfw.wmnet with reason: Maintenance [16:53:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [16:53:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: Maintenance [16:53:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:53:25] (CR) Papaul: [C: +2] Add snapshot101[67] [puppet] - https://gerrit.wikimedia.org/r/929767 (https://phabricator.wikimedia.org/T334955) (owner: Papaul) [16:53:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:53:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:53:35] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:53:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:53:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:53:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:53:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:54:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:54:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:54:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:54:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:54:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:54:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:54:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2112.codfw.wmnet with reason: Maintenance [16:54:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2112.codfw.wmnet with reason: Maintenance [16:56:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [16:56:11] SRE, 
10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) @aborrero I discussed the idea of a [[ https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Service... [16:59:34] (03PS1) 10Ladsgroup: Make old_links retrieval cleaner [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929655 [16:59:47] jouncebot: nowandnext [16:59:48] For the next 0 hour(s) and 0 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1600) [16:59:48] In 0 hour(s) and 0 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1700) [16:59:57] (03CR) 10Ladsgroup: [C: 03+2] Make old_links retrieval cleaner [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929655 (owner: 10Ladsgroup) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1700) [17:02:34] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [17:03:34] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:23] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] (duration: 24m 03s) [17:06:06] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] [17:06:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:13:58] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] (duration: 07m 51s) [17:17:38] (03Merged) 10jenkins-bot: Make old_links retrieval cleaner [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929655 (owner: 10Ladsgroup) [17:19:34] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikidiff2 for TheresNoTime - https://phabricator.wikimedia.org/T338948 (10KSiebert) @TheresNoTime Approving! 
[17:20:15] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f] (thin): Regular analytics weekly train THIN [analytics/refinery@c337e2f] [17:20:20] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f] (thin): Regular analytics weekly train THIN [analytics/refinery@c337e2f] (duration: 00m 04s) [17:20:21] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c337e2f] [17:20:28] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929655|Make old_links retrieval cleaner]] [17:22:00] (03CR) 10BCornwall: [C: 03+2] prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929421 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [17:22:05] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c337e2f] (duration: 01m 43s) [17:22:14] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929655|Make old_links retrieval cleaner]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [17:22:36] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] - to stat1009 [17:22:38] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] - to stat1009 (duration: 00m 02s) [17:27:03] (03PS1) 10BCornwall: prometheus: Add global_cert_name to Envoy config [puppet] - 10https://gerrit.wikimedia.org/r/929768 (https://phabricator.wikimedia.org/T301944) [17:27:34] !log otto@deploy1002 Started deploy [analytics/refinery@c337e2f]: Regular analytics weekly train [analytics/refinery@c337e2f] - to stat1009f [17:28:57] I caused a break in puppet, rectifying now [17:28:59] !log otto@deploy1002 Finished deploy [analytics/refinery@c337e2f]: 
Regular analytics weekly train [analytics/refinery@c337e2f] - to stat1009f (duration: 01m 25s) [17:30:27] (03PS1) 10BCornwall: Revert "prometheus: Disable SNI support in Envoy tlsproxy" [puppet] - 10https://gerrit.wikimedia.org/r/929656 [17:31:09] (03CR) 10BCornwall: [C: 03+2] Revert "prometheus: Disable SNI support in Envoy tlsproxy" [puppet] - 10https://gerrit.wikimedia.org/r/929656 (owner: 10BCornwall) [17:31:58] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41705/console" [puppet] - 10https://gerrit.wikimedia.org/r/929768 (https://phabricator.wikimedia.org/T301944) (owner: 10BCornwall) [17:32:02] Fixed. Apologies for the inconvenience ._. [17:38:38] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929655|Make old_links retrieval cleaner]] (duration: 18m 09s) [17:50:37] (03PS1) 10Ladsgroup: Retrieve external links from PreparedUpdate [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929657 (https://phabricator.wikimedia.org/T65632) [17:50:50] jouncebot: nowandnext [17:50:50] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1700) [17:50:51] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1800) [17:55:03] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host snapshot1016.eqiad.wmnet with OS buster [17:55:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... 
[17:55:32] PROBLEM - Check systemd state on snapshot1016 is CRITICAL: CRITICAL - degraded: The following units failed: mnt-dumpsdata.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [17:56:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster [17:56:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [17:58:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [18:00:04] jnuche and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T1800). 
[18:02:29] (03PS2) 10BCornwall: prometheus: Disable SNI support in Envoy tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/929768 (https://phabricator.wikimedia.org/T301944) [18:05:20] (03PS1) 10Papaul: Add snapshot101[67] to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/929772 (https://phabricator.wikimedia.org/T334955) [18:06:06] (03CR) 10Papaul: [C: 03+2] Add snapshot101[67] to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/929772 (https://phabricator.wikimedia.org/T334955) (owner: 10Papaul) [18:06:52] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:22] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:08:24] (03CR) 10Ladsgroup: [C: 03+2] Retrieve external links from PreparedUpdate [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929657 (https://phabricator.wikimedia.org/T65632) (owner: 10Ladsgroup) [18:13:40] RECOVERY - Host mw1492 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:13:46] PROBLEM - Check systemd state on mw1492 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:08] 10SRE: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10Dzahn) [18:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:16] RECOVERY - Check systemd state on mw1492 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:22:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:23:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:23:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1016.eqiad.wmnet with OS buster [18:23:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster completed: - sn... 
[18:24:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:26:57] (03PS2) 10Cwhite: Ignore the JSFiddle article in alerts [puppet] - 10https://gerrit.wikimedia.org/r/929745 (https://phabricator.wikimedia.org/T338963) (owner: 10Jdlrobson) [18:27:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [18:27:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [18:27:17] T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 [18:27:28] 10SRE: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (10Dzahn) [18:27:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:28:00] (03CR) 10Cwhite: [C: 03+2] Ignore the JSFiddle article in alerts [puppet] - 10https://gerrit.wikimedia.org/r/929745 (https://phabricator.wikimedia.org/T338963) (owner: 10Jdlrobson) [18:29:07] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:11] !log root@clouddb1021.eqiad.wmnet[ruwikinews]> ALTER TABLE externallinks ROW_FORMAT=COMPRESSED; (T337961) [18:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:41] !log ganeti1028 - deleting VM people1004 [18:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:03] PROBLEM - mediawiki-installation DSH group on mw1492 is CRITICAL: Host mw1492 is not in 
mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:30:07] !log ganeti1028 - deleting VM people2003 [18:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:51] !log ganeti2021 - deleting VM people2003 [18:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] (03Merged) 10jenkins-bot: Retrieve external links from PreparedUpdate [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929657 (https://phabricator.wikimedia.org/T65632) (owner: 10Ladsgroup) [18:32:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:929657|Retrieve external links from PreparedUpdate (T65632 T264104)]] [18:32:41] T264104: Verify AbuseFilter code that claims to share and re-use ParserOutput from core - https://phabricator.wikimedia.org/T264104 [18:32:41] T65632: AbuseFilter *_links variables incorrect on item edits. - https://phabricator.wikimedia.org/T65632 [18:33:35] (03PS1) 10Dzahn: site: add people1004/people2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/929776 (https://phabricator.wikimedia.org/T338827) [18:34:10] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:929657|Retrieve external links from PreparedUpdate (T65632 T264104)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:35:40] !log grafana2001 - apt-get clean [18:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:22] (03CR) 10Dzahn: [C: 03+2] site: add people1004/people2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/929776 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [18:43:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:43:02] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1017.eqiad.wmnet with OS buster [18:43:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1017.eqiad.wmnet with OS buster completed: - sn... [18:43:44] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people2003.codfw.wmnet [18:43:46] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:44:09] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:44:55] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:929657|Retrieve external links from PreparedUpdate (T65632 T264104)]] (duration: 12m 18s) [18:45:00] T264104: Verify AbuseFilter code that claims to share and re-use ParserOutput from core - https://phabricator.wikimedia.org/T264104 [18:45:00] T65632: AbuseFilter *_links variables incorrect on item edits. - https://phabricator.wikimedia.org/T65632 [18:45:35] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) [18:46:22] (03PS1) 10Ssingh: Release 3.99.0~alpha2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/929779 [18:46:27] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338904 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Server back up after main board replacement [18:46:29] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) to replace existing VMs on bullseye with new VMs on bookworm and then delete the bullseye VMs. 
people1004 and people2003 to replace people1003 and people2002 -> service i... [18:46:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:46:51] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) [18:47:04] !log root@clouddb1021.eqiad.wmnet[commonswiki]> ALTER TABLE externallinks ROW_FORMAT=COMPRESSED; (T337961) [18:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:08] T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 [18:47:40] (03PS2) 10Ssingh: Release 3.99.0~alpha2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/929779 [18:47:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Jclark-ctr) 05Open→03Resolved Replaced main board. Server is back up now @elukey [18:47:47] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) ` dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 80 --network private --cluster codfw --group D people2003 --os bookworm Ready to create Gan... [18:48:00] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001" [18:50:37] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10Jclark-ctr) a:03Jclark-ctr [18:50:39] 10SRE, 10ops-eqiad, 10database-backups: db1139 rebooted - https://phabricator.wikimedia.org/T338766 (10Jclark-ctr) @jcrespo is server down? i would need to open it up to verify memory speed but should have something to get it up. 
if it is down can i can address 1st thing tomorrow for my morning [18:52:14] (03CR) 10BBlack: [C: 03+1] Release 3.99.0~alpha2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/929779 (owner: 10Ssingh) [18:52:39] (03CR) 10Ssingh: [C: 03+2] Release 3.99.0~alpha2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/929779 (owner: 10Ssingh) [19:08:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people2003.codfw.wmnet - dzahn@cumin1001" [19:08:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:00] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people2003.codfw.wmnet on all recursors [19:08:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people2003.codfw.wmnet on all recursors [19:08:29] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001" [19:08:32] !log dns4004: downtiming and stopping agent for a bit, to test some new software [19:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:17] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:23] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people2003.codfw.wmnet - dzahn@cumin1001" [19:12:38] (03PS1) 10Cwhite: hiera: actually delete chunks from loki [puppet] - 10https://gerrit.wikimedia.org/r/929748 (https://phabricator.wikimedia.org/T335610) 
[19:12:40] (03PS1) 10Cwhite: hiera: actually delete chunks from loki [puppet] - 10https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) [19:12:57] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:13:03] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:21] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:13:36] ^ expected [19:14:07] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:18] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10BTullis) @KCVelaga - I think I know what the cause of this error is. Are you, by any chance, trying the run the command from a shell within a Jupyt... 
[19:16:32] (03CR) 10Cwhite: [C: 03+2] hiera: actually delete chunks from loki [puppet] - 10https://gerrit.wikimedia.org/r/929748 (https://phabricator.wikimedia.org/T335610) (owner: 10Cwhite) [19:17:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people2003.codfw.wmnet with OS bookworm [19:17:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:27] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:17:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:37] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:53] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:18:06] !log dns4004 - downtime removed, agent back to normal, etc [19:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) [19:18:58] inflatador: You got the wdqs issues? 
[19:20:00] (03PS1) 10BBlack: nop commit to test updates [dns] - 10https://gerrit.wikimedia.org/r/929783 [19:20:04] (03CR) 10Effie Mouzeli: [C: 03+2] iPoid: update ENV variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/929711 (owner: 10Effie Mouzeli) [19:21:16] (03Merged) 10jenkins-bot: iPoid: update ENV variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/929711 (owner: 10Effie Mouzeli) [19:22:27] 10SRE: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10Dzahn) I went on and deleted the VMs manually with "gnt-instance remove" on the ganeti masters. Then I ran the cookbook again. This is what happened during that run. It said now that it DID find the host... [19:22:31] (03CR) 10BBlack: [C: 03+2] nop commit to test updates [dns] - 10https://gerrit.wikimedia.org/r/929783 (owner: 10BBlack) [19:22:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:24:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff thanks the patch works @ArielGlenn I had to change the role to insetup::core_platform since puppet... [19:25:49] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [19:27:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:33:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on people2003.codfw.wmnet with reason: host reimage [19:38:27] 10SRE: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10Dzahn) After this we are back to: ` [1/50, retrying in 3.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title... [19:38:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on people2003.codfw.wmnet with reason: host reimage [19:39:03] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) 05Open→03Stalled stalling on T338986 [19:43:16] 10SRE: makevm cookbook should remove VMs if OS install fails - https://phabricator.wikimedia.org/T338986 (10Dzahn) nevermind, it does continue after about the 12th attempt: ` Found Nagios_host resource for this host in PuppetDB ` [19:48:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:50:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.864 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:50:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host people2003.codfw.wmnet with OS bookworm [19:50:46] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.ganeti.makevm (exit_code=0) for new host people2003.codfw.wmnet [19:53:21] (03PS1) 10Bartosz Dziewoński: Exclude after-aligned tools when creating target widgets [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929658 (https://phabricator.wikimedia.org/T338978) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230613T2000). [20:00:05] ebernhardson and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] i can deploy today [20:00:27] hi [20:00:37] hi [20:00:56] MatmaRex: i assume i need to do backport first and then script? [20:01:03] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1004.eqiad.wmnet [20:01:04] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:01:05] or not, since it's JS? [20:01:35] they're unrelated [20:01:42] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) 05Stalled→03In progress ` dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 80 --network private --cluster eqiad --group D people1004 --os... [20:01:45] so, any order you want [20:02:01] (03CR) 10Urbanecm: [C: 03+2] Exclude after-aligned tools when creating target widgets [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929658 (https://phabricator.wikimedia.org/T338978) (owner: 10Bartosz Dziewoński) [20:02:02] sounds good [20:04:13] ebernhardson: hi! ad your patch, do you want me to deploy it for you, or should i ping you once done? [20:05:38] (03PS1) 10JHathaway: apt: ensure profile::apt is applied before packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) [20:07:56] (03CR) 10CI reject: [V: 04-1] apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [20:07:58] MatmaRex: just double checking, i'm running `foreachwikiindblist 'group2 & s7' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all`, is that right? [20:08:36] urbanecm: yes, thanks [20:08:45] okay, starting. [20:08:57] (03CR) 10JHathaway: "second attempt, updated rationale in the phab task and inline the code." [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [20:09:29] !log Start `foreachwikiindblist 'group2 & s7' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all` in a tmux session on mwmaint1002 (T315510) [20:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:33] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:14:04] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [20:15:28] (03PS1) 10Dzahn: site: add microsites::peopleweb to new people VMs [puppet] - 10https://gerrit.wikimedia.org/r/929788 (https://phabricator.wikimedia.org/T338827) [20:15:51] (03CR) 10CI reject: [V: 04-1] site: add microsites::peopleweb to new people VMs [puppet] - 10https://gerrit.wikimedia.org/r/929788 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [20:16:47] (03PS1) 10Dzahn: peopleweb: make people1004 a new destination sync host [puppet] - 10https://gerrit.wikimedia.org/r/929789 (https://phabricator.wikimedia.org/T338827) [20:16:58] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1004.eqiad.wmnet - dzahn@cumin1001" [20:16:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:16:58] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache people1004.eqiad.wmnet on all recursors [20:17:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1004.eqiad.wmnet on all recursors [20:17:05] (03CR) 10CI reject: [V: 04-1] peopleweb: make people1004 a new destination sync host [puppet] - 10https://gerrit.wikimedia.org/r/929789 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [20:17:26] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001" [20:18:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1004.eqiad.wmnet - dzahn@cumin1001" [20:19:53] (03Merged) 10jenkins-bot: Exclude after-aligned tools when creating target widgets [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929658 (https://phabricator.wikimedia.org/T338978) (owner: 10Bartosz Dziewoński) [20:20:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:929658|Exclude after-aligned tools when creating target widgets (T338978)]] [20:20:56] T338978: Image and Media upload fails in Beta cluster - https://phabricator.wikimedia.org/T338978 [20:21:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host people1004.eqiad.wmnet with OS bookworm [20:22:24] !log urbanecm@deploy1002 matmarex and urbanecm: Backport for [[gerrit:929658|Exclude after-aligned tools when creating target widgets (T338978)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:22:39] MatmaRex: your patch is at mwdebug1002 -- can you test? [20:23:06] looking [20:23:29] urbanecm: looks good [20:23:33] thanks, proceeding [20:29:02] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:929658|Exclude after-aligned tools when creating target widgets (T338978)]] (duration: 08m 10s) [20:29:08] T338978: Image and Media upload fails in Beta cluster - https://phabricator.wikimedia.org/T338978 [20:29:18] MatmaRex: should be synced [20:29:21] thanks [20:30:03] np [20:30:14] the maintenance script is still at arwiki [20:30:18] i'll leave it running in the tmux [20:30:41] (fwiw, output is accessible in `/home/urbanecm/matmarex-T315510.log`) [20:30:41] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:31:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on people1004.eqiad.wmnet with reason: host reimage [20:31:41] urbanecm: fyi, the script will probably take a few weeks in total across all the wikis [20:31:46] i haven't really made that clear, sorry [20:31:50] but i hope that's okay [20:32:07] i guessed so from the task. it should be fine, but it might be terminated if something happens (server reboot, whatever) [20:32:08] (you can stop it and start it differently if that changes anything) [20:32:36] yeah. we can restart it then [20:32:45] sounds good [20:33:13] i'm planning to schedule the other sections too, not just s7. 
but starting with this one to make sure it's not causing any issues [20:34:32] good idea :) [20:34:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on people1004.eqiad.wmnet with reason: host reimage [20:37:02] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10Urbanecm_WMF) I can confirm this for both officewiki and community-related private wikis I'm privy to (stewardwiki, checkuserwiki). [20:42:41] (03PS2) 10JHathaway: apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) [20:45:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [20:46:05] urbanecm: doh i was distracted, if you're done i can probably deploy mine [20:46:12] yup, go ahead [20:46:16] thanks [20:46:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929411 (https://phabricator.wikimedia.org/T334194) (owner: 10Ebernhardson) [20:47:43] (03Merged) 10jenkins-bot: cirrus: Enable analysis chain deduplication for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929411 (https://phabricator.wikimedia.org/T334194) (owner: 10Ebernhardson) [20:48:09] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] [20:48:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host people1004.eqiad.wmnet with OS bookworm [20:48:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people1004.eqiad.wmnet [20:48:13] T334194: Optimize the elasticsearch analysis settings for wikibase - https://phabricator.wikimedia.org/T334194 [20:49:21] 10SRE, 
10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs request for people - https://phabricator.wikimedia.org/T338998 (10Dzahn) 05In progress→03Resolved a:03Dzahn [20:49:37] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:49:39] (03PS2) 10Dzahn: site: add microsites::peopleweb to new people VMs [puppet] - 10https://gerrit.wikimedia.org/r/929788 (https://phabricator.wikimedia.org/T338827) [20:51:48] (03CR) 10Dzahn: [C: 03+2] site: add microsites::peopleweb to new people VMs [puppet] - 10https://gerrit.wikimedia.org/r/929788 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [20:53:32] (03PS1) 10Andrew Bogott: nova.conf: pass service-user token to cinder [puppet] - 10https://gerrit.wikimedia.org/r/929791 (https://phabricator.wikimedia.org/T338262) [20:55:19] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:55:46] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] (duration: 07m 36s) [20:55:50] T334194: Optimize the elasticsearch analysis settings for wikibase - https://phabricator.wikimedia.org/T334194 [20:55:57] all done [20:56:12] PROBLEM - Check systemd state on people1004 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:46] PROBLEM - Check systemd state on people2003 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:51] (03CR) 10JHathaway: [C: 03+2] dev env: nrpe listen on all interfaces in a container (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:57:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on people1004.eqiad.wmnet with reason: first setup [20:57:26] mutante: was going to ask you if that was known [20:57:28] (03CR) 10Andrea Denisse: [C: 03+1] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:57:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people1004.eqiad.wmnet with reason: first setup [20:57:38] RECOVERY - Check systemd state on people1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:41] * RhinosF1 takes the downtime as a yes [20:58:09] RhinosF1: yes, happens every time a role is applied first time [20:58:10] (03CR) 10Andrea Denisse: [C: 03+1] pyrra: add pyrra::(api|filesystem) modules [puppet] - 
10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:58:12] RECOVERY - Check systemd state on people2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:39] mutante: cool [20:58:41] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: pass service-user token to cinder [puppet] - 10https://gerrit.wikimedia.org/r/929791 (https://phabricator.wikimedia.org/T338262) (owner: 10Andrew Bogott) [21:03:38] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:04:53] or not so cool, but it fixes itself, yea :) [21:05:09] after the second run.. just that the first fails.. as is common [21:05:41] brett sorry for late reply, but yes [21:05:57] :) [21:06:08] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:06:21] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:07:50] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: use transferpy lib for data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/914018 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [21:12:36] (03CR) 10Bking: [C: 03+2] wdqs: use transferpy lib for data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/914018 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [21:15:40] (03Merged) 10jenkins-bot: wdqs: use transferpy lib for data-transfer.py [cookbooks] - 10https://gerrit.wikimedia.org/r/914018 (https://phabricator.wikimedia.org/T321605) (owner: 10Bking) [21:18:14] (03PS2) 10Dzahn: peopleweb: make people1004 a new destination sync host 
[puppet] - 10https://gerrit.wikimedia.org/r/929789 (https://phabricator.wikimedia.org/T338827) [21:23:37] (03CR) 10Dzahn: [C: 03+2] peopleweb: make people1004 a new destination sync host [puppet] - 10https://gerrit.wikimedia.org/r/929789 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [21:26:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:35:35] PROBLEM - people.wikimedia.org requires authentication on people2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:36:52] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) Hi, @Dzahn my apologies for the delay. It took me a while to get in touch with comms and I was OOO all last week. Here is a general overview document of the project... [21:38:10] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10NMariano-WMF) Comms also let me know that they are ok with using social.wikimedia.org as requested by @Legoktm [21:39:56] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) a:05BCornwall→03None [21:47:20] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Dzahn) Thank you for the link, @NMariano-WMF . It's appreciated. Though I am not sure I see technical things on where and how to host it in there? I would like to suggest what L... [21:49:56] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) Yeah, the document is pretty barren. It sounds like there needs to be a little bit more planning! 
[22:08:15] (03PS10) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [22:08:17] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10Papaul) @Jclark-ctr xe-0/0/18 is configured on the switch but not in netbox and on the switch it has no description so i can not add it to Netbox. Can please check what is connected to that interface? ` papaul@c... [22:11:10] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [22:12:59] 10SRE, 10DNS, 10Domains, 10Traffic: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10Legoktm) >>! In T337586#8929576, @NMariano-WMF wrote: > Hi, @Dzahn my apologies for the delay. It took me a while to get in touch with comms and I was OOO all last week. Here is... 
[22:14:19] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1035'] [22:20:00] (03PS1) 10Dzahn: peopleweb: allow rsyncing user data from people2002 to people2003 [puppet] - 10https://gerrit.wikimedia.org/r/929799 (https://phabricator.wikimedia.org/T338827) [22:21:59] PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3672 MB (3% inode=93%): /tmp 3672 MB (3% inode=93%): /var/tmp 3672 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops [22:28:21] (03PS1) 10Papaul: Update role in site.pp for cloudcephosd1035-1040 [puppet] - 10https://gerrit.wikimedia.org/r/929800 (https://phabricator.wikimedia.org/T324998) [22:29:10] (03CR) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [22:29:27] (03CR) 10Papaul: [C: 03+2] Update role in site.pp for cloudcephosd1035-1040 [puppet] - 10https://gerrit.wikimedia.org/r/929800 (https://phabricator.wikimedia.org/T324998) (owner: 10Papaul) [22:36:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1035'] [22:39:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [22:39:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye [22:40:15] (03PS2) 10Dzahn: peopleweb: allow rsyncing user data from people2002 to people2003 [puppet] - 10https://gerrit.wikimedia.org/r/929799 
(https://phabricator.wikimedia.org/T338827) [23:00:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [23:01:16] (03CR) 10Dzahn: [C: 03+2] peopleweb: allow rsyncing user data from people2002 to people2003 [puppet] - 10https://gerrit.wikimedia.org/r/929799 (https://phabricator.wikimedia.org/T338827) (owner: 10Dzahn) [23:08:37] (03PS8) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [23:11:01] (03CR) 10CI reject: [V: 04-1] Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [23:16:19] RECOVERY - people.wikimedia.org requires authentication on people2003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:22:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:23:19] 10ops-eqiad, 10DC-Ops: hw troubleshooting: disk replacement for an-worker1110.eqiad.wmnet - https://phabricator.wikimedia.org/T336930 (10Jclark-ctr) [23:40:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1035.eqiad.wmnet with OS bullseye [23:57:50] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer