[00:04:41] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:14] (03CR) 10Dzahn: "I wonder if this is possible and easy enough or if there is a blocker." [puppet] - 10https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) (owner: 10Dzahn) [00:23:23] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/output/896194/40068/" [puppet] - 10https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) (owner: 10BryanDavis) [00:30:33] (03PS1) 10Cathal Mooney: Restrict prefix length for public announce, allow bgp for cloud range [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) [00:31:02] (03CR) 10CI reject: [V: 04-1] Restrict prefix length for public announce, allow bgp for cloud range [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [00:33:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [00:33:14] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [00:33:49] (03PS2) 10Cathal Mooney: Restrict prefix length for public announce, allow bgp for cloud range [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) [00:34:49] (03CR) 10Cathal Mooney: [C: 03+2] Restrict prefix length for public announce, allow bgp for cloud range [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [00:35:24] (03Merged) 10jenkins-bot: Restrict prefix length for public announce, allow bgp for 
cloud range [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [00:56:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) I've added the BGP config and moved the GW interfaces from the two CRs i... [01:28:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [01:28:34] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Removed from Puppet and P... [01:44:42] (03CR) 10Eevans: [C: 03+1] swift: bring ms-be1066 sdr1 back into service [puppet] - 10https://gerrit.wikimedia.org/r/896124 (https://phabricator.wikimedia.org/T329305) (owner: 10MVernon) [01:52:02] (03PS1) 10Zabe: Revert "Unload RenameUser, now part of core: Part II of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896032 (https://phabricator.wikimedia.org/T331685) [01:53:16] (03CR) 10Zabe: [C: 03+2] Revert "Unload RenameUser, now part of core: Part II of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896032 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [01:53:58] (03Merged) 10jenkins-bot: Revert "Unload RenameUser, now part of core: Part II of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896032 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [01:54:21] !log zabe@deploy2002 Started scap: T331685 [01:54:27] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [01:55:20] 10SRE, 10Traffic: Upgrade Traffic hosts to 
bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [01:55:33] (03PS1) 10Zabe: Revert "Unload RenameUser, now part of core: Part I of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896033 (https://phabricator.wikimedia.org/T331685) [01:55:49] (03CR) 10Zabe: [C: 03+2] Revert "Unload RenameUser, now part of core: Part I of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896033 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [01:56:35] (03Merged) 10jenkins-bot: Revert "Unload RenameUser, now part of core: Part I of II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896033 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [02:01:49] !log zabe@deploy2002 Finished scap: T331685 (duration: 07m 28s) [02:01:54] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [02:02:03] !log zabe@deploy2002 Started scap: T331685 [02:02:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:05:35] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:09:56] !log zabe@deploy2002 Finished scap: T331685 (duration: 07m 52s)
[02:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:10:01] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685
[02:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:42] SRE, Observability-Logging, Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (RLazarus) Perfect, thank you! I started https://wikitech.wikimedia.org/wiki/Incidents/2023-02-11_logstash_latency and filled in what we know so far (as far as I know, anyw...
[02:58:09] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.435 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:59:47] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:05:42] (PS1) Varnent: T331680 [puppet] - https://gerrit.wikimedia.org/r/896211
[04:08:07] (CR) CI reject: [V: -1] T331680 [puppet] - https://gerrit.wikimedia.org/r/896211 (owner: Varnent)
[04:21:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:07] (PS1) Varnent: T331680 [mediawiki-config] - https://gerrit.wikimedia.org/r/896216
[05:18:33] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Krinkle)
[05:18:40] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Krinkle)
[05:21:52] SRE-swift-storage, MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Krinkle) No mention of a production error in this task.
[05:51:14] (03PS1) 10Varnent: T297396 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 [06:10:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.216 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:47] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Swift [06:23:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:24:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:29:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.388 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:30:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:43] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230310T0700) [07:14:17] (03Abandoned) 10Hashar: scap: disable git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [07:16:29] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) On 
`deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud`: ` Notice: /Stage[ma...
[07:18:11] (CR) Elukey: Move default kubernetes version to 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:20:44] (CR) Elukey: [C: +1] Migrate away from deprecated typology annotations [deployment-charts] - https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: JMeybohm)
[07:22:58] (CR) Elukey: [C: +1] cert-manager: Enable stable certificate request names in staging [deployment-charts] - https://gerrit.wikimedia.org/r/896111 (https://phabricator.wikimedia.org/T304092) (owner: JMeybohm)
[07:30:23] (CR) Gergő Tisza: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: Kosta Harlan)
[07:32:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:50] SRE, DNS, Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Aklapper) @BCornwall: Argh, and you are absolutely right about the future aspect that I somehow failed to realize. Sorry (and thanks). I propose to decl...
[07:54:05] (03CR) 10Muehlenhoff: [C: 03+2] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/895815 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [07:54:40] (03PS1) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) [07:55:20] (03CR) 10Nicolas Fraison: "Still in Progress" [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [07:55:24] (03CR) 10Nicolas Fraison: "Still in Progress" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230310T0800) [08:10:43] 10SRE, 10Infrastructure-Foundations: MIgrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) [08:11:16] 10SRE, 10Infrastructure-Foundations: MIgrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) [08:11:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10MoritzMuehlenhoff) [08:11:25] 10SRE, 10Infrastructure-Foundations: MIgrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:11:59] 10SRE, 10Infrastructure-Foundations: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10MoritzMuehlenhoff) [08:12:17] (03CR) 10Kosta Harlan: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [08:15:48] (03Abandoned) 
10Alexandros Kosiaris: DNM: showcase fixtures for jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/896177 (owner: 10Alexandros Kosiaris) [08:17:33] (03PS1) 10Nicolas Fraison: jobhistory: add prometheus jmx javaagent on prod jobhistory [puppet] - 10https://gerrit.wikimedia.org/r/896305 [08:17:57] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Bugreporter) The subtasks may be more important. Once all per-year wiki are migrated we can redirect all of wikimania20\d\d.wikimedia.org. [08:21:15] (03PS2) 10Nicolas Fraison: jobhistory: add prometheus jmx javaagent on prod jobhistory [puppet] - 10https://gerrit.wikimedia.org/r/896305 [08:36:40] 10SRE, 10Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10MoritzMuehlenhoff) [08:36:49] 10SRE, 10Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:36:56] (03PS4) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [08:36:58] (03PS4) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [08:37:20] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [08:37:27] (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [08:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:38:20] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Aklapper) I propose not to redirect all of wikimania20\d\d.wikimedia.org, see previous comments. [08:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:45:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:45:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:46:33] (03CR) 10Btullis: "Looks good to me in general." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [08:46:34] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [08:46:50] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:50:02] 10SRE, 10Infrastructure-Foundations: Migrate cuminunpriv1001 to Bullseye - https://phabricator.wikimedia.org/T331700 (10MoritzMuehlenhoff) [08:50:10] 10SRE, 10Infrastructure-Foundations: Migrate cuminunpriv1001 to Bullseye - https://phabricator.wikimedia.org/T331700 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:51:46] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [08:56:57] 10SRE: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10MoritzMuehlenhoff) [08:57:06] 10SRE: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:57:13] (03PS9) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) [09:04:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:04:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:04] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40071/console" [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [09:13:43] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [09:34:55] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MatthewVernon) @thcipriani you're the approver for the `deployment` group, are you happy for me to add @Htriedman to it, please? [09:35:50] 10SRE, 10Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye - https://phabricator.wikimedia.org/T331706 (10MoritzMuehlenhoff) [09:36:00] 10SRE, 10Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye - https://phabricator.wikimedia.org/T331706 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:37:21] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10taavi) You sure that's the correct group? The `deploy_airflow` keyholder key can be accessed by the following groups: `lang=yaml # Shared deploy ssh key for Data Engineering maintained # Airflow instan... 
[09:47:53] ops-eqiad, Infrastructure-Foundations: Confirm cable labels and add to Netbox - https://phabricator.wikimedia.org/T331709 (cmooney) p:Triage→Low
[09:49:34] SRE: Migrate Kafka test clusterto Bullseye - https://phabricator.wikimedia.org/T331710 (MoritzMuehlenhoff)
[09:49:46] SRE: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (MoritzMuehlenhoff) p:Triage→Medium
[09:52:32] SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MatthewVernon) Happy to hold off until we're sure what the correct group is :)
[09:54:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:55:07] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[09:55:12] (CR) Muehlenhoff: [C: +1] admin: update sbassett ssh key [puppet] - https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: MVernon)
[09:56:28] (CR) MVernon: [C: +2] admin: update sbassett ssh key [puppet] - https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: MVernon)
[09:57:19] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove netbox-generated DNS records which have been defined manually. - cmooney@cumin1001"
[09:58:08] SRE, SRE-Access-Requests, Patch-For-Review, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (MatthewVernon) In progress→Resolved a:MatthewVernon @sbassett done.
[09:58:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove netbox-generated DNS records which have been defined manually. - cmooney@cumin1001" [09:58:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:59:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:59:29] (03CR) 10MVernon: [C: 03+2] swift: bring ms-be1066 sdr1 back into service [puppet] - 10https://gerrit.wikimedia.org/r/896124 (https://phabricator.wikimedia.org/T329305) (owner: 10MVernon) [09:59:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:01:57] (03Abandoned) 10Majavah: cr-cloud: permit toolsdb return traffic to cloudcontrols [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) [10:02:37] (03PS5) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [10:02:39] (03PS5) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [10:03:56] 10SRE, 10Cassandra, 10Data-Persistence: Migrate cassandra-dev to Bullseye - https://phabricator.wikimedia.org/T331711 (10MoritzMuehlenhoff) [10:04:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:04:44] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [10:13:41] (03PS6) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [10:13:43] (03PS6) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [10:15:54] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10MoritzMuehlenhoff) [10:21:33] zabe: for T331685 I think we can just fix those with fixStuckGlobalRename.php [10:21:34] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [10:21:46] yeah, planned to do that [10:21:58] cool, lmk if I can help [10:23:38] (03PS1) 10Cathal Mooney: Allow cloudsw in codfw to announce 208.80.153.184/29 [homer/public] - 10https://gerrit.wikimedia.org/r/896310 (https://phabricator.wikimedia.org/T327919) [10:24:31] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10MoritzMuehlenhoff) [10:24:35] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate sessionstore servers to Bullseye - https://phabricator.wikimedia.org/T331714 (10MoritzMuehlenhoff) [10:24:55] jouncebot: nowandnext [10:24:55] For the next 21 hour(s) and 35 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230310T0800) [10:24:55] In 21 hour(s) and 35 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230311T0800) [10:25:08] oh yeah, Friday [10:26:40] 10SRE, 10Infrastructure-Foundations, 10netops: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10cmooney) > One takeaway is that we might be able to add IPFIX's fwd_status in Pmacct, Druid and Turnilo that way we can filter on the traffic being sampled but dropped at our bor... [10:26:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:26:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:51] (03CR) 10Cathal Mooney: [C: 03+2] Allow cloudsw in codfw to announce 208.80.153.184/29 [homer/public] - 10https://gerrit.wikimedia.org/r/896310 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [10:28:27] (03Merged) 10jenkins-bot: Allow cloudsw in codfw to announce 208.80.153.184/29 [homer/public] - 10https://gerrit.wikimedia.org/r/896310 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [10:29:12] (03PS2) 10Samtar: Enable VE on more namespaces on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [10:33:19] (03CR) 10Samtar: [C: 03+1] "lgtm, will schedule for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896224 (owner: 10Varnent) [10:33:24] (03CR) 10Samtar: [C: 03+1] "lgtm, will schedule for deploy" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/894094 (https://phabricator.wikimedia.org/T331079) (owner: 10Esanders) [10:35:17] (03PS1) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) [10:36:22] (03CR) 10CI reject: [V: 04-1] api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:37:11] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896314 [10:37:25] (03CR) 10Samtar: [C: 03+1] "lgtm, will schedule for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896216 (owner: 10Varnent) [10:40:32] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'ExplosiveCreeper294' 'NotGalxyGaming' # T331685 [10:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:38] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [10:40:57] (03PS2) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) [10:40:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:26] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'HonzaSTECH' 'ShadyMedic' # T331685 [10:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:48] !log 
zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Mac700' 'Unknown001100' # T331685 [10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:06] (03CR) 10Elukey: "Hugh lemme know what you think about this, so we can figure out the best way to proceed. Adding this kind of config in the route section s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:44:39] (03PS1) 10Jelto: gitlab_runner: fix docker run command in registry service, fix hiera [puppet] - 10https://gerrit.wikimedia.org/r/896316 (https://phabricator.wikimedia.org/T329679) [10:45:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:21] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki --ignorestatus 'ExplosiveCreeper294' 'NotGalxyGaming' # T331685 [10:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [10:48:28] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki --ignorestatus 'HonzaSTECH' 'ShadyMedic' # T331685 [10:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:36] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki --ignorestatus 'Mac700' 'Unknown001100' # T331685 [10:48:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:24] (03CR) 10Elukey: [C: 04-1] "Ah snap the change is not entirely a no-op, fixing it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:51:28] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=arwiki --logwiki=metawiki --ignorestatus 'Reza amjad(iran)' 'رضا امجد (تبریز)' # T331685 [10:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:15] ouch, a few stuck global renames? :/ [10:52:44] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=afwiki --logwiki=metawiki --ignorestatus 'Siniy7' 'Viktorbublik' # T331685 [10:52:45] yep [10:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] (03PS1) 10Majavah: Remove l10nupdate support [puppet] - 10https://gerrit.wikimedia.org/r/896318 [10:54:27] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki --ignorestatus 'Studio 7 Piaseczno Jarosław Zawadzki' 'Jarosław Andrzej Zawadzki (muzyk)' # T331685 [10:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:32] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [10:57:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40072/console" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [10:58:01] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki --ignorestatus 'Tosikuni Japan' 'Revisionist14' # T331685 [10:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:44] !log 
zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=afwiki --logwiki=metawiki --ignorestatus 'Tranquill Komnin' 'Nevechear' # T331685 [10:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:40] (03CR) 10Jelto: [C: 03+2] gitlab_runner: fix docker run command in registry service, fix hiera [puppet] - 10https://gerrit.wikimedia.org/r/896316 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [11:01:54] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki --ignorestatus 'Yair.herman' 'Manor258' # T331685 [11:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:59] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [11:03:27] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki --ignorestatus 'ZSTK Lublin' 'Sonabet4' # T331685 [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:36] (03PS3) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) [11:04:26] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=jawiki --logwiki=metawiki --ignorestatus 'あ ーあーあーあーあー' 'ARIAUSO' # T331685 [11:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:59] that should be it [11:05:43] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete. 
[11:08:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader1004.wikimedia.org with OS bullseye [11:08:05] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader1004.wikimedia.org with OS bullseye [11:11:47] (03PS1) 10Stang: zhwiki: Add movefile to extendedconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896321 (https://phabricator.wikimedia.org/T331691) [11:15:48] (03CR) 10Elukey: "ok now it is a nice no-op :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [11:16:35] (03CR) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [11:19:23] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896314 (owner: 10Muehlenhoff) [11:20:00] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:20:05] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:35:42] !log instaling isc-dhcp bugfix updates from DLA 3326 [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader1004.wikimedia.org with reason: host reimage [11:39:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader1004.wikimedia.org with reason: host reimage [11:42:33] (03PS1) 10Volans: docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 [11:46:35] 
(03CR) 10CI reject: [V: 04-1] docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [11:51:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host urldownloader1004.wikimedia.org with OS bullseye [11:51:53] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader1004.wikimedia.org with OS bullseye completed: - urldownloader1004 (**PASS**) -... [11:52:04] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:54:12] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:55:51] (03PS2) 10Volans: Use builtins GenericAlias objects for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895699 [11:55:53] (03PS2) 10Volans: Use collections.abc GenericAlias for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895700 [11:55:55] (03PS2) 10Volans: setup.py: bump dependencies minimum version [software/spicerack] - 10https://gerrit.wikimedia.org/r/895710 [11:55:57] (03PS2) 10Volans: setup.py: remove upper limit for prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/895711 [11:55:59] (03PS2) 10Volans: doc: dynamically set copyright year to current [software/spicerack] - 10https://gerrit.wikimedia.org/r/895872 [11:56:01] (03PS2) 10Volans: docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 [11:56:03] (03PS1) 10Volans: tox: no bandit request_without_timeout in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/896324 [11:56:05] (03CR) 10Volans: "You can compare the results between:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [11:57:07] (03PS1) 10Muehlenhoff: Apply url_downloader role to urldownloader2004 [puppet] - 
10https://gerrit.wikimedia.org/r/896325 (https://phabricator.wikimedia.org/T329945) [12:05:31] (03CR) 10Volans: [C: 03+2] "Merging to unblock CI on all other CRs. We can revisit later if deemed necessary." [software/spicerack] - 10https://gerrit.wikimedia.org/r/896324 (owner: 10Volans) [12:09:15] (03Merged) 10jenkins-bot: tox: no bandit request_without_timeout in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/896324 (owner: 10Volans) [12:13:00] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:14:43] (03CR) 10Jbond: [C: 03+2] ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [12:14:49] (03PS3) 10Jbond: ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 [12:14:58] (03CR) 10Jbond: [V: 03+2] ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [12:15:08] (03PS1) 10Cathal Mooney: Add includes for new private IP ranges in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/896329 (https://phabricator.wikimedia.org/T327919) [12:15:42] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:15:47] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:15:57] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:15:57] (03CR) 10CI reject: [V: 04-1] Add includes for new private IP ranges in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/896329 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [12:15:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:18:46] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new files for privte loopback ranges codfw. 
- cmooney@cumin1001" [12:20:40] (03CR) 10Jbond: [C: 04-2] R:idp_test create development service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896109 (owner: 10Slyngshede) [12:23:28] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new files for privte loopback ranges codfw. - cmooney@cumin1001" [12:23:29] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:25:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:27:21] (03CR) 10Jbond: [C: 03+1] Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [12:27:41] (03CR) 10Jbond: [C: 03+1] tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 (owner: 10Volans) [12:28:13] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:28:19] (03CR) 10Jbond: [C: 03+1] Add Cumin aliases for IDM [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [12:29:08] (03CR) 10Jbond: [C: 03+1] setup.py: bump dependencies minimum version [software/spicerack] - 10https://gerrit.wikimedia.org/r/895710 (owner: 10Volans) [12:31:49] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new files for privte loopback ranges codfw. 
- cmooney@cumin1001" [12:31:50] (03CR) 10Jbond: [C: 03+1] Use builtins GenericAlias objects for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895699 (owner: 10Volans) [12:31:57] (03CR) 10Jbond: [C: 03+1] Use collections.abc GenericAlias for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895700 (owner: 10Volans) [12:32:08] (03PS2) 10Cathal Mooney: Add includes for new private IP ranges in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/896329 (https://phabricator.wikimedia.org/T327919) [12:32:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new files for privte loopback ranges codfw. - cmooney@cumin1001" [12:32:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:32:56] (03CR) 10Jbond: [C: 03+1] setup.py: remove upper limit for prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/895711 (owner: 10Volans) [12:33:17] (03CR) 10Jbond: [C: 03+1] doc: dynamically set copyright year to current [software/spicerack] - 10https://gerrit.wikimedia.org/r/895872 (owner: 10Volans) [12:38:38] (03CR) 10Jbond: [C: 03+1] "<3 <3 <3" [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [12:41:50] (03PS1) 10Cathal Mooney: Add Homer device entry for cloudsw1-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/896331 (https://phabricator.wikimedia.org/T327919) [12:41:56] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Add the aphlict config on aphlict2001.codfw [puppet] - 10https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:44:27] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Add the aphlict config on aphlict2001.codfw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:46:55] !log installing libsdl2 security updates 
[12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:13] (03PS1) 10Cathal Mooney: Puppet additions for new network device cloudsw1-b1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) [12:47:50] (03CR) 10Cathal Mooney: [C: 03+2] Add Homer device entry for cloudsw1-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/896331 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [12:48:32] (03Merged) 10jenkins-bot: Add Homer device entry for cloudsw1-b-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/896331 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [12:49:04] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919" [12:49:09] T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 [12:50:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919" [12:51:20] (03PS2) 10Cathal Mooney: Puppet additions for new network device cloudsw1-b1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) [12:56:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "the patch LGTM, but I'm not familiar with the semantics and meaning of the different config settings being introduced." [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [12:57:11] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10Sgs) >>! In T330070#8679812, @thcipriani wrote: >>>! In T330070#8667684, @MatthewVernon wrote: >> @thcipriani can I ping you about this approval, please? 
> > Yes, sorry for the delay... [12:57:49] (03PS3) 10Cathal Mooney: Puppet additions for new network device cloudsw1-b1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) [12:58:23] (03PS4) 10Cathal Mooney: Puppet additions for new network device cloudsw1-b1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) [13:12:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:12:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [13:16:57] (03CR) 10Volans: [C: 03+2] Use builtins GenericAlias objects for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895699 (owner: 10Volans) [13:17:02] (03CR) 10Volans: [C: 03+2] Use collections.abc GenericAlias for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895700 (owner: 10Volans) [13:17:11] (03CR) 10Volans: [C: 03+2] setup.py: bump dependencies minimum version [software/spicerack] - 10https://gerrit.wikimedia.org/r/895710 (owner: 10Volans) [13:17:18] (03CR) 10Volans: [C: 03+2] setup.py: remove upper limit for prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/895711 (owner: 10Volans) [13:17:47] (03CR) 10Volans: [C: 03+2] doc: dynamically set copyright year to current [software/spicerack] - 10https://gerrit.wikimedia.org/r/895872 (owner: 10Volans) [13:20:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40073/console" [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:20:53] (03Merged) 10jenkins-bot: Use builtins GenericAlias objects for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895699 (owner: 10Volans) [13:20:56] (03Merged) 10jenkins-bot: Use 
collections.abc GenericAlias for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895700 (owner: 10Volans) [13:20:58] (03Merged) 10jenkins-bot: setup.py: bump dependencies minimum version [software/spicerack] - 10https://gerrit.wikimedia.org/r/895710 (owner: 10Volans) [13:21:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:21:22] (03Merged) 10jenkins-bot: setup.py: remove upper limit for prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/895711 (owner: 10Volans) [13:21:29] (03Merged) 10jenkins-bot: doc: dynamically set copyright year to current [software/spicerack] - 10https://gerrit.wikimedia.org/r/895872 (owner: 10Volans) [13:21:57] (03CR) 10Cathal Mooney: [C: 03+2] Puppet additions for new network device cloudsw1-b1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/896333 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [13:23:12] (03CR) 10Volans: [C: 03+2] Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [13:23:27] (03CR) 10Volans: [C: 03+2] tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 (owner: 10Volans) [13:25:34] (03CR) 10CI reject: [V: 04-1] Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [13:25:36] (03CR) 10CI reject: [V: 04-1] tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 (owner: 10Volans) [13:29:37] (03PS1) 10Jbond: pki2002: add grant for new pki server [puppet] - 10https://gerrit.wikimedia.org/r/896339 (https://phabricator.wikimedia.org/T331696) [13:29:39] (03PS1) 10Jbond: pki2002: move pki server to pki role [puppet] - 10https://gerrit.wikimedia.org/r/896340 (https://phabricator.wikimedia.org/T331696) [13:30:24] (03PS2) 10Jbond: pki2002: move pki server to pki role [puppet] - 
10https://gerrit.wikimedia.org/r/896340 (https://phabricator.wikimedia.org/T331696) [13:30:35] PROBLEM - Host cloudsw1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:30:47] (03CR) 10SD hehua: [C: 03+1] "OK." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896321 (https://phabricator.wikimedia.org/T331691) (owner: 10Stang) [13:32:09] (03PS1) 10Marostegui: db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/896341 [13:32:17] (03CR) 10Jbond: [C: 03+2] pki2002: move pki server to pki role [puppet] - 10https://gerrit.wikimedia.org/r/896340 (https://phabricator.wikimedia.org/T331696) (owner: 10Jbond) [13:32:57] (03CR) 10Marostegui: [C: 03+2] db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/896341 (owner: 10Marostegui) [13:34:42] !log restart swift-object-replicator on ms-be2067 [13:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:33] (03CR) 10JMeybohm: [C: 04-1] "There is a global networkpolicy in place that allows all pod-to-pod egress (see dse-k8s.yaml) so the egress part is not required. 
I'd also" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [13:39:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:40:02] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:44:42] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:44:45] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:45:20] (03PS2) 10Volans: Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 [13:45:22] (03PS2) 10Volans: tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 [13:45:24] (03PS1) 10Volans: tox: bandit suppress request_without_timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/896346 [13:45:56] (03CR) 10CI reject: [V: 04-1] Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [13:47:13] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:47:16] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:48:00] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [13:51:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:54:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [13:54:25] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Add entries for new cloudlb. - cmooney@cumin1001" [13:55:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add entries for new cloudlb. - cmooney@cumin1001" [13:55:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] admin_ng: Add default-network-policy globally (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [14:01:18] (03CR) 10Volans: [C: 03+2] tox: bandit suppress request_without_timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/896346 (owner: 10Volans) [14:03:21] (03Merged) 10jenkins-bot: tox: bandit suppress request_without_timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/896346 (owner: 10Volans) [14:03:45] (03Merged) 10jenkins-bot: Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 (owner: 10Volans) [14:03:47] (03Merged) 10jenkins-bot: tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 (owner: 10Volans) [14:05:29] (03PS1) 10Cathal Mooney: Release v0.6.1 update [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/896348 (https://phabricator.wikimedia.org/T331519) [14:07:39] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:46] (03CR) 10Marostegui: "Same password as the rest of pki users right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/896339 (https://phabricator.wikimedia.org/T331696) (owner: 10Jbond) [14:08:03] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for pki2002.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [14:09:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for pki2002.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [14:09:39] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10Gehel) 05Open→03Resolved [14:11:23] (03PS1) 10Cathal Mooney: Update switch_loopback prefix lists to include assigned codfw ranges [homer/public] - 10https://gerrit.wikimedia.org/r/896350 (https://phabricator.wikimedia.org/T327919) [14:12:18] (03CR) 10Cathal Mooney: [C: 03+2] Update switch_loopback prefix lists to include assigned codfw ranges [homer/public] - 10https://gerrit.wikimedia.org/r/896350 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [14:15:31] (03Merged) 10jenkins-bot: Update switch_loopback prefix lists to include assigned codfw ranges [homer/public] - 10https://gerrit.wikimedia.org/r/896350 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [14:16:41] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (10jbond) I have built pki2002 with bullseye, ill do some testing on monday [14:17:19] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10TheDJ) This exception basically only happens at this codepath: https://gerrit.wikimedia.org/g/mediawiki/... 
[14:18:03] (03PS1) 10Andrew Bogott: paws/NFS: move paws to a project-local NFS server [puppet] - 10https://gerrit.wikimedia.org/r/896353 (https://phabricator.wikimedia.org/T301280) [14:19:30] (03CR) 10Cathal Mooney: [C: 03+2] Release v0.6.1 update [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/896348 (https://phabricator.wikimedia.org/T331519) (owner: 10Cathal Mooney) [14:19:56] (03CR) 10Andrew Bogott: "This will be merged during the PAWS cut-over window on 2023-03-14." [puppet] - 10https://gerrit.wikimedia.org/r/896353 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [14:20:24] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 update - cmooney@cumin1001 [14:21:04] (03CR) 10Hashar: "Looks that simply switch the `rsync` direction between the two hosts." [puppet] - 10https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) (owner: 10Dzahn) [14:22:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 update - cmooney@cumin1001 [14:22:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:15] (03PS3) 10JMeybohm: admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) [14:24:17] (03PS3) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) [14:24:43] (03CR) 10CI reject: [V: 04-1] Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [14:25:03] (03CR) 10JMeybohm: admin_ng: Add 
default-network-policy globally (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [14:25:29] (03PS4) 10Jaime Nuche: mwdebug_deploy: remove configuration [puppet] - 10https://gerrit.wikimedia.org/r/867221 [14:25:31] (03PS1) 10Jaime Nuche: mwdebug_deploy: clean up physical resources from target hosts [puppet] - 10https://gerrit.wikimedia.org/r/896355 [14:28:22] (03PS4) 10JMeybohm: Remove default-network-policy-conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/893019 (https://phabricator.wikimedia.org/T275035) [14:29:36] RECOVERY - Host cloudsw1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [14:29:52] (03CR) 10Jbond: pki2002: add grant for new pki server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896339 (https://phabricator.wikimedia.org/T331696) (owner: 10Jbond) [14:30:24] (03CR) 10Jaime Nuche: mwdebug_deploy: remove configuration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [14:31:10] (03CR) 10Marostegui: [C: 03+2] pki2002: add grant for new pki server [puppet] - 10https://gerrit.wikimedia.org/r/896339 (https://phabricator.wikimedia.org/T331696) (owner: 10Jbond) [14:32:28] (03CR) 10Marostegui: [C: 03+2] "User created:" [puppet] - 10https://gerrit.wikimedia.org/r/896339 (https://phabricator.wikimedia.org/T331696) (owner: 10Jbond) [14:36:14] (03PS1) 10Muehlenhoff: Configure database size for MDB backend [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) [14:36:25] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 update - cmooney@cumin1001 [14:38:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 update - cmooney@cumin1001 [14:42:29] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [14:45:53] (03PS1) 10DCausse: rdf-streaming-updater: add a "wcqs" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 [14:46:13] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10cmooney) I've added these in Netbox. Confused at first, they are 10G links actually so xe-0/0/31 and xe-0/0/32. @Papaul can you add the cabel labels id's... [14:46:16] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I don't think much has changed regarding prod swift since 6th March (when we added a coup... [14:47:10] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:47:33] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Jclark-ctr) 05Open→03Resolved Replaced Failed BBu Server it booting [14:47:45] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:48:14] (03PS1) 10Nicolas Fraison: spark: add support for spark-history on the spark image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896363 (https://phabricator.wikimedia.org/T330176) [14:49:01] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) [14:50:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) 
rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:52:00] (03PS2) 10Muehlenhoff: Configure database size for MDB backend [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) [14:52:53] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host cloudlb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [14:52:53] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) @ i don't do cable id's for servers in codfw. [14:55:13] (03PS2) 10DCausse: rdf-streaming-updater: add a "wcqs" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 [14:55:15] (03PS1) 10DCausse: mediawiki-page-content-change-enrichment: fix helmfile-defaults... [deployment-charts] - 10https://gerrit.wikimedia.org/r/896365 [14:56:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49710 bytes in 3.939 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:03:52] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
swift-account-stats_chartmuseum:prod.service,swift-account-stats_mlserve:prod.service,swift-account-stats_research:poc.service,swift-account-stats_search:platform.service,swift-account-stats_swift:dispersion.service,swift-account-stats_tegola:prod.service,swift-account-stats_thanos:prod.service,swift-account-stats_wdqs:flink.servic [15:03:52] dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:46] (03CR) 10Ssingh: [C: 03+1] Add includes for new private IP ranges in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/896329 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [15:06:48] (03CR) 10Cathal Mooney: [C: 03+2] Add includes for new private IP ranges in use in codfw [dns] - 10https://gerrit.wikimedia.org/r/896329 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [15:08:05] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host cloudlb2003-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:09:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:10:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [15:12:31] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10cmooney) @papaul cool that works for me! [15:14:48] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) @cmooney if you are working on this task, please don't forget to check the boxes in the description so I can keep track.
Thank you [15:16:30] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:14] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10cmooney) To update here I believe all but the last 2 items have been completed for both of these. I have: - Run the provision script to add the switch con... [15:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:04] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10cmooney) [15:22:56] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) If the IDRAC is reachable and I see already entries for mgmt and production IP in Netbox that means you already did cookbook sre.dns.netbox so we...
[15:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:25:10] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) [15:31:04] (03PS1) 10Elukey: ml-services: fix eventgate stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/896371 [15:31:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2002-dev'] [15:31:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudlb2002-dev'] [15:31:59] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2002-dev'] [15:32:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2002-dev'] [15:34:00] (03PS1) 10David Caro: cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) [15:34:24] (03CR) 10CI reject: [V: 04-1] cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [15:34:37] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2002-dev'] [15:34:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudlb2002-dev'] [15:35:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2002-dev'] [15:35:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host cloudlb2003-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:35:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudlb2002-dev'] [15:36:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2002-dev'] [15:46:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: fix eventgate stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/896371 (owner: 10Elukey) [15:47:04] (03CR) 10Elukey: [C: 03+2] ml-services: fix eventgate stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/896371 (owner: 10Elukey) [15:47:31] (03CR) 10Klausman: [C: 03+1] ml-services: fix eventgate stream names [deployment-charts] - 10https://gerrit.wikimedia.org/r/896371 (owner: 10Elukey) [15:49:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [15:50:31] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:50:40] (03CR) 10Klausman: api-gateway: allow to configure prefixes without JWT requirements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [15:50:49] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:51:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudlb2002-dev'] [15:52:03] (03CR) 10JHathaway: [C: 03+1] Configure database size for MDB backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [15:53:34] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[15:53:51] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:55:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2003-dev'] [15:56:23] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:56:29] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:56:44] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:56:49] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:56:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:05] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:57:46] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:59:05] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:59:48] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[16:00:20] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:01:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:04:05] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:04:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudlb2003-dev'] [16:04:19] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:06:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:47] (03PS2) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) [16:11:06] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) 05Stalled→03Declined @Bugreporter, Thank you for reporting this issue. I'm sorry it took so long to get to this and for us to ultimately r... 
[16:11:09] (03CR) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [16:11:58] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:16:00] (03CR) 10Elukey: "In theory the chart's version needs to be bumped as well right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [16:17:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [16:18:28] 10SRE, 10ops-eqiad: Remove second links from cloud servers - https://phabricator.wikimedia.org/T331737 (10cmooney) p:05Triage→03Low [16:22:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:24:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [16:27:08] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10thcipriani) >>! In T331647#8682660, @taavi wrote: > You sure that's the correct group? The `deploy_airflow` keyholder key can be accessed by the following groups: > `lang=yaml > # Shared deploy ssh key f... 
[16:27:11] (03PS3) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) [16:27:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:27:47] (03PS4) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) [16:28:05] (03CR) 10Nicolas Fraison: spark: Authorize driver and executor pods to communicate (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [16:30:36] (03PS1) 10Papaul: Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) [16:30:58] (03CR) 10CI reject: [V: 04-1] Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) (owner: 10Papaul) [16:31:10] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:32:13] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:37:02] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) Similarly, looking at logstash searches for e.g. 
`host: thumbor* AND message:502` doesn't... [16:39:32] (03CR) 10Muehlenhoff: Configure database size for MDB backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [16:41:56] (03PS2) 10Papaul: Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) [16:42:18] (03CR) 10CI reject: [V: 04-1] Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) (owner: 10Papaul) [16:42:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [16:44:10] (03PS1) 10JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) [16:44:29] (03PS3) 10Papaul: Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) [16:44:54] (03CR) 10CI reject: [V: 04-1] Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) (owner: 10Papaul) [16:45:15] (03CR) 10BryanDavis: [C: 04-1] "We figured out where the real bug is. This won't fix things yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) (owner: 10BryanDavis) [16:45:19] (03PS3) 10DCausse: rdf-streaming-updater: add a "wcqs" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 [16:47:04] (03PS4) 10Papaul: Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) [16:47:21] (03CR) 10JMeybohm: [C: 03+1] spark: Authorize driver and executor pods to communicate [deployment-charts] - 10https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: 10Nicolas Fraison) [16:47:57] (03CR) 10Papaul: [C: 03+2] Add cloudlb200[23] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/896382 (https://phabricator.wikimedia.org/T329865) (owner: 10Papaul) [16:48:29] (03CR) 10DCausse: rdf-streaming-updater: add a "wcqs" release (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [16:49:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [16:49:54] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) I'm struggling to find any corresponding uptick in error logs on swift frontends, which i... 
[16:50:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) [16:54:29] (03CR) 10JHathaway: [C: 03+1] Configure database size for MDB backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [16:54:43] 10SRE, 10MediaWiki-Documentation, 10serviceops-radar, 10Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Aklapper) [16:55:17] (03PS3) 10Aklapper: maps: introduce imposm-geometry-import [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [16:55:50] (03CR) 10CI reject: [V: 04-1] maps: introduce imposm-geometry-import [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [16:56:35] (03PS2) 10Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748) [17:13:06] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:13:13] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:22:58] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:23:04] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 
for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:27:31] (03CR) 10Dzahn: doc.wikimedia.org: switch active host from eqiad to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) (owner: 10Dzahn) [17:28:36] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:28:44] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:34:16] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:34:27] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:35:34] (03CR) 10Dzahn: "Adding Brennen, not so much for the actual code review but because this needs deployment during a deployment window and one is planned for" [puppet] - 10https://gerrit.wikimedia.org/r/896211 (owner: 10Varnent) [17:35:50] PROBLEM - Host cloudsw1-b1-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:36:40] (03PS2) 10Dzahn: phabricator: update footer links for foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/896211 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [17:39:08] (03CR) 10CI reject: [V: 04-1] phabricator: update footer links for foundation wiki [puppet] - 10https://gerrit.wikimedia.org/r/896211 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [17:40:46] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2002-dev.codfw.wmnet with OS 
bullseye [17:40:51] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye executed with e... [17:43:26] RECOVERY - Host cloudsw1-b1-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.90 ms [17:44:03] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:44:11] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:44:11] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:44:16] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye executed with e... 
[17:45:38] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.414 second response time https://wikitech.wikimedia.org/wiki/Swift [17:46:15] (03PS1) 10Herron: add tox json(net) linting and address issues raised [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896408 [17:47:22] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Swift [17:47:27] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:47:34] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:50:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:51:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:48] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:52:54] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye executed with e... 
[17:53:24] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:53:31] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:57:47] (03PS2) 10Herron: add tox json(net) linting and address issues raised [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896408 [17:59:44] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [17:59:51] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye executed with e... 
[18:04:20] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [18:04:26] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [18:05:36] (03CR) 10Brennen Bearnes: [C: 04-1] "See T331673 and discussion on https://gerrit.wikimedia.org/r/c/operations/puppet/+/877188/" [puppet] - 10https://gerrit.wikimedia.org/r/896211 (https://phabricator.wikimedia.org/T331680) (owner: 10Varnent) [18:06:06] (03PS3) 10Herron: add tox json(net) linting and address issues raised [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896408 (https://phabricator.wikimedia.org/T331659) [18:10:28] (03PS4) 10Herron: add tox json(net) linting and address issues raised [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/896408 (https://phabricator.wikimedia.org/T331659) [18:11:19] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10cmooney) [18:12:54] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [18:13:01] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye executed with e... 
[18:13:13] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS bullseye [18:13:20] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye [18:21:04] (03PS4) 10Dzahn: releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 (https://phabricator.wikimedia.org/T330960) [18:22:10] (03PS5) 10Dzahn: releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 (https://phabricator.wikimedia.org/T330960) [18:22:36] (03CR) 10Dzahn: [C: 03+2] releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [18:22:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2003-dev.codfw.wmnet with OS bullseye [18:22:49] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudlb2003-dev.codfw.wmnet with OS bullseye [18:23:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10KSiebert) Welcome, @S_Mukuti! 
[18:23:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:24:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) We've tested PXEboot / DHCP / OS install for 2 new servers (cloudlb2002-dev and cloudlb2003-de... [18:24:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [18:35:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [18:42:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [18:46:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [18:51:14] (03CR) 10Bernard Wang: [C: 03+1] Add header at top of main page (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894765 (https://phabricator.wikimedia.org/T325362) (owner: 10Kimberly Sarabia) [18:51:22] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
cmooney@cumin1001" [18:55:11] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@bb9a944] [18:55:23] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@bb9a944] (duration: 00m 12s) [19:00:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [19:00:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [19:00:58] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2002-dev.codfw.wmnet with OS bullseye [19:01:03] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudlb2002-dev.codfw.wmnet with OS bullseye completed: - cl... 
[19:02:21] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:07:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [19:11:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:11:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2003-dev.codfw.wmnet with OS bullseye [19:11:11] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudlb2003-dev.codfw.wmnet with OS bullseye completed: - clo... [19:12:00] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) [19:13:01] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudlb200[23]-dev - https://phabricator.wikimedia.org/T329865 (10Papaul) 05Open→03Resolved @aborrero this is done [19:13:43] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:17:01] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new ms-fe servers - cmjohnson@cumin1001" [19:17:30] (03PS1) 10Gmodena: page-content-change: set flink total memory size. [deployment-charts] - 10https://gerrit.wikimedia.org/r/896423 [19:18:45] (03PS2) 10Gmodena: page-content-change: set flink total memory size. 
[deployment-charts] - https://gerrit.wikimedia.org/r/896423 [19:23:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new ms-fe servers - cmjohnson@cumin1001" [19:23:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:24:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:18] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1013.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:38] !log milimetric@deploy2002 Started deploy [analytics/refinery@898a942]: Special deploy for pageview job migration [analytics/refinery@898a942] [19:33:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-fe1014.mgmt.eqiad.wmnet with reboot policy FORCED [19:38:46] !log milimetric@deploy2002 Finished deploy [analytics/refinery@898a942]: Special deploy for pageview job migration [analytics/refinery@898a942] (duration: 08m 08s) [19:38:56] !log milimetric@deploy2002 Started deploy [analytics/refinery@898a942] (thin): Special deploy for pageview job migration [analytics/refinery@898a942] [19:39:06] !log milimetric@deploy2002 Finished deploy [analytics/refinery@898a942] (thin): Special deploy for pageview job migration [analytics/refinery@898a942] (duration: 00m 09s) [20:20:16] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [20:20:21] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[20:22:11] (PS1) Herron: onboard home dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/896426 (https://phabricator.wikimedia.org/T331656) [20:31:21] PROBLEM - IPMI Sensor Status on parse2004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:32:42] (CR) Jforrester: "This will break the train." [mediawiki-config] - https://gerrit.wikimedia.org/r/896032 (https://phabricator.wikimedia.org/T331685) (owner: Zabe) [20:33:05] (PS1) Jforrester: Revert "Revert "Unload RenameUser, now part of core: Part II of II"" [mediawiki-config] - https://gerrit.wikimedia.org/r/896037 [20:33:43] (PS2) Jforrester: Drop loading of former extension Renameuser's i18n strings [Re-apply] [mediawiki-config] - https://gerrit.wikimedia.org/r/896037 [20:34:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:39:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:41:29] (PS3) Gmodena: page-content-change: set flink total memory size. 
[deployment-charts] - https://gerrit.wikimedia.org/r/896423 [20:43:17] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@dd7fc78] [20:43:28] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@dd7fc78] (duration: 00m 10s) [20:47:27] (CR) Hashar: doc.wikimedia.org: switch active host from eqiad to codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) (owner: Dzahn) [20:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:08] (PS1) JHathaway: jaeger: try to fix istio definition [deployment-charts] - https://gerrit.wikimedia.org/r/896429 (https://phabricator.wikimedia.org/T320554) [20:52:24] (PS2) JHathaway: jaeger: try to fix istio definition [deployment-charts] - https://gerrit.wikimedia.org/r/896429 (https://phabricator.wikimedia.org/T320554) [20:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:17] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:45] (CR) Gergő Tisza: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 
Kosta Harlan) [20:58:28] (PS2) Jforrester: docroot: Update privacy policy footer link [mediawiki-config] - https://gerrit.wikimedia.org/r/896216 (https://phabricator.wikimedia.org/T331680) (owner: Varnent) [20:59:20] (PS2) Jforrester: [foundationwiki] Grant translation admin rights to 'editor' group [mediawiki-config] - https://gerrit.wikimedia.org/r/896224 (https://phabricator.wikimedia.org/T297396) (owner: Varnent) [21:00:19] (CR) JHathaway: [C: +2] jaeger: try to fix istio definition [deployment-charts] - https://gerrit.wikimedia.org/r/896429 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [21:00:54] (CR) Gergő Tisza: [C: +1] changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: Kosta Harlan) [21:03:32] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [21:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:08] (PS1) Jsn.sherman: WIP: Log additional click events on Special:Diff [mediawiki-config] - https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [21:13:38] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [21:14:11] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[21:21:04] (CR) Krinkle: [C: +1] Added extended confirmed on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: Bas dehaan) [21:24:17] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [21:33:05] (PS1) JHathaway: aux-k8s: fix secret location, attempt four [labs/private] - https://gerrit.wikimedia.org/r/896436 [21:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:37:40] (CR) JHathaway: [V: +2 C: +2] aux-k8s: fix secret location, attempt four [labs/private] - https://gerrit.wikimedia.org/r/896436 (owner: JHathaway) [21:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:40:09] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.262 second response time https://wikitech.wikimedia.org/wiki/Swift [21:41:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [21:47:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:49:03] (PS2) BryanDavis: striker: Bump container version to 2023-03-10-212005-production [puppet] - 
https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) [21:52:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:55:02] (CR) BryanDavis: [C: +1] striker: Bump container version to 2023-03-10-212005-production (1 comment) [puppet] - https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) (owner: BryanDavis) [22:16:18] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [22:21:05] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Mike_Peel) @MatthewVernon Thanks for looking into this. Just to note that it does go back a bit longer,... [22:24:48] (PS1) JHathaway: jaeger: move istio host to correct dc [deployment-charts] - https://gerrit.wikimedia.org/r/896441 (https://phabricator.wikimedia.org/T320554) [22:26:10] (PS1) JHathaway: jaeger: fix es port [deployment-charts] - https://gerrit.wikimedia.org/r/896442 (https://phabricator.wikimedia.org/T320554) [22:26:27] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [22:31:45] (CR) JHathaway: [C: +2] jaeger: fix es port [deployment-charts] - https://gerrit.wikimedia.org/r/896442 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [22:32:03] (PS2) JHathaway: jaeger: move istio host to correct dc [deployment-charts] - https://gerrit.wikimedia.org/r/896441 (https://phabricator.wikimedia.org/T320554) [22:32:52] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[22:39:39] (CR) JHathaway: [C: +2] jaeger: move istio host to correct dc [deployment-charts] - https://gerrit.wikimedia.org/r/896441 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [22:43:01] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [23:52:24] (CR) Thcipriani: [C: +1] keyholder-proxy: systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895885 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall) [23:59:30] Puppet, Infrastructure-Foundations, MobileFrontend (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Jdlrobson) (Update )The "Request desktop site" feature is...