[00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940215
[00:38:31] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940215 (owner: TrainBranchBot)
[00:55:14] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940215 (owner: TrainBranchBot)
[01:49:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:16] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:32] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:38] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:54] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:27:07] (PS1) Ayounsi: Add python 3.11 support to Tox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940514
[05:27:09] (PS1) Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638)
[05:28:19] (CR) CI reject: [V: -1] Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[06:23:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[06:25:33] (PS1) Elukey: knative-serving: rework probes and update the webhook's netowork policy [deployment-charts] - https://gerrit.wikimedia.org/r/940518
[06:27:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[06:28:13] (PS2) Elukey: knative-serving: rework probes and update the webhook's netowork policy [deployment-charts] - https://gerrit.wikimedia.org/r/940518
[06:28:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[06:32:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[06:35:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:40:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:48:11] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42647/console" [puppet] - https://gerrit.wikimedia.org/r/939345 (owner: Jelto)
[06:52:03] (CR) Jelto: [V: +1 C: +2] Revert "gitlab: move gitlab to test idp" [puppet] - https://gerrit.wikimedia.org/r/939345 (owner: Jelto)
[06:58:45] (CR) Urbanecm: [C: +2] Add reassignMentees.php maintenance script [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940142 (https://phabricator.wikimedia.org/T330071) (owner: Urbanecm)
[06:59:22] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940142 (https://phabricator.wikimedia.org/T330071) (owner: Urbanecm)
[07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:21] * urbanecm steals the window for 940142
[07:08:32] (PS1) Urbanecm: ChangeMentor: Refactor the notification conditions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940462 (https://phabricator.wikimedia.org/T336875)
[07:08:40] (CR) Urbanecm: [C: +2] ChangeMentor: Refactor the notification conditions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940462 (https://phabricator.wikimedia.org/T336875) (owner: Urbanecm)
[07:14:48] (CR) Volans: "Post-merge clarification as I'm not sure what the original intention was." [puppet] - https://gerrit.wikimedia.org/r/939302 (owner: Ladsgroup)
[07:15:45] (PS1) Marostegui: install_server: Do not format db1208 [puppet] - https://gerrit.wikimedia.org/r/940631
[07:15:57] (CR) Elukey: [C: +2] knative-serving: rework probes and update the webhook's netowork policy [deployment-charts] - https://gerrit.wikimedia.org/r/940518 (owner: Elukey)
[07:16:08] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) I switched GitLab oidc login back to produciton idp (https://gerrit.wikimedia.org/r/c/operations/puppet/+/939345). I get th...
[07:17:37] (Merged) jenkins-bot: Add reassignMentees.php maintenance script [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940142 (https://phabricator.wikimedia.org/T330071) (owner: Urbanecm)
[07:17:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940142|Add reassignMentees.php maintenance script (T330071)]]
[07:17:59] T330071: Mentorship: ensure that all mentees are assigned to an active mentor - https://phabricator.wikimedia.org/T330071
[07:21:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:22:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:23:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:25:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:28:00] (Merged) jenkins-bot: ChangeMentor: Refactor the notification conditions [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/940462 (https://phabricator.wikimedia.org/T336875) (owner: Urbanecm)
[07:30:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:31:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:32:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940142|Add reassignMentees.php maintenance script (T330071)]] (duration: 14m 39s)
[07:32:37] T330071: Mentorship: ensure that all mentees are assigned to an active mentor - https://phabricator.wikimedia.org/T330071
[07:33:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940462|ChangeMentor: Refactor the notification conditions (T336875)]]
[07:33:09] T336875: Review the rules for notifications sent when a new mentor is assigned to a given user - https://phabricator.wikimedia.org/T336875
[07:40:08] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940462|ChangeMentor: Refactor the notification conditions (T336875)]] (duration: 07m 02s)
[07:40:12] T336875: Review the rules for notifications sent when a new mentor is assigned to a given user - https://phabricator.wikimedia.org/T336875
[07:40:41] * urbanecm done
[07:56:35] (PS1) Stevemunene: airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648)
[07:57:20] (CR) CI reject: [V: -1] airflow-wmde: Add Kara Payne to analytics-wmde [puppet] - https://gerrit.wikimedia.org/r/940863 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[08:10:55] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940215 (owner: TrainBranchBot)
[08:15:51] (PS2) Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638)
[08:16:05] (CR) Fabfur: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/940503 (https://phabricator.wikimedia.org/T342509) (owner: Vgutierrez)
[08:16:33] (CR) CI reject: [V: -1] Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[08:18:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940216
[08:18:25] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940216 (owner: TrainBranchBot)
[08:18:54] (PS3) Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638)
[08:19:40] (CR) CI reject: [V: -1] Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[08:20:44] (CR) Vgutierrez: [C: +1] team-traffic: add service restart alert for bird [alerts] - https://gerrit.wikimedia.org/r/940359 (owner: Ssingh)
[08:22:41] !log dcausse@deploy1002 Started deploy [airflow-dags/search@a47bd0f]: search: Fix partition definition for wmf_raw.mediawiki_page_table
[08:22:54] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@a47bd0f]: search: Fix partition definition for wmf_raw.mediawiki_page_table (duration: 00m 12s)
[08:23:55] (CR) Giuseppe Lavagetto: [C: +2] mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:24:01] (CR) Giuseppe Lavagetto: [C: +2] mw-api-int: increase namespace limits [deployment-charts] - https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:26:13] (Merged) jenkins-bot: mw-api-int: increase namespace limits [deployment-charts] - https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:26:46] (Merged) jenkins-bot: mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:27:08] (CR) Ladsgroup: [V: +2 C: +2] spicerack: Add config file for MySQL/MariaDB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/939302 (owner: Ladsgroup)
[08:27:43] (CR) DCausse: [C: +1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:28:47] !log oblivian@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:29:59] !log oblivian@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:30:03] SRE, LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (Cyndymediawiksim) >>! In T342230#9031047, @Aklapper wrote: > @Cyndymediawiksim Uhmmm. I am sorry for the hassle. [The Phabricator account now shows no authentication factors](https://phabricat...
[08:30:27] !log oblivian@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:30:55] !log oblivian@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:31:16] SRE, Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Fabfur) I started working on trafficserver package
[08:31:31] ops-eqiad, Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Vgutierrez) a:Vgutierrez→RobH lvs1013-lvs1015 have been reimaged as expected, we've been unable to reimage lvs1016
[08:31:35] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:32:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:33:01] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:33:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:35:41] <_joe_> jouncebot: nowandnext
[08:35:41] No deployments scheduled for the next 1 hour(s) and 24 minute(s)
[08:35:41] In 1 hour(s) and 24 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1000)
[08:35:48] (PS1) Elukey: knative-serving: add missing autoscaler replica var to the template [deployment-charts] - https://gerrit.wikimedia.org/r/940865
[08:35:50] (PS1) Elukey: admin_ng: raise the knative activator replicas to 4 for prod [deployment-charts] - https://gerrit.wikimedia.org/r/940866
[08:35:55] <_joe_> let's say I'm going a bit early :)
[08:36:08] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[08:36:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[08:36:50] (PS1) Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[08:37:18] (CR) Filippo Giunchedi: [C: +1] logstash: remove haproxy log cloning [puppet] - https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:38:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940216 (owner: TrainBranchBot)
[08:38:59] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[08:39:18] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[08:41:51] (CR) Giuseppe Lavagetto: [C: +2] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:42:32] (Merged) jenkins-bot: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[08:43:54] (CR) Klausman: [C: +1] knative-serving: add missing autoscaler replica var to the template [deployment-charts] - https://gerrit.wikimedia.org/r/940865 (owner: Elukey)
[08:44:13] (CR) Klausman: [C: +1] admin_ng: raise the knative activator replicas to 4 for prod [deployment-charts] - https://gerrit.wikimedia.org/r/940866 (owner: Elukey)
[08:44:36] (CR) Elukey: [C: +2] knative-serving: add missing autoscaler replica var to the template [deployment-charts] - https://gerrit.wikimedia.org/r/940865 (owner: Elukey)
[08:44:41] (CR) Elukey: [C: +2] admin_ng: raise the knative activator replicas to 4 for prod [deployment-charts] - https://gerrit.wikimedia.org/r/940866 (owner: Elukey)
[08:45:02] !log testing trafficserver 9.2.1 in cp4052 (upload node) - T339134
[08:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:06] T339134: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134
[08:45:53] (PS1) Filippo Giunchedi: base: wrap up cadvisor rollout [puppet] - https://gerrit.wikimedia.org/r/940868 (https://phabricator.wikimedia.org/T108027)
[08:46:03] (CR) Ilias Sarantopoulos: [C: +1] knative-serving: add missing autoscaler replica var to the template [deployment-charts] - https://gerrit.wikimedia.org/r/940865 (owner: Elukey)
[08:48:36] (PS1) DCausse: rdf-streaming-updater (dse-k8s test): use mw-api-int-async-ro [deployment-charts] - https://gerrit.wikimedia.org/r/940870 (https://phabricator.wikimedia.org/T342252)
[08:53:38] (CR) Giuseppe Lavagetto: [C: +1] rdf-streaming-updater (dse-k8s test): use mw-api-int-async-ro [deployment-charts] - https://gerrit.wikimedia.org/r/940870 (https://phabricator.wikimedia.org/T342252) (owner: DCausse)
[08:54:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:56:25] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:57:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:57:17] (CR) Peter Fischer: [C: +2] Bump version of extra plugin [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[08:58:26] (CR) Giuseppe Lavagetto: [C: +1] base: wrap up cadvisor rollout [puppet] - https://gerrit.wikimedia.org/r/940868 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:58:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:58:55] (CR) Filippo Giunchedi: [C: +2] base: wrap up cadvisor rollout [puppet] - https://gerrit.wikimedia.org/r/940868 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:59:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:00:16] (CR) Klausman: [C: +1] knative-serving: rework probes and update the webhook's netowork policy [deployment-charts] - https://gerrit.wikimedia.org/r/940518 (owner: Elukey)
[09:00:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:01:14] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:03:09] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:07:49] (PS1) Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp-drmrs [puppet] - https://gerrit.wikimedia.org/r/940873 (https://phabricator.wikimedia.org/T342211)
[09:08:09] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[09:08:49] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:09:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:10:31] SRE, DNS, Traffic, Patch-For-Review: Update learn.wiki DNS records - https://phabricator.wikimedia.org/T342509 (Vgutierrez)
[09:10:57] (CR) Fabfur: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42648/console" [puppet] - https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[09:11:45] (PS2) Vgutierrez: learn.wiki: Update ALB CNAME records [dns] - https://gerrit.wikimedia.org/r/940503 (https://phabricator.wikimedia.org/T342509)
[09:12:02] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[09:13:00] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:13:09] (CR) Fabfur: [V: +1 C: +2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42649/console" [puppet] - https://gerrit.wikimedia.org/r/940190 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[09:14:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.clone of db1124.eqiad.wmnet onto db1133.eqiad.wmnet
[09:16:07] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[09:16:20] SRE, Data Engineering and Event Platform Team, MW-on-K8s, serviceops, Patch-For-Review: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (Joe) In progress→Resolved Now rdf-streaming-updater uses the read-only endpoint on k8s, meaning it...
[09:16:34] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42650/console" [puppet] - https://gerrit.wikimedia.org/r/940873 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[09:20:06] (CR) DCausse: [C: +2] rdf-streaming-updater (dse-k8s test): use mw-api-int-async-ro [deployment-charts] - https://gerrit.wikimedia.org/r/940870 (https://phabricator.wikimedia.org/T342252) (owner: DCausse)
[09:21:04] (Merged) jenkins-bot: rdf-streaming-updater (dse-k8s test): use mw-api-int-async-ro [deployment-charts] - https://gerrit.wikimedia.org/r/940870 (https://phabricator.wikimedia.org/T342252) (owner: DCausse)
[09:21:28] PROBLEM - Check systemd state on ms-fe2014 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:37] !log rollback to trafficserver 9.1.4 in cp4052 - T339134
[09:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:41] T339134: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134
[09:24:05] !log dcausse@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[09:25:37] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/940873 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[09:25:46] PROBLEM - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:12] !log dcausse@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[09:26:52] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940873 (T342211) to drmrs DC (disable keepalive on port 80 on A:cp-drmrs)
[09:26:53] (CR) Vgutierrez: learn.wiki: Update ALB CNAME records [dns] - https://gerrit.wikimedia.org/r/940503 (https://phabricator.wikimedia.org/T342509) (owner: Vgutierrez)
[09:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:55] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[09:27:08] (CR) Fabfur: [V: +1 C: +2] haproxy: Add option to disable keepalive on port 80 on A:cp-drmrs [puppet] - https://gerrit.wikimedia.org/r/940873 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[09:28:00] the cadvisor failures are me btw
[09:31:36] (CR) Fabfur: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/940503 (https://phabricator.wikimedia.org/T342509) (owner: Vgutierrez)
[09:31:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1124.eqiad.wmnet onto db1133.eqiad.wmnet
[09:31:46] PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:02] ops-eqiad, Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Fabfur) Currently the issue with the reimaging of lvs1016 is related to T342345
[09:49:32] (PS1) Ladsgroup: sre.mysql.clone: Fix setting replication password [cookbooks] - https://gerrit.wikimedia.org/r/940876
[09:49:51] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/937530 (https://phabricator.wikimedia.org/T265633) (owner: Jbond)
[09:51:52] RECOVERY - Check systemd state on thanos-be2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:12] (CR) Marostegui: [C: +2] install_server: Do not format db1208 [puppet] - https://gerrit.wikimedia.org/r/940631 (owner: Marostegui)
[09:55:34] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) We'll first make the move to 2% of traffic, then ramp up from there during the week.
[09:55:44] (CR) Marostegui: [C: +1] sre.mysql.clone: Fix setting replication password [cookbooks] - https://gerrit.wikimedia.org/r/940876 (owner: Ladsgroup)
[09:55:48] (CR) Alexandros Kosiaris: [C: +1] "+1, but let's have traffic merge this." [puppet] - https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) (owner: Jforrester)
[09:56:08] (CR) Ladsgroup: [C: +2] sre.mysql.clone: Fix setting replication password [cookbooks] - https://gerrit.wikimedia.org/r/940876 (owner: Ladsgroup)
[09:57:56] RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:35] (Merged) jenkins-bot: sre.mysql.clone: Fix setting replication password [cookbooks] - https://gerrit.wikimedia.org/r/940876 (owner: Ladsgroup)
[09:59:44] RECOVERY - Check systemd state on ms-fe2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1000)
[10:02:07] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (JMeybohm) >>! In T297314#9035364, @cmassaro wro...
[10:05:50] (03CR) 10Volans: "Post-merge comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [10:06:19] (03CR) 10Volans: [C: 03+2] spicerack: set CookbookCollection args as kw-only [software/spicerack] - 10https://gerrit.wikimedia.org/r/938821 (owner: 10Volans) [10:09:16] 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) [10:10:26] (03Merged) 10jenkins-bot: spicerack: set CookbookCollection args as kw-only [software/spicerack] - 10https://gerrit.wikimedia.org/r/938821 (owner: 10Volans) [10:11:45] (03PS1) 10Filippo Giunchedi: prometheus: add recording rules for cadvisor cpu/mem [puppet] - 10https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) [10:12:40] (03CR) 10Volans: "Post merge comments" [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [10:15:05] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp [puppet] - 10https://gerrit.wikimedia.org/r/940880 [10:18:15] (03PS2) 10Klausman: admin_ng: Increase memory for knative-serving/activator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 [10:23:32] (03PS1) 10Btullis: Fix the cephosd activate exec resources [puppet] - 10https://gerrit.wikimedia.org/r/940882 (https://phabricator.wikimedia.org/T330151) [10:24:04] (03CR) 10CI reject: [V: 04-1] Fix the cephosd activate exec resources [puppet] - 10https://gerrit.wikimedia.org/r/940882 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:24:48] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 14 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42651/console" [puppet] - 10https://gerrit.wikimedia.org/r/940880 (owner: 10Fabfur) [10:25:42] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - 
https://phabricator.wikimedia.org/T342130 (10Volans) Indeed what John said. For context the integrity of the data is currently ensured by git fast-forward only. If tw... [10:30:02] (03CR) 10Elukey: "Left a nit, the rest looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 (owner: 10Klausman) [10:31:08] (03PS3) 10Klausman: admin_ng: Increase memory for knative-serving/activator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 [10:31:16] (03CR) 10Klausman: admin_ng: Increase memory for knative-serving/activator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 (owner: 10Klausman) [10:32:37] (03PS2) 10Btullis: Fix the cephosd activate exec resources [puppet] - 10https://gerrit.wikimedia.org/r/940882 (https://phabricator.wikimedia.org/T330151) [10:33:51] (03CR) 10Klausman: [C: 03+2] admin_ng: Increase memory for knative-serving/activator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 (owner: 10Klausman) [10:36:05] (03Merged) 10jenkins-bot: admin_ng: Increase memory for knative-serving/activator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 (owner: 10Klausman) [10:38:33] (03CR) 10Vgutierrez: [C: 03+1] "looking good, please fix the commit msg before merging" [puppet] - 10https://gerrit.wikimedia.org/r/940880 (owner: 10Fabfur) [10:39:10] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1005 [10:39:30] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1005 [10:39:38] (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 on A:cp [puppet] - 10https://gerrit.wikimedia.org/r/940880 (https://phabricator.wikimedia.org/T342211) [10:40:15] (03CR) 10Fabfur: "Done thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/940880 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[10:41:08] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:41:10] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/940880 (T342211) to eqiad DC, only one left (disable keepalive on port 80 on A:cp)
[10:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:13] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[10:41:25] (CR) Fabfur: [C: +2] haproxy: Add option to disable keepalive on port 80 on A:cp [puppet] - https://gerrit.wikimedia.org/r/940880 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[10:41:37] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:42:46] SRE, Traffic, Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[10:44:25] (PS1) Clément Goubert: mw-api-int: Raise php container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/940886 (https://phabricator.wikimedia.org/T342252)
[10:44:48] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:45:07] SRE, Traffic, Patch-For-Review: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur) The change has been applied in all DCs. Metrics shows a positive impact (as expected) on the number of sessions on port 80 for all DCs. Ex: {F37148259}
[10:45:17] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:46:17] (CR) Giuseppe Lavagetto: [C: +1] mw-api-int: Raise php container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/940886 (https://phabricator.wikimedia.org/T342252) (owner: Clément Goubert)
[10:46:39] (CR) Clément Goubert: [C: +2] mw-api-int: Raise php container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/940886 (https://phabricator.wikimedia.org/T342252) (owner: Clément Goubert)
[10:46:56] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:46:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:47:14] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:47:34] (Merged) jenkins-bot: mw-api-int: Raise php container CPU limits [deployment-charts] - https://gerrit.wikimedia.org/r/940886 (https://phabricator.wikimedia.org/T342252) (owner: Clément Goubert)
[10:47:36] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:47:43] (CR) Btullis: [C: +2] Fix the cephosd activate exec resources [puppet] - https://gerrit.wikimedia.org/r/940882 (https://phabricator.wikimedia.org/T330151) (owner: Btullis)
[10:48:36] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:49:54] I looks like there are a lot of changes "stuck" in the post-merge queue on zuul/jenkins
[10:50:06] * MichaelG_WMDE never knows who to ping here
[10:51:11] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on releases2002.codfw.wmnet,releases1002.eqiad.wmnet with reason: Decommissioning prep
[10:51:27] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on releases2002.codfw.wmnet,releases1002.eqiad.wmnet with reason: Decommissioning prep
[10:58:51] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:59:37] (PS1) Arturo Borrero Gonzalez: network: data: add new cloud CIDRs for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/940887 (https://phabricator.wikimedia.org/T341063)
[11:02:04] (CR) Vgutierrez: [C: +2] learn.wiki: Update ALB CNAME records [dns] - https://gerrit.wikimedia.org/r/940503 (https://phabricator.wikimedia.org/T342509) (owner: Vgutierrez)
[11:02:59] (PS1) Clément Goubert: admin_ng: Raise max cpu per pod to 10 for mw-api-int [deployment-charts] - https://gerrit.wikimedia.org/r/940888 (https://phabricator.wikimedia.org/T342252)
[11:03:56] MichaelG_WMDE: ping hashar for zuul issues
[11:05:13] claime: thanks, will keep that in mind for the future. The specific issue seems to have fixed itself
[11:05:25] SRE, DNS, Traffic, Patch-For-Review: Update learn.wiki DNS records - https://phabricator.wikimedia.org/T342509 (Vgutierrez) Open→Resolved a:Vgutierrez ` vgutierrez@dns1004:~$ dig +short learn.wiki @ns0.wikimedia.org 3.33.143.48 15.197.134.113 vgutierrez@dns1004:~$ dig +short studio.le...
[11:11:24] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) >>! In T341463#9014217, @Quiddity wrote: > Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_...
[11:11:36] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:11:48] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (Clement_Goubert) In progress→Resolved
[11:24:09] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/940220
[11:28:16] (PS1) Dreamy Jazz: Enable write new for event table migration for CheckUser [mediawiki-config] - https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158)
[11:29:46] (PS2) Dreamy Jazz: Enable write new on test wiki for CheckUser event tables migration [mediawiki-config] - https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158)
[11:32:09] (PS3) Dreamy Jazz: Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158)
[11:47:26] (PS1) Arturo Borrero Gonzalez: clouceph: mon: enable more client networks [puppet] - https://gerrit.wikimedia.org/r/940930 (https://phabricator.wikimedia.org/T341495)
[11:50:10] (CR) CI reject: [V: -1] clouceph: mon: enable more client networks [puppet] - https://gerrit.wikimedia.org/r/940930 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[11:52:19] (CR) Cathal Mooney: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/940887 (https://phabricator.wikimedia.org/T341063) (owner: Arturo Borrero Gonzalez)
[11:52:46] (CR) Arturo Borrero Gonzalez: [C: +2] network: data: add new cloud CIDRs for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/940887 (https://phabricator.wikimedia.org/T341063) (owner: Arturo Borrero Gonzalez)
[11:56:48] (PS1) Jelto: gitlab: alert on ci failure rate instead of failure ratio [alerts] - https://gerrit.wikimedia.org/r/940932 (https://phabricator.wikimedia.org/T339370)
[12:01:06] (PS2) Arturo Borrero Gonzalez: clouceph: mon: enable more client networks [puppet] - https://gerrit.wikimedia.org/r/940930 (https://phabricator.wikimedia.org/T341495)
[12:04:06] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/940930/42653/" [puppet] - https://gerrit.wikimedia.org/r/940930 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[12:05:52] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] clouceph: mon: enable more client networks [puppet] - https://gerrit.wikimedia.org/r/940930 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[12:06:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2169 (s6, s7)', diff saved to https://phabricator.wikimedia.org/P49649 and previous config saved to /var/cache/conftool/dbconfig/20230724-120609-root.json
[12:06:10] (PS1) Marostegui: db2169: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940934 (https://phabricator.wikimedia.org/T334650)
[12:06:42] (CR) DCausse: [C: +1] "@pfischer this should get merged by Ryan or Brian when they'll build the deb package (removing your +2 because it has no effect).
This rep" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[12:07:47] (CR) Marostegui: [C: +2] db2169: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940934 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[12:08:46] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (SLyngshede-WMF) @Jelto I think you have the wrong client id. Should be: "gitlab_replica_oidc". CAS will check the serviceId / URL...
[12:09:50] RECOVERY - glance-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1922 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[12:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49650 and previous config saved to /var/cache/conftool/dbconfig/20230724-121024-root.json
[12:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49651 and previous config saved to /var/cache/conftool/dbconfig/20230724-121031-root.json
[12:12:25] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (Jclark-ctr)
[12:13:20] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) We are using `gitlab_replica_oidc` on the replicas (`"identifier" => "gitlab_replica_oidc"` in `/etc/gitlab/gitlab.rb`.). T...
[12:13:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1187', diff saved to https://phabricator.wikimedia.org/P49652 and previous config saved to /var/cache/conftool/dbconfig/20230724-121329-root.json
[12:14:10] !log dcausse@deploy1002 Started deploy [airflow-dags/search@e7b9253]: search: fix table name for wmf_raw.mediawiki_page
[12:14:21] (PS1) Marostegui: db1187: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940935 (https://phabricator.wikimedia.org/T334650)
[12:14:23] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@e7b9253]: search: fix table name for wmf_raw.mediawiki_page (duration: 00m 12s)
[12:15:03] (CR) Marostegui: [C: +2] db1187: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940935 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[12:15:31] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:15:58] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[25-54] - https://phabricator.wikimedia.org/T342533 (RobH)
[12:16:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['rdb1013.eqiad.wmnet']
[12:16:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49653 and previous config saved to /var/cache/conftool/dbconfig/20230724-121653-root.json
[12:16:55] (PS1) Stevemunene: Dummy db for new wmde airflow [labs/private] - https://gerrit.wikimedia.org/r/940936 (https://phabricator.wikimedia.org/T340648)
[12:17:01] (PS1) Stevemunene: Add dummy keytabs for new an-airflow1007 [labs/private] - https://gerrit.wikimedia.org/r/940937 (https://phabricator.wikimedia.org/T340648)
[12:17:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['rdb1014.eqiad.wmnet']
[12:17:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['rdb1013.eqiad.wmnet']
[12:17:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['rdb1014.eqiad.wmnet']
[12:17:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['rdb1013.eqiad.wmnet']
[12:17:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['rdb1013.eqiad.wmnet']
[12:18:05] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (SLyngshede-WMF) When I attempt a login I get the following: ` 2023-07-24 12:14...
[12:21:25] ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:21:52] ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (RobH)
[12:25:06] (PS4) Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661)
[12:25:15] (PS5) Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661)
[12:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49654 and previous config saved to /var/cache/conftool/dbconfig/20230724-122529-root.json
[12:25:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49655 and previous config saved to /var/cache/conftool/dbconfig/20230724-122536-root.json
[12:28:29] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/940932 (https://phabricator.wikimedia.org/T339370) (owner: Jelto)
[12:29:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940221
[12:29:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940221 (owner: TrainBranchBot)
[12:29:36] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Yes we had the same error before, I reported it in T320390#8930839. After my vacations the error was gone, so I assumed som...
[12:30:31] (PS1) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[12:30:37] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T342535 (Mabualruz)
[12:31:11] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T342535 (Mabualruz) @NatHillard @Jdlrobson Please provide approval
[12:31:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye
[12:31:31] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[12:31:43] (CR) Ori: "@MVernon: I think this can land now, and should make future roll-outs of thumbnail expirations a little safer. PTAL."
[puppet] - https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: Ori)
[12:31:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49656 and previous config saved to /var/cache/conftool/dbconfig/20230724-123158-root.json
[12:32:44] SRE, Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur) Open→Resolved
[12:35:29] (PS1) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[12:36:36] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye
[12:36:43] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[12:39:53] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (CodeReviewBot) jforrester opened https://gitlab...
[12:40:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 28458
[12:40:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49658 and previous config saved to /var/cache/conftool/dbconfig/20230724-124034-root.json
[12:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49659 and previous config saved to /var/cache/conftool/dbconfig/20230724-124040-root.json
[12:40:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28458
[12:41:19] SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) >>! In T297314#9037120, @JMeyb...
[12:47:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49660 and previous config saved to /var/cache/conftool/dbconfig/20230724-124703-root.json
[12:51:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940221 (owner: TrainBranchBot)
[12:55:22] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (RobH)
[12:55:35] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (RobH)
[12:55:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49661 and previous config saved to /var/cache/conftool/dbconfig/20230724-125538-root.json
[12:55:45] (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/939669 (owner: Ayounsi)
[12:55:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49662 and previous config saved to /var/cache/conftool/dbconfig/20230724-125545-root.json
[12:57:08] (CR) Volans: [C: +1] "I haven't tested but LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/940514 (owner: Ayounsi)
[12:59:13] (CR) Ssingh: [C: +2] team-traffic: add service restart alert for bird [alerts] - https://gerrit.wikimedia.org/r/940359 (owner: Ssingh)
[12:59:23] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (RobH)
[12:59:39] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install
cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (RobH)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, taavi, and mo_abualruz: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1300).
[13:00:05] aanzx and mo_abualruz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:38] * TheresNoTime can deploy
[13:00:46] o/
[13:01:55] aanzx: starting with yours :)
[13:02:00] Ok
[13:02:07] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/940464 (https://phabricator.wikimedia.org/T342516) (owner: Anzx)
[13:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49663 and previous config saved to /var/cache/conftool/dbconfig/20230724-130208-root.json
[13:02:48] (Merged) jenkins-bot: add citations, concordance, rhymes, reconstruction, therasus, namespaces for mywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/940464 (https://phabricator.wikimedia.org/T342516) (owner: Anzx)
[13:03:35] !log samtar@deploy1002 Started scap: Backport for [[gerrit:940464|add citations, concordance, rhymes, reconstruction, therasus, namespaces for mywiktionary (T342516)]]
[13:03:39] T342516: create additional namespaces on mywiktionary - https://phabricator.wikimedia.org/T342516
[13:03:53] !log depooling cp4052 for some ATS 9.2.1 testing - T339134
[13:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:57] T339134: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134
[13:04:59] !log samtar@deploy1002 anzx and samtar: Backport for
[[gerrit:940464|add citations, concordance, rhymes, reconstruction, therasus, namespaces for mywiktionary (T342516)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:05:02] aanzx: live on mwdebug, please test
[13:06:06] Ok
[13:07:46] ( and let me know if its okay to sync :) )
[13:08:44] TheresNoTime: ok to sync
[13:08:52] ack
[13:09:33] one sync'd, I will run `mwscript namespaceDupes.php mywiktionary`
[13:09:42] s/one/once
[13:10:12] Ok thanks
[13:10:42] nb. I am waiting on mo_abualruz for T336527 (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/939312/)
[13:10:43] T336527: Run a synthetic test for client side preferences - https://phabricator.wikimedia.org/T336527
[13:10:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49664 and previous config saved to /var/cache/conftool/dbconfig/20230724-131043-root.json
[13:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49665 and previous config saved to /var/cache/conftool/dbconfig/20230724-131050-root.json
[13:13:53] mo_abualruz: o/ just doing another deploy at the moment, then you're next :)
[13:14:37] hm, just to note, scap seems to be hanging after `13:09:34 Finished Running helmfile -e eqiad --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 36s)`
[13:15:13] (CR) Jelto: [C: +2] gitlab: alert on ci failure rate instead of failure ratio [alerts] - https://gerrit.wikimedia.org/r/940932 (https://phabricator.wikimedia.org/T339370) (owner: Jelto)
[13:15:17] That's probably the mw-api-int deployment, it's failing because of a limits issue
[13:15:44] TheresNoTime: I'll
revert the patch causing it while the patch resolving it is in review, and I'll redeploy
[13:15:51] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (ori) @MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swi...
[13:16:39] claime: ah yes, I see -int hasn't completed (https://phabricator.wikimedia.org/P49666) - will I need to restart this scap?
[13:16:54] TheresNoTime: No, I'll do the deployment once you're done
[13:16:56] (CR) Volans: "Post-merge -1, see inline for the details." [cookbooks] - https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall)
[13:17:00] (Merged) jenkins-bot: gitlab: alert on ci failure rate instead of failure ratio [alerts] - https://gerrit.wikimedia.org/r/940932 (https://phabricator.wikimedia.org/T339370) (owner: Jelto)
[13:17:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49667 and previous config saved to /var/cache/conftool/dbconfig/20230724-131712-root.json
[13:17:14] (PS1) Clément Goubert: Revert "mw-api-int: Raise php container CPU limits" [deployment-charts] - https://gerrit.wikimedia.org/r/940909
[13:17:22] (CR) Clément Goubert: [C: +2] Revert "mw-api-int: Raise php container CPU limits" [deployment-charts] - https://gerrit.wikimedia.org/r/940909 (owner: Clément Goubert)
[13:17:31] claime: Sorry, to be clear, do I need to stop my current scap sync that is hanging, or will it resume?
[13:18:01] (Merged) jenkins-bot: Revert "mw-api-int: Raise php container CPU limits" [deployment-charts] - https://gerrit.wikimedia.org/r/940909 (owner: Clément Goubert)
[13:18:44] TheresNoTime: it will once it fails
[13:19:26] ah yes, it has failed now
[13:21:50] (PS5) Samtar: Run a synthetic test for client side preferences [mediawiki-config] - https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: Mabualruz)
[13:23:23] (CR) Klausman: [C: +1] ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[13:25:04] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:940464|add citations, concordance, rhymes, reconstruction, therasus, namespaces for mywiktionary (T342516)]] (duration: 21m 28s)
[13:25:08] T342516: create additional namespaces on mywiktionary - https://phabricator.wikimedia.org/T342516
[13:25:35] !log `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php mywiktionary --fix` T342516
[13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49668 and previous config saved to /var/cache/conftool/dbconfig/20230724-132548-root.json
[13:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49669 and previous config saved to /var/cache/conftool/dbconfig/20230724-132555-root.json
[13:26:47] aanzx: can you test again?
[13:26:53] Ok
[13:26:58] (PS8) Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[13:27:00] (PS9) Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[13:27:14] PROBLEM - Check systemd state on mw1424 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:09] (CR) CI reject: [V: -1] [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: Jforrester)
[13:28:31] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1014.eqiad.wmnet with OS bullseye
[13:28:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye
[13:28:36] TheresNoTime: work fine
[13:28:38] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov...
[13:28:41] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov...
[13:29:05] aanzx: awesome :) there was an error (see https://phabricator.wikimedia.org/T342516#9037754), but I'm not sure of its impact
[13:29:35] claime: am I okay to do another deploy first?
[13:29:50] mo_abualruz: ready?
[13:30:29] TheresNoTime: Deploying the fix rn
[13:30:33] (CR) Ayounsi: [C: +2] Fix some pylint errors [software/homer/deploy] - https://gerrit.wikimedia.org/r/939669 (owner: Ayounsi)
[13:30:37] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye
[13:30:38] claime: ack, will wait
[13:30:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye
[13:30:45] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye
[13:30:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[13:30:50] SRE, ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye
[13:30:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[13:31:08] (CR) Ayounsi: [C: +2] Add python 3.11 support to Tox [software/homer/deploy] - https://gerrit.wikimedia.org/r/940514 (owner: Ayounsi)
[13:31:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[13:31:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[13:31:54] TheresNoTime: should be good now
[13:32:01] thanks
[13:32:09] mo_abualruz: starting yours now, will ping when its ready to test
[13:32:14] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: Mabualruz)
[13:32:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling
after maintenance', diff saved to https://phabricator.wikimedia.org/P49671 and previous config saved to /var/cache/conftool/dbconfig/20230724-133217-root.json
[13:32:53] (Merged) jenkins-bot: Run a synthetic test for client side preferences [mediawiki-config] - https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: Mabualruz)
[13:33:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:939312|Run a synthetic test for client side preferences (T336527 T339268)]]
[13:33:14] T339268: [anon prefs] Implement inline script - https://phabricator.wikimedia.org/T339268
[13:33:14] T336527: Run a synthetic test for client side preferences - https://phabricator.wikimedia.org/T336527
[13:33:40] TheresNoTime: did you figure out the namespaceDupes issue already?
[13:34:03] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (karapayneWMDE)
[13:34:13] taavi: https://phabricator.wikimedia.org/P49670 you mean?
[13:34:17] (if so, no)
[13:34:24] yes
[13:34:36] !log samtar@deploy1002 samtar and mabualruz: Backport for [[gerrit:939312|Run a synthetic test for client side preferences (T336527 T339268)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:34:50] mo_abualruz: live on mwdebug, can you test please?
[13:34:56] taavi: any ideas?
[13:35:11] SRE, ops-eqiad, DBA: db1198 faulty network cable - https://phabricator.wikimedia.org/T342129 (Marostegui)
[13:35:45] (CR) AOkoth: [C: +2] vrts: enable blackbox check on active_host only [puppet] - https://gerrit.wikimedia.org/r/940467 (https://phabricator.wikimedia.org/T342366) (owner: Jelto)
[13:35:58] TheresNoTime: yes, I had the same thing last week.
give me a second [13:36:36] mo_abualruz: let me know if you need any help testing the change/if this is your first deployment :) [13:38:30] !log run `taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php mywiktionary --fix` after purging null editing page #131577 for T342516 [13:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:33] T342516: create additional namespaces on mywiktionary - https://phabricator.wikimedia.org/T342516 [13:39:00] ah, thanks taavi [13:39:05] TheresNoTime: all sorted, I null-edited the problematic page which made the problematic rows disappear. that's T341993 btw [13:39:06] T341993: namespaceDupes.php can fail if new target does not have a linktarget entry - https://phabricator.wikimedia.org/T341993 [13:39:38] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Volans) The actual command that detects a Debian Installer is actually `grep -q "BOOT_IMAGE=debian-installer" /proc/cmdline`. Has this changes in bookworm? I've d... [13:40:24] that's been happening for a while, I wonder if we should block any new namespace changes to get some pressure on fixing it [13:40:38] is there a task for it? [13:40:50] mo_abualruz: I've done some tests, happy to sync if you say so [13:40:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49672 and previous config saved to /var/cache/conftool/dbconfig/20230724-134052-root.json [13:40:58] T341993 as I just said [13:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49673 and previous config saved to /var/cache/conftool/dbconfig/20230724-134059-root.json [13:41:35] * TheresNoTime should read.. 
^^' [13:42:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:43:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye [13:44:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1014.eqiad.wmnet with OS bullseye [13:44:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov... [13:44:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov... 
[13:45:31] (03PS1) 10Ilias Sarantopoulos: api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) [13:45:43] mo_abualruz: ping, just need you to test and OK your change [13:46:33] (03CR) 10Volans: sre.hosts.reimage: connect to the micro service port (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [13:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49674 and previous config saved to /var/cache/conftool/dbconfig/20230724-134721-root.json [13:48:12] (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [13:48:58] !log samtar@deploy1002 Sync cancelled. [13:49:01] Not syncing `939312: Run a synthetic test for client side preferences` (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/939312) (CC mo_abualruz, feel free to ping us again when you're around, and/or reschedule for another deployment window) [13:49:39] !log [[gerrit:939312]] not synced. T336527 T339268 [13:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:44] T339268: [anon prefs] Implement inline script - https://phabricator.wikimedia.org/T339268 [13:49:44] T336527: Run a synthetic test for client side preferences - https://phabricator.wikimedia.org/T336527 [13:50:01] TheresNoTime: Is the mw-api-int deployment working all right now? 
[13:50:25] claime: not been able to test since you deployed it [13:50:34] ok [13:51:29] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:51:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye [13:51:35] let me make sure then :D [13:51:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye [13:51:57] !log close UTC afternoon backport window [13:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:05] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:52:13] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:52:26] (03PS1) 10Samtar: Revert "Run a synthetic test for client side preferences" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940910 [13:52:50] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:53:00] claime: I need to do that revert ^, does that help? 
[13:53:00] all good [13:53:06] ah cool, I'll do that now anyway [13:53:07] TheresNoTime: You can go ahead [13:53:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940910 (owner: 10Samtar) [13:54:37] (03Merged) 10jenkins-bot: Revert "Run a synthetic test for client side preferences" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940910 (owner: 10Samtar) [13:54:56] !log samtar@deploy1002 Started scap: Backport for [[gerrit:940910|Revert "Run a synthetic test for client side preferences"]] [13:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49675 and previous config saved to /var/cache/conftool/dbconfig/20230724-135557-root.json [13:56:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49676 and previous config saved to /var/cache/conftool/dbconfig/20230724-135604-root.json [13:56:25] !log samtar@deploy1002 samtar: Backport for [[gerrit:940910|Revert "Run a synthetic test for client side preferences"]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:56:42] (whoops forgot to disable that step) [13:58:54] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:59:15] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Raise max cpu per pod to 10 for mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/940888 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [14:00:05] !log jclark@cumin1001 START - 
Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye [14:00:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye [14:00:40] (03PS1) 10Clément Goubert: Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 [14:01:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 (owner: 10Clément Goubert) [14:01:07] (03PS2) 10Clément Goubert: Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 [14:01:09] (03CR) 10CI reject: [V: 04-1] Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 (owner: 10Clément Goubert) [14:02:17] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:940910|Revert "Run a synthetic test for client side preferences"]] (duration: 07m 20s) [14:02:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49677 and previous config saved to /var/cache/conftool/dbconfig/20230724-140226-root.json [14:03:46] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10CodeReviewBot) apine merged https://gitlab.wiki... 
[14:03:46] it seems reverted before I got the chance to test [14:04:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:36] jouncebot: nowandnext [14:04:36] No deployments scheduled for the next 1 hour(s) and 25 minute(s) [14:04:36] In 1 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1530) [14:04:47] mo_abualruz: if you're free now, I can do it again? :) [14:04:51] if you still doing deployments I am available for testing [14:05:04] great that is awesome [14:05:39] (03PS1) 10Samtar: Revert "Revert "Run a synthetic test for client side preferences"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 [14:06:51] (03PS2) 10Samtar: Revert "Revert "Run a synthetic test for client side preferences"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 (https://phabricator.wikimedia.org/T336527) [14:07:06] (03CR) 10Mabualruz: [C: 03+1] Revert "Revert "Run a synthetic test for client side preferences"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 (https://phabricator.wikimedia.org/T336527) (owner: 10Samtar) [14:07:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jclark@cumin1001 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Remov... [14:07:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 (https://phabricator.wikimedia.org/T336527) (owner: 10Samtar) [14:07:55] (03CR) 10TrainBranchBot: "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 (https://phabricator.wikimedia.org/T336527) (owner: 10Samtar) [14:09:06] (03Merged) 10jenkins-bot: Revert "Revert "Run a synthetic test for client side preferences"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940912 (https://phabricator.wikimedia.org/T336527) (owner: 10Samtar) [14:09:24] !log samtar@deploy1002 Started scap: Backport for [[gerrit:940912|Revert "Revert "Run a synthetic test for client side preferences"" (T336527 T339268)]] [14:09:29] T339268: [anon prefs] Implement inline script - https://phabricator.wikimedia.org/T339268 [14:09:30] T336527: Run a synthetic test for client side preferences - https://phabricator.wikimedia.org/T336527 [14:09:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:14] !log samtar@deploy1002 samtar: Backport for [[gerrit:940912|Revert "Revert "Run a synthetic test for client side preferences"" (T336527 T339268)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:11:14] mo_abualruz: live on mwdebug for testing [14:11:24] ty [14:11:57] TheresNoTime: Can you ping me when done with backports so I can finish up on mw-api-int? 
No rush, just want to get it done right after [14:12:04] claime: sure [14:12:08] ty <3 [14:12:48] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:14:17] https://en.wikipedia.org/speed-tests/ does not show the new test files [14:14:57] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) >>! In T211661#9037678, @ori wrote: > @MatthewVernon: my understanding is that rewrite.py is currently setting expiry... [14:15:00] mo_abualruz: agreed, though I did note that https://en.wikipedia.org/speed-tests/Brazil.enwiki.1164571109/before/index.html works when using `mwdebug` [14:15:41] sorry I missed that from all the lines [14:15:51] I will be testing 1 moment [14:15:54] guessing https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/docroot/wikipedia.org/speed-tests/index.html needed to be updated too [14:16:11] I can sync this and you can do a follow-up to modify https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/docroot/wikipedia.org/speed-tests/index.html to link to the new pages? [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:37] all is as expected, do I need to post any results or are the logs sufficient?
[14:17:48] Nope, I will sync now :) [14:21:08] also https://en.wikipedia.org/speed-tests/Brazil.enwiki.1164571109/after-local-storage-json-parse/index.html works when using mwdebug [14:21:40] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) >>! In T211661#9037959, @Ladsgroup wrote: >>>! In T211661#9037678, @ori wrote: >> @MatthewVernon: my understanding... [14:23:30] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:940912|Revert "Revert "Run a synthetic test for client side preferences"" (T336527 T339268)]] (duration: 14m 05s) [14:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:23:35] T339268: [anon prefs] Implement inline script - https://phabricator.wikimedia.org/T339268 [14:23:36] T336527: Run a synthetic test for client side preferences - https://phabricator.wikimedia.org/T336527 [14:23:42] mo_abualruz: live :) all done [14:24:00] thanks a lot [14:24:49] you're welcome :-) [14:24:54] claime: all yours [14:24:59] thanks! [14:25:06] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Raise max cpu per pod to 10 for mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/940888 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [14:25:12] (03PS1) 10Cory Massaro: Bump evaluator version in order to re-deploy with newest image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 [14:25:44] (03PS1) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [14:26:07] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [14:27:35] (03Merged) 10jenkins-bot: admin_ng: Raise max cpu per pod to 10 for mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/940888 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [14:28:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:29:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:30:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:30:39] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:32:13] (03CR) 10Jforrester: [C: 03+1] Bump evaluator version in order to re-deploy with newest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 (owner: 10Cory Massaro) [14:32:33] (03CR) 10Cory Massaro: [C: 03+2] Bump evaluator version in order to re-deploy with newest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 (owner: 10Cory Massaro) [14:32:43] (03CR) 10Clément Goubert: [C: 03+2] Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 (owner: 10Clément Goubert) [14:32:50] (03CR) 10JMeybohm: "I can't really speak to this apart from that it is syntactically correct. 
But I would argue to change the version in helmfile.d/services/w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 (owner: 10Cory Massaro) [14:33:30] (03Merged) 10jenkins-bot: Bump evaluator version in order to re-deploy with newest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 (owner: 10Cory Massaro) [14:33:49] (03Merged) 10jenkins-bot: Revert "Revert "mw-api-int: Raise php container CPU limits"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940911 (owner: 10Clément Goubert) [14:34:16] (03CR) 10Cory Massaro: Bump evaluator version in order to re-deploy with newest image. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940951 (owner: 10Cory Massaro) [14:34:58] (03CR) 10JMeybohm: [C: 03+1] mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: 10Giuseppe Lavagetto) [14:34:58] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:35:28] 10SRE, 10Infrastructure-Foundations, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10cmooney) a:03cmooney [14:35:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:35:42] (03CR) 10JMeybohm: [C: 03+1] admin: add mw-misc namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:36:03] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:36:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10good first task: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (10joanna_borun) [14:36:33] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:36:44] !log apine@deploy1002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [14:37:17] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:39:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (10joanna_borun) [14:39:55] (03PS1) 10Vgutierrez: trafficserver: Disable SO_LINGER on incoming connections [puppet] - 10https://gerrit.wikimedia.org/r/940953 (https://phabricator.wikimedia.org/T339134) [14:40:34] (03CR) 10Ssingh: "Great catch! LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/940953 (https://phabricator.wikimedia.org/T339134) (owner: 10Vgutierrez) [14:40:36] (03CR) 10JMeybohm: [C: 03+1] "LGTM although I cant really speak to proper resourcing here." [deployment-charts] - 10https://gerrit.wikimedia.org/r/940199 (owner: 10Giuseppe Lavagetto) [14:41:38] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Disable SO_LINGER on incoming connections [puppet] - 10https://gerrit.wikimedia.org/r/940953 (https://phabricator.wikimedia.org/T339134) (owner: 10Vgutierrez) [14:44:32] !log Repooling cp4052 (upload) running ATS 9.2.1 - T339134 [14:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] T339134: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 [14:51:33] (03CR) 10JMeybohm: [C: 03+1] wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [14:53:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1014.eqiad.wmnet with OS bullseye [14:53:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Remov... 
[15:07:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jclark-ctr) [15:08:56] (03PS19) 10Ahmon Dancy: Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [15:09:14] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [15:15:01] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10Gehel) a:03bking [15:15:33] (03PS2) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [15:15:58] (03PS20) 10Ahmon Dancy: Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [15:16:01] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [15:16:11] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [15:17:14] (03PS3) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [15:19:08] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) Right. Now I remember. The initial expiration is indeed supposed to be set by Thumbor. The necessary functionality had some... 
[15:19:41] (03CR) 10Ahmon Dancy: "PCC results show no change on the the production deployment servers: https://puppet-compiler.wmflabs.org/output/361796/2070/" [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [15:20:11] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [15:25:16] (03PS4) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [15:25:40] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [15:28:22] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/940224 [15:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1530). [15:34:43] (03CR) 10Klausman: [C: 03+1] api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [15:47:36] (03PS1) 10Stevemunene: airflow-wmde: Add a postgresql database and user for airflow wmde [puppet] - 10https://gerrit.wikimedia.org/r/940961 (https://phabricator.wikimedia.org/T340648) [15:48:47] (03CR) 10JMeybohm: [C: 04-1] Kubernetes: add support for deployment apparmor profiles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [15:49:15] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/939347 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [15:50:58] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:52:05] (03CR) 10JMeybohm: "It's probably kind of a big ask but *if* we configure apparmor profiles on a per cluster basis we could include them into the per-cluster " [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [15:53:40] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) I also don't know how well Swift would handle 15k QPS of object metadata updates (cf T211661#8377883) [15:55:04] (03CR) 10JHathaway: [C: 03+1] "looks good, just a couple of questions" [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:00:56] !log Re-enabling disabled transport from knams to esams after fiber cleaning T337997 [16:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) a:03Jclark-ctr [16:02:20] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:10:56] (03CR) 10Elukey: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [16:38:59] !log restart ATS to pick up CR 940953: T339134 [16:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:39:05] T339134: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 [16:44:21] (03CR) 10Alexandros Kosiaris: [V: 03+1] Kubernetes: add support for deployment apparmor profiles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [16:48:12] (03CR) 10JMeybohm: [C: 04-1] Kubernetes: add support for deployment apparmor profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [16:49:19] (03PS8) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [16:50:08] (03CR) 10JHathaway: [C: 03+1] "looks good, one minor suggestion" [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:53:21] (03PS9) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [16:55:56] (03PS1) 10FNegri: irc: Handler customer logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) [16:56:22] (03PS2) 10FNegri: irc: Handle custom logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) [16:57:01] (03PS3) 10FNegri: irc: Handle custom logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) [16:59:25] (03CR) 10JHathaway: [C: 03+1] puppetdb-api: swap the production and next environments [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:00:06] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1700) [17:00:07] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T1700). [17:01:09] (03CR) 10CI reject: [V: 04-1] irc: Handle custom logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) (owner: 10FNegri) [17:05:55] (03CR) 10JHathaway: [C: 04-1] vtrs: drop bashisms and fix other CI issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [17:09:38] 10ops-knams: Inbound interface errors - https://phabricator.wikimedia.org/T342097 (10cmooney) 05Open→03Resolved a:03cmooney Resolving as link working following fibre cleaning / re-seat. [17:09:52] (03PS4) 10FNegri: irc: Handle custom logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) [17:11:06] (03PS5) 10FNegri: irc: Handle custom logging formatters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) [17:18:17] (03CR) 10JMeybohm: [C: 04-1] Kubernetes: add support for deployment apparmor profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [17:21:19] (03PS10) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [17:25:08] (03PS11) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [17:38:35] (03PS12) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles 
[puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [17:39:46] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42660/console" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [17:40:04] (03CR) 10DCausse: flink-zk: Initiate new flink::zookeeper role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [17:41:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T342565 (10phaultfinder) [17:47:40] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42661/console" [puppet] - 10https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester) [17:48:02] (03PS1) 10Jforrester: wikifunctions: Add NODE_EXTRA_CA_CERTS to the evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 [17:48:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] apache: Add 'view_urls' rewrite for /view URLs, enable on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/940245 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester) [17:51:31] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42662/console" [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester) [17:52:41] (03PS5) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [17:52:57] (03CR) 10JMeybohm: [C: 04-1] "James_F: I would make that change in helmfile.d/services/wikifunctions/values.yaml because if the config.public 
structure is added there a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [17:52:59] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Add NODE_EXTRA_CA_CERTS to the evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [17:53:37] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2023-07-13-234503-production [puppet] - 10https://gerrit.wikimedia.org/r/938021 (owner: 10BryanDavis) [17:54:09] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "I 'll deploy this tomorrow EU morning" [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester) [17:55:06] (03CR) 10Btullis: flink-zk: Initiate new flink::zookeeper role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [17:55:29] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [17:55:57] (03PS2) 10Jforrester: wikifunctions: Add NODE_EXTRA_CA_CERTS to the services [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 [17:57:19] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T342535 (10NHillard-WMF) Approved [17:59:03] (03PS6) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [18:01:56] (03CR) 10CI reject: [V: 04-1] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [18:02:01] (03CR) 10JMeybohm: [C: 04-1] "You only need to bump chart versions if you change anything in the charts directory" [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) 
[18:02:31] (03PS3) 10Jforrester: wikifunctions: Add NODE_EXTRA_CA_CERTS to the services [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 [18:02:34] (03CR) 10Jforrester: wikifunctions: Add NODE_EXTRA_CA_CERTS to the services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [18:03:54] (03CR) 10JMeybohm: [C: 03+1] wikifunctions: Add NODE_EXTRA_CA_CERTS to the services [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [18:04:12] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Add NODE_EXTRA_CA_CERTS to the services [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [18:05:14] (03Merged) 10jenkins-bot: wikifunctions: Add NODE_EXTRA_CA_CERTS to the services [deployment-charts] - 10https://gerrit.wikimedia.org/r/940977 (owner: 10Jforrester) [18:05:19] (03PS7) 10Andrew Bogott: Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) [18:06:41] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:07:10] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:07:43] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:08:29] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: split config into a separate class [puppet] - 10https://gerrit.wikimedia.org/r/940952 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [18:08:31] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:08:48] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:09:28] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:12:13] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: 
New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [18:47:14] 10SRE, 10Traffic: varnish-frontend-hospital crash upon ATS restart - https://phabricator.wikimedia.org/T342566 (10Vgutierrez) `counterexample 0 Backend_health - vcl-84635598-fffa-4367-86af-05856c435a6e.be_cp3064_esams_wmnet Went sick -------H 2 3 5 0.000000 0.000000 0 Back... [18:51:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:28] (03PS2) 10Majavah: build: Fix printed image name [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 [18:53:34] (03PS2) 10Majavah: Add php82 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) [18:53:40] (03PS1) 10Majavah: Use base directory as build context directory [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940988 [18:53:49] (03CR) 10Majavah: build: Fix printed image name (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 (owner: 10Majavah) [18:54:48] (03CR) 10Majavah: Add php82 images (033 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) (owner: 10Majavah) [18:56:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:31] (03PS1) 10Ssingh: varnish: gracefully handle 
varnish-frontend-hospital crash [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) [18:57:55] (03CR) 10CI reject: [V: 04-1] varnish: gracefully handle varnish-frontend-hospital crash [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh) [18:58:35] (03PS2) 10Krinkle: Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 [18:59:17] (03PS2) 10Ssingh: varnish: gracefully handle varnish-frontend-hospital crash [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) [19:00:43] (03PS2) 10Krinkle: Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873) [19:00:45] (03PS3) 10Krinkle: Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 [19:02:02] (03CR) 10CI reject: [V: 04-1] Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [19:02:46] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [19:12:03] (03CR) 10Krinkle: [C: 03+2] Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [19:12:44] (03Merged) 10jenkins-bot: Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [19:21:39] (03CR) 10Krinkle: [C: 03+2] Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [19:21:57] !log krinkle@deploy1002 Synchronized src/Profiler.php: Idada376134 
(duration: 06m 30s) [19:22:21] (03Merged) 10jenkins-bot: Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [19:23:18] (03CR) 10Krinkle: [C: 03+2] "Staged and verified via https://performance.wikimedia.org/excimer/profile/81e04fd2db8249b2. Hostname now included!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [19:24:32] (03PS1) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [19:25:36] (03CR) 10BryanDavis: [C: 03+2] build: Fix printed image name [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 (owner: 10Majavah) [19:26:11] (03Merged) 10jenkins-bot: build: Fix printed image name [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938291 (owner: 10Majavah) [19:29:18] (03PS2) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [19:29:56] !log krinkle@deploy1002 Synchronized lib/: Iaa0cb0c75d4 (duration: 06m 21s) [19:31:59] (03PS3) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [19:35:25] (03PS4) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [19:39:40] (03PS5) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [19:45:41] 10SRE, 10SRE-swift-storage, 10Thumbor: unsuccessful PNG thumbnail rendering from valid svg file - https://phabricator.wikimedia.org/T342549 (10doctaxon) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:20] indeed, nothing to do [20:04:16] !log dancy@deploy1002 Installing scap version "4.56.0" for 605 hosts [20:06:56] (03PS6) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [20:13:42] (03PS7) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [20:27:14] (03PS2) 10BryanDavis: Use base directory as build context directory [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940988 (owner: 10Majavah) [20:27:20] (03PS3) 10BryanDavis: Add php82 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) (owner: 10Majavah) [20:27:26] (03PS1) 10BryanDavis: cleanup: Use COPY instead of ADD [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940998 [20:28:42] (03CR) 10BryanDavis: [C: 03+2] "Trivial cleanup" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940998 (owner: 10BryanDavis) [20:29:16] (03Merged) 10jenkins-bot: cleanup: Use COPY instead of ADD [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940998 (owner: 10BryanDavis) [20:32:18] (03CR) 10BryanDavis: [C: 03+2] "Nice! We can pull things out into shared/ in follow ups." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940988 (owner: 10Majavah) [20:34:02] (03Merged) 10jenkins-bot: Use base directory as build context directory [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/940988 (owner: 10Majavah) [20:35:25] (03CR) 10BryanDavis: [C: 03+2] Add php82 images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) (owner: 10Majavah) [20:36:00] (03Merged) 10jenkins-bot: Add php82 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/938292 (https://phabricator.wikimedia.org/T335352) (owner: 10Majavah) [20:52:53] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) 05Open→03In progress a:03... [20:53:24] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) a:05Jdforrester-WMF→03cmass... [20:57:23] (03PS2) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [20:58:37] (03PS8) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T2100) [21:08:31] Deploying a quick update for T336027 to PS.php [21:13:21] (03PS1) 10BryanDavis: cleanup: Move common python files to shared/python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941004 [21:13:23] (03PS1) 10BryanDavis: cleanup: Move common ruby files to shared/ruby [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941005 [21:13:29] (03PS1) 10BryanDavis: cleanup: Move common node files to shared/node [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941006 [21:13:35] (03PS1) 10BryanDavis: cleanup: Move common generic files to shared/generic [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 [21:13:41] (03PS1) 10BryanDavis: cleanup: move nsswitch.conf to shared/etc/ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941008 [21:14:35] !log Deployed updated mitigation for T336027 [21:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:39] Deploying a patch for T341565 [21:18:32] (03PS9) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [21:19:04] (03CR) 10Majavah: [C: 03+2] cleanup: Move common python files to shared/python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941004 (owner: 10BryanDavis) [21:19:39] (03Merged) 10jenkins-bot: cleanup: Move common python files to shared/python [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941004 (owner: 10BryanDavis) [21:19:46] (03CR) 10Majavah: "Any reason this one has its own script instead of re-using generic/ from I927753fee6fc2fd6899835ca2c25ebf889f94dc8?" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941005 (owner: 10BryanDavis) [21:21:23] (03CR) 10Majavah: [C: 03+1] cleanup: Move common node files to shared/node [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941006 (owner: 10BryanDavis) [21:21:42] (03CR) 10Majavah: [C: 03+1] cleanup: Move common generic files to shared/generic [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 (owner: 10BryanDavis) [21:22:18] (03CR) 10Majavah: [C: 03+1] cleanup: move nsswitch.conf to shared/etc/ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941008 (owner: 10BryanDavis) [21:23:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Aklapper) [21:26:20] (03PS1) 10Btullis: Install the ceph-volume and hdparm packages on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941010 (https://phabricator.wikimedia.org/T330151) [21:28:13] !log Deployed patch for T341565 [21:28:15] (03CR) 10BryanDavis: [C: 04-1] cleanup: Move common ruby files to shared/ruby (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941005 (owner: 10BryanDavis) [21:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:55] (03PS2) 10BryanDavis: cleanup: Move common node files to 
shared/node [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941006 [21:34:01] (03PS2) 10BryanDavis: cleanup: Move common generic files to shared/generic [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 [21:34:07] (03PS2) 10BryanDavis: cleanup: move nsswitch.conf to shared/etc/ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941008 [21:34:34] (03Abandoned) 10BryanDavis: cleanup: Move common ruby files to shared/ruby [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941005 (owner: 10BryanDavis) [21:36:08] (03CR) 10BryanDavis: [C: 03+2] cleanup: Move common node files to shared/node [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941006 (owner: 10BryanDavis) [21:36:39] (03Merged) 10jenkins-bot: cleanup: Move common node files to shared/node [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941006 (owner: 10BryanDavis) [21:37:15] (03PS1) 10Btullis: Add a second copy of the bootstrap-osd keyring to cephosd [puppet] - 10https://gerrit.wikimedia.org/r/941011 (https://phabricator.wikimedia.org/T330151) [21:37:57] (03CR) 10CI reject: [V: 04-1] Add a second copy of the bootstrap-osd keyring to cephosd [puppet] - 10https://gerrit.wikimedia.org/r/941011 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [21:39:10] (03CR) 10Majavah: [C: 03+1] cleanup: Move common generic files to shared/generic [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 (owner: 10BryanDavis) [21:39:39] (03CR) 10BryanDavis: [C: 03+2] cleanup: Move common generic files to shared/generic [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 (owner: 10BryanDavis) [21:39:44] (03CR) 10BryanDavis: [C: 03+2] cleanup: move nsswitch.conf to shared/etc/ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941008 (owner: 10BryanDavis) [21:40:17] (03Merged) 10jenkins-bot: cleanup: Move common generic files to shared/generic 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941007 (owner: 10BryanDavis) [21:40:20] (03Merged) 10jenkins-bot: cleanup: move nsswitch.conf to shared/etc/ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941008 (owner: 10BryanDavis) [21:57:32] jouncebot: nowandnext [21:57:32] For the next 1 hour(s) and 2 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230724T2100) [21:57:32] In 4 hour(s) and 2 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0200) [21:57:59] maryum, sbassett: are you done with deploying? [22:00:00] (03PS1) 10Btullis: Exclude nagios checks of tmpfs mounts on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) [22:01:08] (03PS1) 10Zabe: HookContainer: avoid instantiation of handlers when calling register() [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940918 (https://phabricator.wikimedia.org/T341102) [22:01:40] (03PS1) 10Zabe: client: Avoid dynamically registering hook handlers [extensions/Wikibase] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940919 (https://phabricator.wikimedia.org/T341102) [22:17:40] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42673/console" [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [22:18:04] (03CR) 10Zabe: [C: 03+2] client: Avoid dynamically registering hook handlers [extensions/Wikibase] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940919 (https://phabricator.wikimedia.org/T341102) (owner: 10Zabe) [22:18:08] (03CR) 10Zabe: [C: 03+2] HookContainer: avoid instantiation of handlers when calling register() [core] (wmf/1.41.0-wmf.18) - 
10https://gerrit.wikimedia.org/r/940918 (https://phabricator.wikimedia.org/T341102) (owner: 10Zabe) [22:19:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42674/console" [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [22:22:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) [22:25:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) Hey all -- I'm helping Nat to get access to Superset, in particular dashboards/charts that do require private data access per https://wikitech.wikimedia.org/wiki/... [22:28:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) I'll note as well that we need the same access for Maryana Pinchuk ([[https://wikitech.wikimedia.org/wiki/User:Maryana|User:Maryana]]; mpinchuk@wikimedia.org). If... 
[22:33:29] (03Merged) 10jenkins-bot: client: Avoid dynamically registering hook handlers [extensions/Wikibase] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940919 (https://phabricator.wikimedia.org/T341102) (owner: 10Zabe) [22:33:37] (03Merged) 10jenkins-bot: HookContainer: avoid instantiation of handlers when calling register() [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940918 (https://phabricator.wikimedia.org/T341102) (owner: 10Zabe) [22:36:21] !log zabe@deploy1002 Started scap: Backport for [[gerrit:940919|client: Avoid dynamically registering hook handlers (T341102)]], [[gerrit:940918|HookContainer: avoid instantiation of handlers when calling register() (T341102 T340113 T339834)]] [22:36:28] T339834: HookContainer change: Error: Class 'PSExtensionHandler' not found - https://phabricator.wikimedia.org/T339834 [22:36:28] T341102: addWiki.php fails with CannotReplaceActiveServiceException - https://phabricator.wikimedia.org/T341102 [22:36:29] T340113: BadMethodCallException: Sessions are disabled for opensearch_desc entry point - https://phabricator.wikimedia.org/T340113 [22:37:46] !log zabe@deploy1002 zabe: Backport for [[gerrit:940919|client: Avoid dynamically registering hook handlers (T341102)]], [[gerrit:940918|HookContainer: avoid instantiation of handlers when calling register() (T341102 T340113 T339834)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experime [22:37:46] ntal XWD option) [22:39:25] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) Here is the last update from kanms support from 5 days ago. ` Hi Papaul, Thank you for your enquiry. I will forward this to our Security department and they will get back in... 
[22:42:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye [22:42:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye [22:46:20] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:940919|client: Avoid dynamically registering hook handlers (T341102)]], [[gerrit:940918|HookContainer: avoid instantiation of handlers when calling register() (T341102 T340113 T339834)]] (duration: 09m 59s) [22:46:27] T339834: HookContainer change: Error: Class 'PSExtensionHandler' not found - https://phabricator.wikimedia.org/T339834 [22:46:27] T341102: addWiki.php fails with CannotReplaceActiveServiceException - https://phabricator.wikimedia.org/T341102 [22:46:27] T340113: BadMethodCallException: Sessions are disabled for opensearch_desc entry point - https://phabricator.wikimedia.org/T340113 [22:47:21] (03PS9) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [22:47:23] (03PS10) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [22:47:25] (03PS1) 10Jforrester: InitialiseSettingsTest::testOnlyExistingWikis: Tell the user what setting is at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941019 [22:47:27] (03PS1) 10Jforrester: Drop 'wikifunctions-evaluator' reference, not exposed to the MW layer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941020 [22:47:31] (03PS1) 10Dreamy Jazz: CheckUser event table migration: Write new on group0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941021 (https://phabricator.wikimedia.org/T330158) [22:48:12] 
(03CR) 10CI reject: [V: 04-1] [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [22:49:23] (03PS11) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [22:49:43] zabe: Are you done deploying? Would like to push out a pair of prod-no-ops. [22:50:51] James_F: yeah, wanted to test addwiki with T335216, but i need to write the patch for that first, so feel free to do your stuff first [22:50:52] T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216 [22:50:59] * James_F nods. [22:51:01] Thanks! [22:51:12] (03CR) 10Jforrester: [C: 03+2] "Test-only code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941019 (owner: 10Jforrester) [22:51:31] (03CR) 10Jforrester: [C: 03+2] "Prod no-op." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941020 (owner: 10Jforrester) [22:51:51] (03Merged) 10jenkins-bot: InitialiseSettingsTest::testOnlyExistingWikis: Tell the user what setting is at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941019 (owner: 10Jforrester) [22:52:09] (03Merged) 10jenkins-bot: Drop 'wikifunctions-evaluator' reference, not exposed to the MW layer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941020 (owner: 10Jforrester) [22:55:45] (Clear.) 
[22:59:40] (03PS12) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [22:59:42] (03PS6) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [22:59:44] (03PS4) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [22:59:46] (03PS10) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [23:12:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye [23:12:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem... [23:21:57] btw I can't create T335216 now because I want to double check whether the creation depends on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/912796/ (which depends on some discussion outcome at https://incubator.wikimedia.org/wiki/Talk:Wt/btm) [23:21:57] T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216