[00:02:21] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048892 (owner: TrainBranchBot)
[00:03:47] RESOLVED: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:10] ops-codfw, cloud-services-team, DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T368211#9916196 (Andrew) Looks like this server needs a power supply replaced -- please let me know if we need to schedule downtime for this.
[01:13:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367856)', diff saved to https://phabricator.wikimedia.org/P65380 and previous config saved to /var/cache/conftool/dbconfig/20240624-011315-marostegui.json
[01:13:21] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:28:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65381 and previous config saved to /var/cache/conftool/dbconfig/20240624-012822-marostegui.json
[01:38:02] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916244 (phaultfinder)
[01:38:04] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916245 (phaultfinder)
[01:43:01] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916249 (phaultfinder)
[01:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65382 and previous config saved to /var/cache/conftool/dbconfig/20240624-014329-marostegui.json
[01:48:02] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916251 (phaultfinder)
[01:48:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:58:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367856)', diff saved to https://phabricator.wikimedia.org/P65383 and previous config saved to /var/cache/conftool/dbconfig/20240624-015836-marostegui.json
[01:58:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[01:58:42] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:58:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[01:58:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T367856)', diff saved to https://phabricator.wikimedia.org/P65384 and previous config saved to /var/cache/conftool/dbconfig/20240624-015859-marostegui.json
[02:28:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367856)', diff saved to https://phabricator.wikimedia.org/P65385 and previous config saved to /var/cache/conftool/dbconfig/20240624-041747-marostegui.json
[04:17:53] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[04:32:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65386 and previous config saved to /var/cache/conftool/dbconfig/20240624-043254-marostegui.json
[04:38:47] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:48:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65387 and previous config saved to /var/cache/conftool/dbconfig/20240624-044802-marostegui.json
[04:53:41] (PS1) Novem Linguae: enwiki: remove spamblacklistlog from abusefilter-helper [mediawiki-config] - https://gerrit.wikimedia.org/r/1048896 (https://phabricator.wikimedia.org/T367683)
[05:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367856)', diff saved to https://phabricator.wikimedia.org/P65388 and previous config saved to /var/cache/conftool/dbconfig/20240624-050309-marostegui.json
[05:03:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[05:03:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:03:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[05:35:49] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:30] SRE, DBA, Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9916371 (Joe) p:Medium→Unbreak! Dumps for english wikipedia caused a full sized outage as they saturated the network on saturday night: https://grafana.wikimedia.org/d/000000377/hos...
[05:41:37] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916375 (Joe)
[05:42:06] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916373 (Joe)
[05:42:43] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916376 (phaultfinder)
[05:42:44] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916377 (phaultfinder)
[05:47:45] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916378 (phaultfinder)
[05:52:47] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916379 (phaultfinder)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:20:49] (CR) Kevin Bazira: [C:+1] ml-services: update articlequality image and storage URI [deployment-charts] - https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[06:28:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:55:20] (CR) Awight: [C:+2] CommonSettings: Restore the original behaviour of Reference Previews [mediawiki-config] - https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: Func)
[06:55:31] (CR) Awight: [C:+1] CommonSettings: Restore the original behaviour of Reference Previews [mediawiki-config] - https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: Func)
[07:00:03] (PS2) Func: CommonSettings: Restore the original behaviour of Reference Previews [mediawiki-config] - https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419)
[07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] * kart_ is here
[07:01:53] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: KartikMistry)
[07:02:10] ah, I should have done +2 earlier.
[07:08:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:14:36] (PS1) Slyngshede: Offboarding for Lea Voget [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139)
[07:15:57] (CR) Muehlenhoff: Offboarding for Lea Voget (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139) (owner: Slyngshede)
[07:17:12] (PS2) Slyngshede: Offboarding for Lea Voget [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139)
[07:18:14] (CR) Slyngshede: Offboarding for Lea Voget (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139) (owner: Slyngshede)
[07:23:51] SRE, collaboration-services, LDAP-Access-Requests, Phabricator, Patch-For-Review: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9916418 (SLyngshede-WMF) I did find the username in data.yaml, see https://gerrit.wikimedia.org/r/c/operations/puppe...
[07:24:21] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139) (owner: Slyngshede)
[07:24:56] (CR) Slyngshede: [C:+2] Offboarding for Lea Voget [puppet] - https://gerrit.wikimedia.org/r/1049019 (https://phabricator.wikimedia.org/T368139) (owner: Slyngshede)
[07:25:18] (CR) CI reject: [V:-1] AX Language selector entrypoint: Fix AX URL [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: KartikMistry)
[07:30:57] (CR) Muehlenhoff: [C:+2] Remove obsolete motd [puppet] - https://gerrit.wikimedia.org/r/1047949 (owner: Muehlenhoff)
[07:32:20] (CR) Muehlenhoff: [C:+2] Move update-netboot-image.sh to the puppetserver module [puppet] - https://gerrit.wikimedia.org/r/1047495 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:38:40] (PS1) Kevin Bazira: ml-services: return logo-detection latency metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962)
[07:40:00] (CR) Muehlenhoff: [C:+2] Remove stat1004-1007 from site.pp [puppet] - https://gerrit.wikimedia.org/r/1047926 (https://phabricator.wikimedia.org/T353785) (owner: Muehlenhoff)
[07:43:06] (CR) Elukey: Add WMF customisations to the upstream ceph-csi-rbd chart (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: Btullis)
[07:45:05] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916447 (MatthewVernon) [if a proper fix is going to be time-consuming, at least only running dumps when there are staff around would b...
[07:47:18] ah CI failure and then my network failed.
[07:47:27] Can someone kill scap if needed.
[07:47:41] (PS1) Brouberol: cloudnative-pg: create charts only containing the CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/1049084 (https://phabricator.wikimedia.org/T364797)
[07:47:43] I mean ongoing scap process if blocked.
[07:47:43] (PS1) Brouberol: cloudnative-pg: disable RBAC management within the chart [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[07:47:45] (PS1) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[07:47:46] (PS1) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[07:48:04] (CR) Nikerabbit: "recheck" [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: KartikMistry)
[07:56:31] (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1048384 (owner: Slyngshede)
[07:57:12] SRE, SRE-Access-Requests: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9916467 (DSmit-WMF) Invalid→Open
[07:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:37] (CR) Slyngshede: [C:+2] Syncronize and update templates to support new version of Thymeleaf. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1048384 (owner: Slyngshede)
[08:02:20] (CR) Slyngshede: [V:+2 C:+2] Syncronize and update templates to support new version of Thymeleaf. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1048384 (owner: Slyngshede)
[08:04:01] (CR) Muehlenhoff: admin: Extend access for AndyRussG (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: Kamila Součková)
[08:04:16] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916477 (MatthewVernon)
[08:08:37] (CR) Elukey: "Left a comment about the metrics exporter, but the rest looks good!" [deployment-charts] - https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: Btullis)
[08:10:49] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:56] (CR) Muehlenhoff: "lists1001 got moved to role::insetup::buster for eventual decom and there are no further accesses in /var/log/nginx/acme-chief.secure.acce" [puppet] - https://gerrit.wikimedia.org/r/1047443 (https://phabricator.wikimedia.org/T365799) (owner: Muehlenhoff)
[08:18:06] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9916509 (dcaro) @CDanis No problem, let's get that data gathered :) > Given the limited duration expected for thi...
[08:28:03] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1048445 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:34:15] (PS1) Slyngshede: PAC4J secrets, required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487)
[08:36:00] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3027/co" [puppet] - https://gerrit.wikimedia.org/r/1048445 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:40:22] (PS1) Ayounsi: Netbox 4: fix ColorChoices import [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049097 (https://phabricator.wikimedia.org/T336275)
[08:40:55] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3029/co" [puppet] - https://gerrit.wikimedia.org/r/1048445 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:42:14] (CR) Nikerabbit: "recheck" [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: KartikMistry)
[08:43:01] (CR) Slyngshede: [V:+1 C:+2] C:apereo_cas Add CAS 7 properties [puppet] - https://gerrit.wikimedia.org/r/1048445 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:44:17] (CR) Ayounsi: [C:+2] "Self merging into dev. Tested on netbox-dev." [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049097 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:45:14] (Merged) jenkins-bot: Netbox 4: fix ColorChoices import [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1049097 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:51:42] (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[08:53:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:28] (CR) JMeybohm: opentelemetry: update k8s API IP addresses (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1048498 (owner: Kamila Součková)
[09:01:12] (PS2) JMeybohm: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1048453 (https://phabricator.wikimedia.org/T273507)
[09:01:34] (CR) JMeybohm: admin_ng: Bind to privileged PSP if restricted PSP is disabled (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1048453 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[09:02:04] (PS2) Slyngshede: PAC4J secrets, required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487)
[09:02:47] (CR) Slyngshede: "Forgot another set of keys." [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[09:05:00] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916740 (Ladsgroup) >>! In T368098#9916447, @MatthewVernon wrote: > [if a proper fix is going to be time-consuming, at least only runni...
[09:05:32] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916743 (Joe) >>! In T368098#9916447, @MatthewVernon wrote: > [if a proper fix is going to be time-consuming, at least only running dum...
[09:06:24] (PS3) Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[09:07:13] (CR) JMeybohm: [C:+2] admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1048453 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[09:08:47] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:15] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:10:23] (Merged) jenkins-bot: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1048453 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[09:10:24] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:12:13] (PS1) Slyngshede: C:apereo_cas PAC4J replication secrets [puppet] - https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487)
[09:13:22] SRE, Data-Engineering, Dumps-Generation: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9916796 (Joe) I think actually @Ladsgroup's proposal seems like the easier solution on the short term. @xcollazo do you see any complic...
[09:14:48] (PS1) Fabfur: hiera: upgrade haproxy to 2.8 on eqsin [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756)
[09:15:30] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[09:15:49] RESOLVED: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:36] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on checker.tools.wmflabs.org with reason: rebooting the toolschecker VM
[09:17:49] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on checker.tools.wmflabs.org with reason: rebooting the toolschecker VM
[09:18:03] (PS7) Stevemunene: wdqs: add the query main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364)
[09:19:39] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:20:53] !log taavi@cumin1002 START - Cookbook sre.hosts.remove-downtime for checker.tools.wmflabs.org
[09:20:53] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for checker.tools.wmflabs.org
[09:21:24] (CR) KCVelaga: "I have changed the changes" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: KCVelaga)
[09:22:05] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:22:39] (CR) Fabfur: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[09:24:54] (PS2) Slyngshede: C:apereo_cas oauth session encryption [puppet] - https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487)
[09:26:45] (PS3) Slyngshede: C:apereo_cas Additional secrets required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487)
[09:27:46] (PS4) Slyngshede: C:apereo_cas Additional secrets required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487)
[09:28:08] (PS2) Fabfur: hiera: upgrade haproxy to 2.8 on drmrs [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756)
[09:28:40] (PS5) Slyngshede: C:apereo_cas Additional secrets required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487)
[09:29:07] (CR) Giuseppe Lavagetto: [C:+1] mediawiki: Reimage scap proxies as videoscalers [puppet] - https://gerrit.wikimedia.org/r/1048376 (https://phabricator.wikimedia.org/T368058) (owner: Clément Goubert)
[09:29:49] (CR) Giuseppe Lavagetto: [C:+1] scap_proxies: move all proxies to videoscalers [puppet] - https://gerrit.wikimedia.org/r/1048377 (https://phabricator.wikimedia.org/T368058) (owner: Clément Goubert)
[09:29:50] (PS3) Slyngshede: C:apereo_cas oauth session encryption [puppet] - https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487)
[09:30:07] (CR) Fabfur: [C:-2] "Do not merge until T367963 isn't resolved" [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[09:30:14] (PS1) JMeybohm: Revert "admin_ng: Bind to privileged PSP if restricted PSP is disabled" [deployment-charts] - https://gerrit.wikimedia.org/r/1049107
[09:30:29] jouncebot: now
[09:30:30] No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[09:30:46] Anyone mind if I deploy now?
[09:31:20] (CR) Mvolz: [C:+2] Log at info level for citoid [deployment-charts] - https://gerrit.wikimedia.org/r/1047954 (https://phabricator.wikimedia.org/T364901) (owner: Mvolz)
[09:31:29] (PS4) Slyngshede: C:apereo_cas oauth session encryption [puppet] - https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487)
[09:31:33] mvolz: citoid, not mediawiki?
[09:31:39] claime: yeah
[09:31:44] good for me then
[09:32:09] I'm about to break possibly mediawiki deployments for a while working on scap proxies
[09:32:13] 👍️
[09:32:25] (Merged) jenkins-bot: Log at info level for citoid [deployment-charts] - https://gerrit.wikimedia.org/r/1047954 (https://phabricator.wikimedia.org/T364901) (owner: Mvolz)
[09:32:35] (PS1) Hnowlan: thumbor: reduce log spam [deployment-charts] - https://gerrit.wikimedia.org/r/1049108 (https://phabricator.wikimedia.org/T368180)
[09:32:40] (PS1) Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797)
[09:33:52] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[09:33:54] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[09:34:10] !log Reimaging scap::proxies, mediawiki deployments may be unavailable - T368058
[09:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:15] T368058: Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058
[09:35:40] (CR) JMeybohm: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1049108 (https://phabricator.wikimedia.org/T368180) (owner: Hnowlan)
[09:36:05] (CR) Hnowlan: [C:+2] thumbor: reduce log spam [deployment-charts] - https://gerrit.wikimedia.org/r/1049108 (https://phabricator.wikimedia.org/T368180) (owner: Hnowlan)
[09:37:26] (Merged) jenkins-bot: thumbor: reduce log spam [deployment-charts] - https://gerrit.wikimedia.org/r/1049108 (https://phabricator.wikimedia.org/T368180) (owner: Hnowlan)
[09:38:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:38:41] (CR) Clément Goubert: [C:+2] mediawiki: Reimage scap proxies as videoscalers [puppet] - https://gerrit.wikimedia.org/r/1048376 (https://phabricator.wikimedia.org/T368058) (owner: Clément Goubert)
[09:39:11] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[09:40:41] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048892 (owner: TrainBranchBot)
[09:41:24] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[09:41:44] (CR) JMeybohm: [C:+2] Revert "admin_ng: Bind to privileged PSP if restricted PSP is disabled" [deployment-charts] - https://gerrit.wikimedia.org/r/1049107 (owner: JMeybohm)
[09:41:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1407.eqiad.wmnet with OS buster
[09:42:03] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[09:42:06] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916968 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1407.eqiad.wm...
[09:42:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[09:42:22] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[09:42:22] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[09:42:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw1420.eqiad.wmnet with OS buster
[09:42:41] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1420.eqiad.wm...
[09:42:59] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916974 (phaultfinder)
[09:42:59] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916975 (phaultfinder)
[09:43:06] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:16] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:43:29] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[09:43:59] err I assume that's related to the PDU room work in mrs?
[09:44:08] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[09:44:20] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[09:44:52] (Merged) jenkins-bot: Revert "admin_ng: Bind to privileged PSP if restricted PSP is disabled" [deployment-charts] - https://gerrit.wikimedia.org/r/1049107 (owner: JMeybohm)
[09:44:54] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[09:44:57] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[09:45:30] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[09:45:48] (PS2) Clément Goubert: scap_proxies: move all proxies to videoscalers [puppet] - https://gerrit.wikimedia.org/r/1048377 (https://phabricator.wikimedia.org/T368058)
[09:45:48] (PS1) Clément Goubert: videoscalers: Pool 2 former appservers [puppet] - https://gerrit.wikimedia.org/r/1049110 (https://phabricator.wikimedia.org/T368058)
[09:46:06] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[09:48:01] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9916992 (phaultfinder)
[09:48:47] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:48:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1049111
[09:48:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1049111 (owner: TrainBranchBot)
[09:50:06] (PS1) JMeybohm: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1049112 (https://phabricator.wikimedia.org/T273507)
[09:50:52] (PS2) JMeybohm: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1049112 (https://phabricator.wikimedia.org/T273507)
[09:51:36] (CR) Muehlenhoff: [C:+2] Move puppetdb.tuning.conf template [puppet] - https://gerrit.wikimedia.org/r/1047498 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:52:59] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9917005 (phaultfinder)
[09:54:52] (PS2) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[09:54:52] (PS2) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[09:54:52] (PS2) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[09:54:53] (PS2) Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797)
[09:56:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1420.eqiad.wmnet with reason: host reimage
[09:56:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1407.eqiad.wmnet with reason: host reimage
[09:57:22] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[09:59:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1420.eqiad.wmnet with reason: host reimage
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1000)
[10:00:43] (CR) JMeybohm: [C:+2] admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1049112 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[10:01:25] (PS3) Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797)
[10:01:26] (PS3) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[10:01:26] (PS3) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[10:01:26] (PS3) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[10:03:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1407.eqiad.wmnet with reason: host reimage
[10:03:43] (Merged) jenkins-bot: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - https://gerrit.wikimedia.org/r/1049112 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[10:04:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[10:05:14] (PS4) Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797)
[10:05:14] (PS4) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] -
10https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) [10:05:14] (03PS4) 10Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) [10:05:15] (03PS4) 10Brouberol: cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) [10:05:16] (03PS1) 10Brouberol: cloudnative-pg: add CI fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797) [10:10:49] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:11:08] (03CR) 10Jforrester: [C:03+1] [noop] Remove $wgRedirectScript, not used since MediaWiki 1.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048855 (owner: 10Gergő Tisza) [10:14:05] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:15:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049111 (owner: 10TrainBranchBot) [10:16:18] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.19 ms [10:16:18] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.52 ms [10:16:20] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms [10:18:47] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:20] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.55 ms [10:19:32] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING WARNING - Packet loss = 33%, RTA = 85.27 ms [10:20:23] (03CR) 10Muehlenhoff: [C:03+2] Add a comment why puppet masters are listed for 
tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1047960 (owner: 10Muehlenhoff) [10:20:40] (03PS1) 10Hnowlan: logging: actually use critical when swiftclient/tornado.access use info [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1049117 (https://phabricator.wikimedia.org/T368180) [10:20:54] (03CR) 10Muehlenhoff: [C:03+2] Point codfw and codfw1dev to use the eqiad LDAP ro servers as well [puppet] - 10https://gerrit.wikimedia.org/r/1047488 (https://phabricator.wikimedia.org/T367861) (owner: 10Muehlenhoff) [10:22:34] (03CR) 10Hnowlan: [C:03+1] "The "only videoscaler" comment is probably kinda redundant at this point, but it's usefully explicit." [puppet] - 10https://gerrit.wikimedia.org/r/1049110 (https://phabricator.wikimedia.org/T368058) (owner: 10Clément Goubert) [10:23:01] (03CR) 10Hnowlan: [C:03+1] scap_proxies: move all proxies to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1048377 (https://phabricator.wikimedia.org/T368058) (owner: 10Clément Goubert) [10:23:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:25:42] (03PS6) 10Btullis: Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) [10:25:42] (03PS9) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) [10:25:42] (03PS13) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) [10:25:43] (03PS19) 10Btullis: Add a values file for the ceph-csi 
plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) [10:26:17] (03CR) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:28:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:29:19] (03CR) 10AOkoth: prometheus: puppetise sql_exporter (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [10:29:26] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/945872/3032/" [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [10:32:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1420.eqiad.wmnet with OS buster [10:32:16] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet... 
[10:32:38] (CR) Clément Goubert: [C:+2] videoscalers: Pool 2 former appservers [puppet] - https://gerrit.wikimedia.org/r/1049110 (https://phabricator.wikimedia.org/T368058) (owner: Clément Goubert)
[10:32:50] (CR) Muehlenhoff: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[10:32:58] (CR) Clément Goubert: [C:+2] scap_proxies: move all proxies to videoscalers [puppet] - https://gerrit.wikimedia.org/r/1048377 (https://phabricator.wikimedia.org/T368058) (owner: Clément Goubert)
[10:33:28] (PS1) JMeybohm: admin_ng: disableRestrictedPSP in staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1049123 (https://phabricator.wikimedia.org/T273507)
[10:35:43] (CR) Slyngshede: [C:+2] C:apereo_cas Additional secrets required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[10:35:44] (CR) Slyngshede: [V:+2 C:+2] C:apereo_cas Additional secrets required for CAS7 [labs/private] - https://gerrit.wikimedia.org/r/1049095 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[10:36:13] (CR) Jgiannelos: [C:+2] push-notifications: Bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1048446 (owner: Jgiannelos)
[10:36:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1407.eqiad.wmnet with OS buster
[10:37:04] SRE, MW-on-K8s, serviceops, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet...
[10:37:18] (03Merged) 10jenkins-bot: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048446 (owner: 10Jgiannelos) [10:39:19] !log pooling mw1420.eqiad.wmnet,mw1407.eqiad.wmnet as videoscalers - T368058 [10:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:24] T368058: Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058 [10:40:07] (03CR) 10Btullis: Initial import of ceph-csi-rbd chart for inspection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:40:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1407.eqiad.wmnet [10:40:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1407.eqiad.wmnet [10:40:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw1420.eqiad.wmnet [10:40:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1420.eqiad.wmnet [10:40:58] (03PS2) 10Brouberol: cloudnative-pg: Import the upstream chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) [10:41:01] (03PS2) 10Brouberol: Add upstream version annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797) [10:41:12] (03PS7) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) [10:41:15] (03PS1) 10Brouberol: Add values overrides for the cloudnative-pg-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037735 (https://phabricator.wikimedia.org/T364797) [10:41:17] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:41:23] 
!log cgoubert@cumin1002 conftool action : set/pooled=no:weight=10; selector: name=(mw1420.eqiad.wmnet|mw1407.eqiad.wmnet),dc=eqiad,cluster=videoscaler [10:41:35] (03Abandoned) 10Brouberol: Add values overrides for the cloudnative-pg-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037735 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:47:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:50:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3035/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:50:56] !log cgoubert@cumin1002 conftool action : set/pooled=no:weight=10; selector: name=(mw1420.eqiad.wmnet|mw1407.eqiad.wmnet),dc=eqiad,cluster=jobrunner [10:51:28] !log cgoubert@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=(mw1420.eqiad.wmnet|mw1407.eqiad.wmnet),dc=eqiad,cluster=jobrunner [10:54:00] (03PS1) 10Urbanecm: Growth: Enable CommunityConfiguration at idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049126 (https://phabricator.wikimedia.org/T366629) [10:54:01] (03PS1) 10Urbanecm: Growth: Enable CommunityConfiguration on round 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049127 (https://phabricator.wikimedia.org/T368121) [10:54:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3037/console" [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:55:24] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 
10Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917115 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [10:55:34] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3038/console" [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:55:44] (03PS1) 10Clément Goubert: appserver: Remove all canaries [puppet] - 10https://gerrit.wikimedia.org/r/1049128 (https://phabricator.wikimedia.org/T368058) [10:56:37] (03CR) 10Michael Große: [C:03+1] Growth: Enable CommunityConfiguration at idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049126 (https://phabricator.wikimedia.org/T366629) (owner: 10Urbanecm) [10:58:19] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [10:58:37] (03CR) 10Michael Große: [C:03+1] Growth: Enable CommunityConfiguration on round 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049127 (https://phabricator.wikimedia.org/T368121) (owner: 10Urbanecm) [11:01:24] (03PS5) 10Slyngshede: C:apereo_cas oauth session encryption [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) [11:01:26] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove AAAA records from an-redacteddb1001 - btullis@cumin1002" [11:04:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3039/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:08:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) 
(owner: 10Slyngshede) [11:09:17] (03PS1) 10Slyngshede: C:apereo_cas Add dummy secrets for CAS 7 [labs/private] - 10https://gerrit.wikimedia.org/r/1049129 (https://phabricator.wikimedia.org/T367487) [11:13:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove AAAA records from an-redacteddb1001 - btullis@cumin1002" [11:13:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:14:08] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1047090 (owner: 10Muehlenhoff) [11:17:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9917184 (10SCherukuwada) Manager is currently on leave, skip manager approves. [11:17:28] (03CR) 10JMeybohm: [C:03+2] admin_ng: disableRestrictedPSP in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049123 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [11:19:14] (03CR) 10Klausman: [C:03+2] kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [11:19:19] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Cloud VPS-specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1047090 (owner: 10Muehlenhoff) [11:20:43] (03CR) 10Slyngshede: [C:03+2] C:apereo_cas Add dummy secrets for CAS 7 [labs/private] - 10https://gerrit.wikimedia.org/r/1049129 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:20:45] (03Merged) 10jenkins-bot: admin_ng: disableRestrictedPSP in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049123 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [11:20:46] (03CR) 10Slyngshede: [V:03+2 C:03+2] C:apereo_cas Add dummy secrets for 
CAS 7 [labs/private] - 10https://gerrit.wikimedia.org/r/1049129 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:21:22] (03CR) 10Slyngshede: [V:03+1 C:03+2] C:apereo_cas oauth session encryption [puppet] - 10https://gerrit.wikimedia.org/r/1049103 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:22:48] (03CR) 10Santiago Faci: [C:04-1] "Almost done. Just a duplicate attribute in the stream configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: 10KCVelaga) [11:23:13] (03Merged) 10jenkins-bot: kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [11:24:28] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9917198 (10SGupta-WMF) Thank you @Scott_French for the detailed explanation .... [11:25:32] 06SRE, 10Cloud-Services, 06serviceops, 13Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917200 (10MoritzMuehlenhoff) CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be nee... [11:25:45] 06SRE, 10Cloud-Services, 06serviceops, 13Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917201 (10MoritzMuehlenhoff) [11:25:48] (03Abandoned) 10Muehlenhoff: Configure memcached on idp hosts to run as 'memcache' [puppet] - 10https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [11:26:43] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:28:22] (03CR) 10Muehlenhoff: [C:03+1] "I've merged changes to point the clients to use eqad instead of codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez) [11:28:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1049128 (https://phabricator.wikimedia.org/T368058) (owner: 10Clément Goubert) [11:29:56] (03PS3) 10Brouberol: cloudnative-pg: Import the upstream chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) [11:29:56] (03PS3) 10Brouberol: Add upstream version annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797) [11:29:56] (03PS2) 10Brouberol: cloudnative-pg: add CI fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797) [11:29:57] (03PS5) 10Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797) [11:29:58] (03PS5) 10Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) [11:29:59] (03PS5) 10Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) [11:30:02] (03PS5) 10Brouberol: cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) [11:30:06] (03PS8) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) [11:32:18] (03CR) 10DCausse: [C:03+1] sre.wdqs.data-transfer: new graph split 
instances [cookbooks] - 10https://gerrit.wikimedia.org/r/1048060 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [11:32:48] (03PS2) 10Brouberol: cloudnative-pg: create charts only containing the CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049084 (https://phabricator.wikimedia.org/T364797) [11:33:06] (03PS1) 10Jgiannelos: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049132 [11:33:13] (03CR) 10CI reject: [V:04-1] push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049132 (owner: 10Jgiannelos) [11:33:35] (03PS2) 10Jgiannelos: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049132 [11:33:45] (03CR) 10Jgiannelos: [C:03+2] push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049132 (owner: 10Jgiannelos) [11:34:39] (03Merged) 10jenkins-bot: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049132 (owner: 10Jgiannelos) [11:39:42] (03CR) 10Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [11:40:49] (03PS9) 10Hnowlan: Add shellbox-video vars/config, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) [11:42:53] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[11:44:42] !log installing php8.2 security updates
[11:44:44] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[11:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:08] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[11:46:30] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:49:33] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[11:49:37] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply
[11:50:19] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply
[11:51:01] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[11:51:36] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw1403.eqiad.wmnet
[11:51:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw1403.eqiad.wmnet
[11:51:45] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[11:51:50] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw1406.eqiad.wmnet
[11:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw1406.eqiad.wmnet
[11:52:16] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2339.codfw.wmnet
[11:52:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2339.codfw.wmnet
[11:52:18] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[11:52:28] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2358.codfw.wmnet
[11:52:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2358.codfw.wmnet
[11:52:35] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mw2360.codfw.wmnet
[11:52:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mw2360.codfw.wmnet
[11:53:57] SRE, Cassandra, Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9917253 (MoritzMuehlenhoff) Very nice!
[11:54:16] (PS1) Slyngshede: C:apereo_cas check for tomcat 10 on CAS 7 only variables. [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487)
[11:55:08] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[11:55:22] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[11:56:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[11:57:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[11:58:03] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3041/co" [puppet] - https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[11:59:29] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[11:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dreamy Jazz - https://phabricator.wikimedia.org/T368260 (10Dreamy_Jazz) 03NEW [12:01:52] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:02:59] (03PS2) 10Slyngshede: C:apereo_cas check for tomcat 10 on CAS 7 only variables. [puppet] - 10https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) [12:03:08] (03CR) 10Clément Goubert: [C:03+2] appserver: Remove all canaries [puppet] - 10https://gerrit.wikimedia.org/r/1049128 (https://phabricator.wikimedia.org/T368058) (owner: 10Clément Goubert) [12:03:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: 10KartikMistry) [12:03:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9917295 (10Dreamy_Jazz) [12:03:40] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3042/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:03:59] (03PS1) 10Muehlenhoff: irc.wikimedia.org: Stop sending broadcast events to the old buster nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049137 
(https://phabricator.wikimedia.org/T331702) [12:04:07] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:04:48] (03CR) 10Btullis: cloudnative-pg: create charts only containing the CRDs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049084 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:05:30] (03PS3) 10Slyngshede: C:apereo_cas check for tomcat 10 on CAS 7 only variables. [puppet] - 10https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) [12:05:54] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917301 (10Clement_Goubert) [12:05:55] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:06:06] (03CR) 10CI reject: [V:04-1] C:apereo_cas check for tomcat 10 on CAS 7 only variables. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049134 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:06:26] !log Setting all legacy appservers to inactive - T368058 [12:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:31] T368058: Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058 [12:06:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:29] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: cluster=appserver [12:07:39] !log Setting all legacy api_appservers to inactive - T368058 [12:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:49] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: cluster=api_appserver [12:08:33] (03CR) 10Btullis: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:08:41] (03PS1) 10Muehlenhoff: Remove apereo spec test [puppet] - 10https://gerrit.wikimedia.org/r/1049139 (https://phabricator.wikimedia.org/T367487) [15:27:47] (03PS1) 10Ayounsi: Netbox 4: scripts self.log renamed self.messages [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049204 (https://phabricator.wikimedia.org/T336275) [15:27:56] (03PS3) 10Arnaudb: mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) [15:28:12] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - 
https://phabricator.wikimedia.org/T367757#9918074 (10XiaoXiao-WMF) Do we have an estimate of when this access can be given to Muniza? My understanding is that we are sort of blocked by this access, if it can be resolved sooner tha... [15:29:03] (03PS2) 10Arnaudb: mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) [15:29:05] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [15:29:28] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [15:29:40] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9918085 (10LSobanski) a:03Dzahn [15:29:58] (03PS1) 10Eevans: restbase: Upgrade to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049205 (https://phabricator.wikimedia.org/T354970) [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1530). [15:33:52] (03PS1) 10Eevans: aqs1010: Upgrade (canary) to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049209 (https://phabricator.wikimedia.org/T354970) [15:35:15] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049209 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [15:36:49] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9918178 (10LSobanski) p:05Medium→03Low [15:37:00] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9918194 (10MoritzMuehlenhoff) >>! 
In T367757#9905191, @Dzahn wrote: > Then the next steps will be creating an SSH key and signing the access agreement. Here are the details: This isn't nee... [15:37:31] (03CR) 10Eevans: [C:03+2] aqs1010: Upgrade (canary) to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049209 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [15:40:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:49] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:50] ^ someone working on these? 
[15:41:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:41:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:41:40] It may be because we're sending metrics from mw-jobrunner to the statsd exporter, but not sure [15:41:47] It may also just be a bad query [15:42:10] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=titan1001&var-datasource=thanos&var-cluster=titan&viewPanel=8 [15:42:11] (03CR) 10Scott French: [C:03+2] kubernetes: split unavailable-replicas alert per team [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [15:42:30] something is saturating the network afaics [15:42:37] (03PS1) 10MVernon: hiera: add cluster_label to cephadm::rgw services [puppet] - 10https://gerrit.wikimedia.org/r/1049218 (https://phabricator.wikimedia.org/T279621) [15:42:54] we've been stable on the rate of metrics since 1520 [15:42:54] (03CR) 10Ssingh: [C:03+1] hiera: add cluster_label to cephadm::rgw services [puppet] - 10https://gerrit.wikimedia.org/r/1049218 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:42:57] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [15:43:01] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [15:43:17] the request rate is up though? 
[15:43:22] (03Merged) 10jenkins-bot: kubernetes: split unavailable-replicas alert per team [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [15:43:27] (03CR) 10MVernon: [C:03+2] hiera: add cluster_label to cephadm::rgw services [puppet] - 10https://gerrit.wikimedia.org/r/1049218 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:43:34] claime: yeah for sure [15:43:36] * sukhe holds breath [15:43:47] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:51] !log updated termination_state cache haproxy metrics, expect higher CD and CR rates - T367963 [15:43:53] claime: see? it worked :) [15:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:56] T367963: Investigate increase in CD termination state after upgrading eqsin/ulsfo to HAProxy 2.8.10 - https://phabricator.wikimedia.org/T367963 [15:44:00] lol [15:44:05] it smells like a bad query [15:44:09] Don't blow on it [15:45:42] ... heat? 
https://grafana.wikimedia.org/goto/SHnkSNwSg?orgId=1 [15:46:30] hnowlan: yes [15:46:33] fwiw titan1001 fell over last week also but that was a query that got OOMkilled, doesn't seem that was a risk here [15:46:40] 06SRE, 06Data-Engineering, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9918255 (10colewhite) [15:47:05] hnowlan: what that means is that the incoming NIC queues on titan1001 were saturating in brief bursts [15:47:08] so it was running close to line rate for a while [15:47:27] cdanis: ahhh okay, thanks [15:48:12] https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/prometheus/files/usr/local/bin/prometheus-nic-saturation-exporter.py [15:48:47] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:24] !log restart pybal on lvs1020 [15:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [15:50:27] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [15:52:41] (03CR) 10Eevans: [C:03+2] restbase: Upgrade to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049205 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [15:55:22] (03PS1) 10MVernon: hieradata: set apus service to apus not envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1049222 (https://phabricator.wikimedia.org/T279621) [15:56:16] (03CR) 10Ssingh: [C:03+1] hieradata: set apus service to apus not envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1049222 
(https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:56:49] (03CR) 10MVernon: [C:03+2] hieradata: set apus service to apus not envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1049222 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:57:13] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[1031,1034-1036].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [15:57:18] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [15:59:35] !log dancy@deploy1002 Installing scap version "4.89.0" for 248 hosts [15:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:43] !log restart pybal on lvs1020 [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:25] (03Abandoned) 10Slyngshede: data.yaml: Add user superpes to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/929623 (https://phabricator.wikimedia.org/T338468) (owner: 10Slyngshede) [16:00:40] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:00:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_apus.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:00:46] !log dancy@deploy1002 Installing scap version "4.89.0" for 1 hosts [16:01:00] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#9918318 (10Jhancock.wm) got the other rail type and will test it out this week. 
[16:01:01] !log dancy@deploy1002 Installation of scap version "4.89.0" completed for 1 hosts [16:01:39] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1049139 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [16:01:40] 06SRE, 06Data-Engineering, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9918320 (10xcollazo) Ok, I want to revise my previous assessment. Nominal network usage with bursts up to ~90 MB/s... [16:02:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - apus_443: Servers moss-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:02:19] ^ yeah, looking into it [16:03:32] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9918323 (10xcollazo) [16:03:36] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 114 connections established with conf1007.eqiad.wmnet:4001 (min=114) https://wikitech.wikimedia.org/wiki/PyBal [16:04:54] (03PS1) 10RLazarus: mesh.configuration: Stricter compliance with config proto [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049225 [16:05:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_apus.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:06:31] (03PS2) 10RLazarus: mesh.configuration: Stricter compliance with config proto [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049225 [16:06:58] (03CR) 10Ayounsi: [C:03+2] "Tested locally, self merging into dev" [software/netbox-extras] (dev) - 
10https://gerrit.wikimedia.org/r/1049204 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [16:08:30] (03Merged) 10jenkins-bot: Netbox 4: scripts self.log renamed self.messages [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049204 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [16:09:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:10:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9918333 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:10:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_apus.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:15:19] (03PS1) 10MVernon: hieradata: set apus ProxyFetch url to https [puppet] - 10https://gerrit.wikimedia.org/r/1049227 (https://phabricator.wikimedia.org/T279621) [16:15:38] (03CR) 10Ssingh: [C:03+1] "Thanks to @vgutierrez@wikimedia.org for pointing this out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1049227 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:15:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_apus.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:16:04] (03CR) 10MVernon: [C:03+2] hieradata: set apus ProxyFetch url to https [puppet] - 10https://gerrit.wikimedia.org/r/1049227 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:16:27] (03CR) 10Vgutierrez: [C:03+1] hieradata: set apus ProxyFetch url to https [puppet] - 10https://gerrit.wikimedia.org/r/1049227 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:18:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9918409 (10Jhancock.wm) @klausman when would you like to schedule this one? I am free Wednesday from 8 to 11 and Thursday/Friday from 8 to 12 cen... [16:19:20] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368209#9918415 (10Jhancock.wm) 05Open→03Resolved [16:20:51] !log restart pybal on lvs1020 [16:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:21:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9918431 (10klausman) All of those work for me, with a preference for Thursday, so I'll drain&power-off the machine Thu before 8am your time (1300... 
[16:24:10] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - apus_443: Servers moss-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:27:25] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bookworm [16:27:55] (03CR) 10BCornwall: [C:03+1] cp5017: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049168 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:28:07] (03CR) 10BCornwall: [C:03+1] cp5020: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049171 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:28:24] (03CR) 10BCornwall: [C:03+1] cp5018: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049169 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:28:31] (03CR) 10BCornwall: [C:03+1] cp5019: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049170 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:28:44] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt1053 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049231 (https://phabricator.wikimedia.org/T364457) [16:28:47] (03CR) 10BCornwall: [C:03+1] cp5021: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049172 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:28:54] (03CR) 10BCornwall: [C:03+1] cp5022: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049173 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:29:02] (03CR) 10BCornwall: [C:03+1] cp5023: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049174 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) 
[16:29:08] (03CR) 10BCornwall: [C:03+1] cp5024: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049175 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:31:00] (03CR) 10Dzahn: [C:03+2] gitlab: remove unused ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [16:31:46] (03PS1) 10BCornwall: depool ulsfo for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1049232 (https://phabricator.wikimedia.org/T365763) [16:32:16] (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt1053 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049231 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [16:32:22] (03PS2) 10BCornwall: depool eqsin for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1049232 (https://phabricator.wikimedia.org/T365763) [16:32:54] (03PS3) 10BCornwall: depool eqsin for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1049232 (https://phabricator.wikimedia.org/T365763) [16:33:10] (03CR) 10Ssingh: [C:03+1] depool eqsin for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1049232 (https://phabricator.wikimedia.org/T365763) (owner: 10BCornwall) [16:33:51] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9918514 (10cmooney) >>! In T326322#9615636, @fgiunchedi wrote: > Yeah having some ballpark numbers will be a great help @cmooney, unless... 
[16:33:56] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[1031,1034-1036].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [16:34:01] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [16:34:10] (03CR) 10Dzahn: "These names don't exist in DNS. While it shouldn't hurt the webserver I would still add them first so that we can add tests and run them." [puppet] - 10https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: 10Stevemunene) [16:35:11] (03CR) 10Dzahn: "fwiw I got an unexpected error here "You might have not enough privileges." as if this was still in drafts" [puppet] - 10https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: 10Stevemunene) [16:39:48] (03CR) 10Jasmine_: "thank you rzl!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049225 (owner: 10RLazarus) [16:40:00] (03PS2) 10Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) [16:41:36] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [16:43:09] (03CR) 10Clément Goubert: [C:03+1] Sampled tracing (0.1%) for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049202 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis) [16:44:21] jouncebot: nowandnext [16:44:21] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [16:44:21] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1700) [16:44:21] In 0 hour(s) and 15 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1700) [16:44:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [16:45:02] swfrench-wmf: hi, if you don't mind, I'm going to deploy this for mw-api-ext now: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1049201 [16:45:05] (03CR) 10Dzahn: [C:03+2] "no problems but the user has not been removed on gitlab1004 and gitlab2002" [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [16:45:08] (03PS1) 10MVernon: hiera: set hostname in apus probe [puppet] - 10https://gerrit.wikimedia.org/r/1049235 (https://phabricator.wikimedia.org/T279621) [16:45:13] should be done before your window starts [16:45:40] (03CR) 10Ssingh: [C:03+1] hiera: set hostname in apus probe [puppet] - 10https://gerrit.wikimedia.org/r/1049235 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:45:59] cdanis: ack, thanks for the heads-up [16:46:01] (03CR) 10CDanis: [C:03+2] Sampled tracing (0.1%) for mw-api-ext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049201 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis) [16:46:02] (03CR) 10MVernon: [C:03+2] hiera: set hostname in apus probe [puppet] - 10https://gerrit.wikimedia.org/r/1049235 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:47:02] (03Merged) 10jenkins-bot: Sampled tracing (0.1%) for mw-api-ext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049201 (https://phabricator.wikimedia.org/T367915) (owner: 10CDanis) [16:47:49] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [16:48:45] !log restart pybal on lvs1020 [16:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:19] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [16:49:43] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9918580 (10KFrancis) >>! 
In T367757#9917976, @MoritzMuehlenhoff wrote: >>>! In T367757#9913302, @kamila wrote: >> @KFrancis can you please make sure @MunizaA's NDA is signed? Thank you! >... [16:51:20] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:52:18] (03CR) 10Vgutierrez: [C:04-1] "new multi-dc.lua port parameter needs to be set for all use cases" [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [16:53:24] (03PS1) 10Dzahn: admin: add user mnz to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049236 (https://phabricator.wikimedia.org/T367757) [16:54:38] (03CR) 10Vgutierrez: [C:04-1] trafficserver: complete switch to mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [16:54:44] (03PS2) 10Bking: search-platform/data-platform: route alerts to data-platform-alerts IRC [puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) [16:55:23] (03PS1) 10MVernon: hiera: also use apus.discovery.wmnet for ProxyFetch [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) [16:55:28] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:57:01] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9918623 (10Dzahn) >>! In T367757#9918194, @MoritzMuehlenhoff wrote: > This isn't needed, there is already existing shell access for the mnz user. Who is approving whe... 
[16:58:42] (03PS2) 10MVernon: hiera: allow http 404 for apus healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) [16:59:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:00:04] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1700). [17:00:04] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T1700). [17:00:14] (03PS8) 10Ottomata: Configurably remove varnish handling of /beacon/event [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) [17:00:32] here, will start work shortly [17:00:34] (03CR) 10Ottomata: Configurably remove varnish handling of /beacon/event (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:00:58] cdanis: would you like me to hold off before merging anything? not sure how much time you need to verify [17:01:05] swfrench-wmf: you're good! 
[17:01:10] ack, thanks [17:01:20] (03CR) 10Scott French: [C:03+2] mediawiki: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:02:06] (03PS1) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T367757) [17:02:58] (03Merged) 10jenkins-bot: mediawiki: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:04:48] (03PS3) 10MVernon: hiera: set a suitable hostname for the health checks [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) [17:04:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9918658 (10Dzahn) >>! In T367757#9907306, @fkaelin wrote: > As for an approvers list, please add myself and @XiaoXiao-WMF (assuming your access is setup). Thanks, I... 
[17:05:33] (03PS4) 10MVernon: hiera: set a suitable hostname for the health checks and probes [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) [17:05:41] (03CR) 10Ssingh: [C:03+1] "Should return a 200, so let's try this" [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [17:07:24] (03CR) 10Ssingh: [C:03+1] hiera: set a suitable hostname for the health checks and probes [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [17:07:39] (03CR) 10MVernon: [C:03+2] hiera: set a suitable hostname for the health checks and probes [puppet] - 10https://gerrit.wikimedia.org/r/1049237 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [17:08:03] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:09:02] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:11:13] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:11:13] (03PS1) 10Cathal Mooney: Update gnmic config to allow processing of all interface stats [puppet] - 10https://gerrit.wikimedia.org/r/1049242 (https://phabricator.wikimedia.org/T326322) [17:13:29] !log restart pybal on lvs1020 and lvs1019 [17:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:32] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bookworm [17:13:37] andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[17:14:53] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 84 connections established with conf1007.eqiad.wmnet:4001 (min=84) https://wikitech.wikimedia.org/wiki/PyBal [17:15:41] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:16:19] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - apus_443: Servers moss-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:18:15] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [17:18:19] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:18:20] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [17:19:51] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:20:18] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:20:28] (03PS2) 10Reedy: hieradata/mediawiki.yaml: Move foundation.wm.o to wm.o docroot folder [puppet] - 10https://gerrit.wikimedia.org/r/1036744 (https://phabricator.wikimedia.org/T366005) [17:22:37] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 98 connections established with conf2004.codfw.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [17:22:46] (03CR) 10Scott French: [C:03+2] mediawiki: enable securityContext in all canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:23:18] !log restart pybal on lvs2013 [17:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:32] (03PS4) 10Scott French: mediawiki: enable securityContext in all canaries 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) [17:23:32] (03PS4) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [17:23:35] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - apus_443: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:24:08] (03CR) 10Bking: [C:03+1] cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [17:24:31] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:25:36] (03Merged) 10jenkins-bot: mediawiki: enable securityContext in all canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:25:43] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:26:23] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:26:29] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt1055 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049243 (https://phabricator.wikimedia.org/T364457) [17:27:16] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:27:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - apus_443: Servers moss-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:28:09] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply 
[17:28:13] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 80 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [17:28:33] (03CR) 10Fabfur: hiera: upgrade haproxy to 2.8 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [17:28:46] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bookworm [17:29:02] (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt1055 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049243 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [17:29:14] (03CR) 10Kamila Součková: [C:03+1] admin: add user mnz to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049236 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [17:30:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9918782 (10kamila) [17:32:20] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:33:10] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:34:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=apus,dc=codfw [17:34:31] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9918796 (10VRiley-WMF) [17:34:42] (03CR) 10Bking: [C:03+1] cloudnative-pg: set image values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [17:35:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9918794 (10VRiley-WMF) Hey @Andrew we upon looking at this ticket, I'm guessing we are seeing some 
thermal issues on this server? We can check the thermal pa... [17:37:43] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:38:19] (03CR) 10Bking: [C:03+1] "Much cleaner this way. Honestly, it could be an upstream CR as well (not to assign you work ;P )" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [17:38:21] (03CR) 10Kamila Součková: [C:03+1] admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [17:38:23] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:38:31] ^ fixing with revert of apus service [17:39:22] (03PS1) 10Ssingh: apus: set to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1049245 [17:39:52] (03PS1) 10Scott French: Revert "mediawiki: enable securityContext in all canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049246 (https://phabricator.wikimedia.org/T362978) [17:40:05] (03CR) 10Ssingh: [C:03+2] apus: set to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1049245 (owner: 10Ssingh) [17:40:28] (03PS3) 10SBassett: Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) [17:40:36] (03CR) 10SBassett: [C:03+1] Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) (owner: 10SBassett) [17:41:48] (03CR) 10Scott French: [C:03+2] Revert "mediawiki: enable securityContext in all canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049246 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:42:43] RECOVERY - 
PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:42:49] (03Merged) 10jenkins-bot: Revert "mediawiki: enable securityContext in all canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049246 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:43:23] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:43:42] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:43:48] (03PS4) 10SBassett: Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) [17:44:23] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9918849 (10xcollazo) Coming back to: >>! In T368098#9916796, @Joe wro... 
[17:44:34] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [17:44:53] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:45:06] (03CR) 10SBassett: [C:03+2] Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) (owner: 10SBassett) [17:45:06] (03PS1) 10Scott French: Revert "mediawiki: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049248 (https://phabricator.wikimedia.org/T362978) [17:45:31] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:46:05] (03Merged) 10jenkins-bot: Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) (owner: 10SBassett) [17:46:12] (03PS1) 10Dzahn: admin: fix email and realname for user mnz [puppet] - 10https://gerrit.wikimedia.org/r/1049250 (https://phabricator.wikimedia.org/T367757) [17:46:31] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9918865 (10Andrew) Sure, let's try moving it. You can do that at your convenience since the server doesn't have any workload on it. I don't have a theory ab... 
[17:46:32] !log sbassett@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:46:35] !log sbassett@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:46:51] !log sbassett@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:46:52] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:46:55] !log sbassett@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:47:07] !log sbassett@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:47:39] !log sbassett@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:47:40] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [17:47:41] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:47:41] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:47:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:47:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [17:47:56] !log sbassett@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:47:58] er hmm [17:48:03] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:48:16] hi [17:48:19] !log sukhe@cumin1002 END (PASS) - Cookbook 
sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs [17:48:30] !log sbassett@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:48:35] !log sbassett@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:48:37] !log sbassett@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:49:04] mutante: there was some tangential work on apus but nothing related to swift directly [17:49:05] in the process of reverting the changes I'd planned to make during this infra window (needs more work), but I don't think that should affect availability in any way [17:49:09] I am trying to figure out if it might be related [17:49:38] it's esams [17:49:42] so not likely to be related anyway [17:49:50] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [17:49:51] (03PS2) 10Scott French: Revert "mediawiki: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049248 (https://phabricator.wikimedia.org/T362978) [17:50:13] swfrench-wmf: your change was related to this? 
[17:50:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:50:29] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs [17:50:39] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [17:50:44] sukhe: I don't think so, no - that's what I meant above ^ [17:50:46] (03CR) 10SBassett: [C:03+2] "Deployed: https://sal.toolforge.org/log/QEpdS5ABhuQtenzvPQSG" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) (owner: 10SBassett) [17:50:47] looking, though [17:50:52] it's coming down fwiw [17:51:00] confirm it's esams [17:51:54] digg more [17:51:57] ing [17:52:20] "only" like 3 req/s but compared to base line that's high [17:52:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:52:58] :] [17:53:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=apus,dc=eqiad [17:53:26] (03PS3) 10Scott French: Revert "mediawiki: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049248 (https://phabricator.wikimedia.org/T362978) [17:53:34] mutante: yep [17:55:05] (03CR) 10Muehlenhoff: [C:03+1] admin: fix email and realname for user mnz [puppet] - 10https://gerrit.wikimedia.org/r/1049250 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [17:55:07] so, it's a trickle of errors from swift.discovery.wmnet seen from ~ all cache clusters? 
(but only esams was high enough to alert?) [17:55:16] swfrench-wmf: I am not sure it is related to your change but fwiw it started at 17;27 [17:55:56] I see it in all I think [17:56:19] https://grafana.wikimedia.org/goto/8XP0hHwIg?orgId=1 [17:56:21] example eqiad [17:56:24] magru as well [17:56:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [17:57:04] exactly, yeah - it seems like ATS ~ everywhere is seeing this, but only esams was enough to revert [17:57:17] *alert - heh, I have reverts on the brain at the moment [17:57:19] I still don't see the correlation with your patch though [17:57:41] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:57:54] !log restart on pybal lvs1019 [17:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:19] correct, yeah - a think I updated "close" to :27 was the canary deployment of mw-api-int in codfw. I'm having a hard time seeing how that (or indeed, the content of my change) would relate [17:58:37] (03CR) 10Brennen Bearnes: "Hmm. I guess maybe `ensure => 'absent'` isn't actually the way to remove a user?" [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [17:58:49] magru has values 0 but somehow still a graphh is drawn [18:00:05] (03CR) 10Scott French: [C:03+2] Revert "mediawiki: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049248 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:00:17] swfrench-wmf: there is only way to confirm :) [18:00:25] you can merge it again and mutante and I can see what happens :) [18:00:39] (03CR) 10Dzahn: [C:03+2] "I asked about this and it's by design that the user isn't removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [18:01:41] (03Merged) 10jenkins-bot: Revert "mediawiki: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049248 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:02:32] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:02:37] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [18:02:50] sukhe: so, the canary mw-api-int changes have been rolled back since ~ 17:45 or so and shouldn't have a "lingering" effect so to speak [18:02:59] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "d"} and A:restbase and A:eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:04:01] does anyone know what a 504 from swift means in this context? is that a gateway timeout emitted by ATS or a timeout upstream? 
[18:04:42] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet [18:04:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet [18:04:49] I am definitely seeing some swift issues [18:04:53] Jun 24 18:04:46 lvs1020 pybal[4145286]: [swift_80] INFO: Server ms-fe1012.eqiad.wmnet (enabled/partially up/not pooled) is up [18:05:14] ouch ouch [18:05:49] phew, no, I thought I changed lvs state for swift and not apus [18:05:51] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:05:53] that was heart-attack inducing [18:05:58] swfrench-wmf: 504 is from swift [18:06:15] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9918925 (10xcollazo) Ok after observing the dumps for a bit now, here... [18:06:27] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:06:40] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:07:05] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:07:57] (03CR) 10Jcrespo: "I felt the need to compliment you for *not* choosing disk usage, which I have been for ages predicating at WMF why that is really bad metr" [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [18:08:23] sukhe: almost done cleaning up after these reverts (need to make sure everything is clean so they don't get reapplied), then can help take a look [18:08:32] sukhe: swfrench-wmf: I found a SAL entry where swift-proxy on that machine was restarted and it links to this: [18:08:35] https://phabricator.wikimedia.org/T360913 [18:08:51] "Swift proxy server misbehaviour (no longer calling `accept`?)" [18:09:23] "We have over 
time sometimes seen rises in 5xx error codes reported by ATS from swift. " [18:09:29] mutante: nice find - I vaguely remember that happening [18:09:44] The effect of briefly repooling the server a couple of times can be seen on grafana - connection timeouts and failures. [18:09:52] was it .. briefly depooled? [18:10:26] mutante: good find [18:10:36] but I can't see where it was depooled, unless the other depools I am doing for apus count [18:10:38] tempting to restart swift-proxy [18:10:40] but they should not, it's a new service [18:12:27] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:13:28] ok [18:14:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:15:03] 10SRE-swift-storage: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#9918960 (10Dzahn) We saw a spike of 504s on ms-fe1012 today that resulted in paging. [18:15:13] ha [18:15:16] what timing [18:15:18] !incidents [18:15:18] 4791 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [18:15:18] 4790 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [18:15:19] 4789 (RESOLVED) [2x] ProbeDown sre (ip4 appservers-https:443 probes/service http_appservers-https_ip4) [18:15:21] !ack 4791 [18:15:22] 4791 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [18:15:30] acked .. [18:15:50] mutante: I guess +1 on the restart [18:16:06] mutante, sukhe: alright, I think all my changes are cleaned up. 
let me know if you want / need additional hands with this. [18:16:18] swfrench-wmf: thanks! [18:16:31] !log ms-fe1012:~] $ sudo systemctl restart swift-proxy T360931 [18:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:36] T360931: Add autorescue for ideal - https://phabricator.wikimedia.org/T360931 [18:16:39] arrr [18:16:50] mutante: we could use autorescue :P [18:16:56] !log ms-fe1012:~] $ sudo systemctl restart swift-proxy T360913 [18:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:01] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [18:17:13] sukhe: lol [18:17:30] (03CR) 10Brennen Bearnes: "Makes sense. Will just remove the remaining reference to it. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [18:17:30] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bookworm [18:18:47] hmm [18:19:01] where did you see the hostname earlier? [18:19:07] it's more than one [18:19:28] which one? [18:19:38] ms-fe1012 [18:19:41] oh right [18:19:44] there is also moss-fe [18:19:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:19:55] well :) [18:20:02] mutante: I think your restart did it [18:20:06] great, ack [18:20:10] nice find! 
[18:20:13] :) [18:20:26] SAL logging worth it once again [18:20:28] thanks, going AFK for the eqsin work at night now [18:20:33] cu [18:20:34] hopefully it is all silent [18:21:36] (03CR) 10Dzahn: [C:03+2] admin: add user mnz to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049236 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [18:21:55] (03PS1) 10Brennen Bearnes: gitlab: remove last reference to ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1049253 (https://phabricator.wikimedia.org/T355097) [18:22:24] (03CR) 10CI reject: [V:04-1] gitlab: remove last reference to ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1049253 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [18:22:52] 06SRE: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#9918991 (10Aklapper) [18:24:02] (03PS2) 10Brennen Bearnes: gitlab: remove last reference to ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1049253 (https://phabricator.wikimedia.org/T355097) [18:24:44] (03PS1) 10CDanis: Scrape temporary deployment of ebpf_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) [18:24:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) (owner: 10CDanis) [18:25:04] (03CR) 10CI reject: [V:04-1] Scrape temporary deployment of ebpf_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) (owner: 10CDanis) [18:28:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9919001 (10Dzahn) 05Open→03Resolved a:03Dzahn The user has been created on `an-airflow1002`, the host that has the research airflow role applied. This is r... 
[18:29:36] (03CR) 10Dzahn: [C:03+2] admin: fix email and realname for user mnz [puppet] - 10https://gerrit.wikimedia.org/r/1049250 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [18:29:49] (03PS2) 10Dzahn: admin: fix email and realname for user mnz [puppet] - 10https://gerrit.wikimedia.org/r/1049250 (https://phabricator.wikimedia.org/T367757) [18:30:52] (03PS2) 10CDanis: Scrape temporary deployment of ebpf_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) [18:31:27] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) (owner: 10CDanis) [18:31:38] jouncebot: nowandnext [18:31:38] sukhe, mutante: sorry to be the bearer of weird news, but something odd is still happening with swift: [18:31:38] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [18:31:38] In 1 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T2000) [18:31:47] https://usercontent.irccloud-cdn.com/file/szPXJyzb/traffic_sloshing.png [18:31:50] (03PS2) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) [18:32:13] ^ there's quite a bit of traffic sloshing across backends ms-fes in eqiad [18:32:21] starts at ~ the same time as the uptick in errors [18:32:45] is there anything that might be flapping them in / out of pooled state? 
[18:33:17] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9919027 (10Dzahn) [18:41:19] !log ladsgroup@deploy1002 ladsgroup: Rotate ChronologyProtector secret synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:43:35] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [18:44:34] (03CR) 10Dzahn: [C:04-1] "needs approval from data-engineering management since it's for their machines" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [18:44:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:45:10] ha [18:45:43] sukhe: see about comments about traffic sloshing ^ [18:45:43] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9919050 (10CDanis) Unfortunately `cloudcephosd1020` has too old a Debian / kernel for this without some more work. Is there another good machine t... [18:45:46] swfrench-wmf: just saw this now, I was away [18:45:54] no worries :) [18:45:54] (03CR) 10CDanis: [C:03+2] Scrape temporary deployment of ebpf_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1049254 (https://phabricator.wikimedia.org/T348643) (owner: 10CDanis) [18:45:55] yeah, something is up. 
I guess time to look into it [18:46:01] the restart didn't help clearly [18:46:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "d"} and A:restbase and A:eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:46:10] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [18:46:13] for a sustained period that is because there was certainly a drop [18:47:22] mutante: any other tickets related to this that you found? [18:47:41] Jun 24 18:47:21 lvs1019 pybal[2562454]: [swift_80] ERROR: Monitoring instance ProxyFetch reports server ms-fe1010.eqiad.wmnet (enabled/up/pooled) down: Getting http://localhost/monitoring/frontend took longer than 5 seconds. [18:48:07] what I can see around that time is pybal restart and the mw-api deployment [18:48:12] no [18:48:24] so you restarted 1012 right? [18:48:26] but we only did one host [18:48:28] and not this one [18:48:32] right [18:48:46] the log on that ticket looks like it's usually multiple [18:48:48] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Rotate ChronologyProtector secret (duration: 11m 33s) [18:49:00] yeah, 1010 as well [18:49:03] checking more [18:49:19] 1013 [18:49:28] matches the ticket [18:50:32] !log ms-fe1010,ms-fe1013 - restart swift-proxy - T360913 [18:50:33] !log sudo cumin -s1 -b60 'ms-fe1010*,ms-fe1013*' 'systemctl restart swift-proxy' [18:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:37] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [18:50:37] oh [18:50:39] you did it? 
[18:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:45] yes [18:50:48] ok great thanks [18:50:49] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:50:55] !log eevans@cumin1002 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching A:restbase-eqiad: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:51:41] 10SRE-swift-storage: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#9919104 (10Dzahn) p:05Medium→03High [18:51:45] changes priority to high [18:52:44] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for 15 hosts [18:52:49] (03CR) 10Bking: [C:03+2] "As the last change is extremely minor and the previous patch set was approved, I'm going to go ahead and merge." [puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [18:52:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts [18:52:51] swfrench-wmf: how does the "traffic sloshing" graph look ? 
[18:53:04] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [18:53:08] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [18:53:25] mutante: https://thanos.wikimedia.org/graph?g0.expr=rate(envoy_http_downstream_rq_total%7Bcluster%3D~%22swift%22%7D%5B2m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=30m&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D (sorry for link gore) [18:53:34] (03CR) 10Dzahn: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1049250 (https://phabricator.wikimedia.org/T367757) (owner: 10Dzahn) [18:54:21] split about if it is looking better or not [18:54:23] we will see [18:54:29] request rates started to converge, but not sure if that trend is going to continue =/ [18:54:34] looks flat though.. unlike the earlier screenshot [18:54:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:55:30] I need a different ring tone for CRIT vs RESOLVED :) [18:55:36] heh [18:55:40] but I dont think my phone can do that, heh [18:55:55] if you zoom out to 3h+ you can see what steady state looks like (~ uniform request rates) [18:56:19] https://grafana.wikimedia.org/goto/BNb3JNQSR?orgId=1 one more view [18:56:27] we just restarted 1010 [18:57:20] I see:) ack [18:57:25] (03PS1) 10Andrew Bogott: hieradata: Move cloudvirt1056 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049255 (https://phabricator.wikimedia.org/T364457) [18:57:32] "Connection establishment" is very clear [18:58:15] healthchecks on lvs also recovered 
[18:58:26] great [18:58:33] I suspect we might need to do 1011 as well [18:58:44] not fun [18:59:12] !log ms-fe1011 - restarted swift-proxy [18:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:02] icinga eventhandler -> if alert is CRIT then restart the proxy [19:02:25] the degree of pool-state "flappiness" seems to be improving: https://grafana.wikimedia.org/goto/6YTVxHwSg?orgId=1 [19:02:39] !log ms-fe1009: restart swift-proxy: T360913 [19:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:45] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [19:06:20] (03CR) 10Dzahn: [C:03+2] admin: add Daphne Smit to ldap_only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/1048552 (https://phabricator.wikimedia.org/T368140) (owner: 10Dzahn) [19:08:52] !log LDAP - added daphnesmit to group 'wmf' - Phabricator: added dsmit-wmf to WMF-NDA group T368140 [19:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:59] T368140: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140 [19:11:36] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#9919143 (10MoritzMuehlenhoff) p:05Triage→03Medium [19:14:37] it's looking stable [19:14:37] and even looking at swfrench-wmf's link above [19:14:37] ok, stepping out again [19:14:38] yeah, things are settling down nicely [19:14:38] hopefully things are quiet :) [19:16:48] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bookworm [19:16:58] (03CR) 10Andrew Bogott: [C:03+2] hieradata: Move cloudvirt1056 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1049255 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [19:25:32] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:26:53] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:27:06] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:27:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:32:44] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [19:35:19] PROBLEM - SSH on mwlog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:35:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [19:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 14.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:44:18] RECOVERY - SSH on mwlog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:47:15] FIRING: PHPFPMTooBusy: Not 
enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 13.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:47:20] PROBLEM - SSH on mwlog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:49:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:51:32] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:14] SRE, LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9919252 (Dzahn) Hi @DSmit-WMF you have been added to the "wmf" LDAP group. This means now you should be able to login at web-based tools like logstash or icinga. Feel free to try. 
[19:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 13.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:53:02] RECOVERY - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [19:53:12] RECOVERY - SSH on mwlog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:54:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:54] (PS1) Scott French: mw-on-k8s: extend envoy_cluster_name to new format [alerts] - https://gerrit.wikimedia.org/r/1049260 (https://phabricator.wikimedia.org/T362978) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. 
[20:00:32] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:16] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS bookworm [20:03:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 6.955s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:04:29] * sukhe whistles [20:04:34] About to get noisy [20:04:36] it's coming [20:05:00] RECOVERY - mailman3_runners on lists1004 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:05:01] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:05:06] PROBLEM - NTP peers on dns1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:05:08] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:05:10] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote 
host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:14] PROBLEM - NTP anycast VIP 10.3.0.6 ntp-b.anycast.wmnet on ntp-b.anycast.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:05:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers wikikube-worker1007.eqiad.wmnet, parse1013.eqiad.wmnet, mw1380.eqiad.wmnet, mw1492.eqiad.wmnet, mw1425.eqiad.wmnet, mw1409.eqiad.wmnet, mw1494.eqiad.wmnet, mw1473.eqiad.wmnet, mw1475.eqiad.wmnet, mw1459.eqiad.wmnet, mw1434.eqiad.wmnet, mw1439.eqiad.wmnet, mw1399.eqiad.wmnet, kubernetes1049.eqiad.wmnet, parse1002.eqiad.wmnet, [20:05:20] es1031.eqiad.wmnet, kubernetes1007.eqiad.wmnet, mw1470.eqiad.wmnet, mw1396.eqiad.wmnet, mw1491.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1480.eqiad.wmnet, mw1351.eqiad.wmnet, mw1482.eqiad.wmnet, mw1449.eqiad.wmnet, mw1495.eqiad.wmnet, mw1421.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1382.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1419.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, parse1004.eqiad.wmnet, mw1393.eqiad.wm [20:05:20] 463.eqiad.wmnet, kubernetes1052.eqiad.wmnet, kubernetes1032.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1397.eqiad.wmnet, mw1451.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [20:05:21] wow [20:05:22] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [20:05:22] PROBLEM - NTP peers on dns4003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:05:24] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:24] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:24] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:05:26] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:32] PROBLEM - NTP anycast VIP on 10.3.0.2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:05:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:05:48] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:48] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:05:50] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:05:58] PROBLEM - Recursive DNS on 208.80.153.48 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:05:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP 
CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:06:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:10] PROBLEM - Recursive DNS on 208.80.154.153 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:06:10] RECOVERY - NTP anycast VIP 10.3.0.6 ntp-b.anycast.wmnet on ntp-b.anycast.wmnet is OK: NTP OK: Offset -0.000176844 secs https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:06:12] PROBLEM - NTP anycast VIP 10.3.0.5 ntp-a.anycast.wmnet on ntp-a.anycast.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:06:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:18] RECOVERY - NTP peers on dns4003 is OK: NTP OK: Offset 0.00012398 secs https://wikitech.wikimedia.org/wiki/NTP [20:06:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers parse1013.eqiad.wmnet, mw1380.eqiad.wmnet, mw1492.eqiad.wmnet, mw1434.eqiad.wmnet, mw1462.eqiad.wmnet, mw1415.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, mw1399.eqiad.wmnet, mw1463.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1033.eqiad.wmnet, kubernetes10 [20:06:18] .wmnet, kubernetes1059.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1048.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad.wmnet, parse1006.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1031.eqiad.wmnet, mw1439.eqiad.wmnet, mw1381.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, kubernetes1037.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1379.eqiad.wmnet, 
kuberne [20:06:18] eqiad.wmnet, mw1409.eqiad.wmnet, kubernetes1036.eqiad.wmnet, mw1383.eqiad.wmnet, mw1354.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, parse1014.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.e https://wikitech.wikimedia.org/wiki/PyBal [20:06:20] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 107154 bytes in 3.170 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [20:06:20] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 134226 bytes in 1.503 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:06:22] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:22] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 201, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:06:24] PROBLEM - NTP peers on dns4004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:06:26] RECOVERY - NTP anycast VIP on 10.3.0.2 is OK: NTP OK: Offset -0.000176844 secs https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:06:28] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:32] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:40] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[20:06:42] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 559, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:06:56] RECOVERY - Recursive DNS on 208.80.153.48 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:07:02] RECOVERY - NTP anycast VIP 10.3.0.5 ntp-a.anycast.wmnet on ntp-a.anycast.wmnet is OK: NTP OK: Offset 0.000410074 secs https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [20:07:04] RECOVERY - NTP peers on dns1005 is OK: NTP OK: Offset -0.000176844 secs https://wikitech.wikimedia.org/wiki/NTP [20:07:06] RECOVERY - Recursive DNS on 208.80.154.153 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:07:07] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:10] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [20:07:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:07:16] RECOVERY - NTP peers on dns4004 is OK: NTP OK: Offset -0.000889177 secs https://wikitech.wikimedia.org/wiki/NTP [20:07:17] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 
- https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:07:24] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:07:32] PROBLEM - SSH on mwlog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:07:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:08:03] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:08:07] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:08:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52197 bytes in 0.712 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:08:14] PROBLEM - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 
[20:08:22] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [20:08:25] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:08:40] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 3.274% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:08:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:09:14] FIRING: [23x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:20] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 3.992s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:09:27] (PS1) Ladsgroup: rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.10) - 
https://gerrit.wikimedia.org/r/1049261 (https://phabricator.wikimedia.org/T368289) [20:10:01] FIRING: [23x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:32] FIRING: [27x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:45] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:10:58] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:11:17] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:11:30] I /win 19 [20:11:38] (CR) Ladsgroup: [C:+2] rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1049261 (https://phabricator.wikimedia.org/T368289) (owner: Ladsgroup) [20:11:41] FIRING: [2x] ProbeDown: Service miscweb1003:30443 has failed probes (http_annual_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:29] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:12:33] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:12:39] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:12:56] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:13:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 1.553% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:13:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:14:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [20:14:14] FIRING: [24x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [20:15:01] FIRING: [21x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:07] !incidents [20:15:08] 4793 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [20:15:08] 4796 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [20:15:08] 4795 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [20:15:08] 4794 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [20:15:09] 4792 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [20:15:09] 4791 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [20:15:09] 4790 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [20:15:10] 4789 (RESOLVED) [2x] ProbeDown sre (ip4 appservers-https:443 probes/service http_appservers-https_ip4) [20:15:12] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [20:15:12] !ack 4793 [20:15:13] 4793 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [20:15:32] FIRING: [22x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:16:24] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [20:17:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:17:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:18:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.32% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:18:19] !log taavi@snapshot1017 ~ $ sudo systemctl stop commons*.service [20:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page 
- https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:18:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:19:14] RESOLVED: [13x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.38s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:19:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [20:20:01] RESOLVED: [17x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:15] (CR) CI reject: [V:-1] rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1049261 (https://phabricator.wikimedia.org/T368289) (owner: Ladsgroup) [20:21:11] RESOLVED: [2x] ProbeDown: Service miscweb1003:30443 has failed probes (http_annual_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:12] (CR) Ladsgroup: [C:+2] "..." 
[core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1049261 (https://phabricator.wikimedia.org/T368289) (owner: Ladsgroup) [20:21:16] RECOVERY - Host mwlog1002 is UP: PING WARNING - Packet loss = 66%, RTA = 9.38 ms [20:21:34] RECOVERY - SSH on mwlog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:21:44] !log snapshot1017 - systemctl mask commonsrdf-dump ; systemctl mask commonsjson-dump T368098 [20:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:49] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [20:22:22] !log ladsgroup@deploy1002 Synchronized php-1.43.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php: (no justification provided) (duration: 11m 04s) [20:22:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:22:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:23:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:24:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [20:24:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.38s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:24:18] RECOVERY - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [20:29:52] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9919306 (10bd808) Some info on the sanitization (dropping columns, tables) and redaction (content hidden from end users via views) of the replicated... 
[20:34:40] (03Merged) 10jenkins-bot: rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049261 (https://phabricator.wikimedia.org/T368289) (owner: 10Ladsgroup) [20:36:39] !log bking@alert1001 install `ripgrep` deb pkg T368107 [20:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:44] T368107: DPE SRE: Increase visibility of Search Platform alerts - https://phabricator.wikimedia.org/T368107 [20:36:45] (03PS1) 10Ayounsi: Netbox 4: enable validators in test instance [puppet] - 10https://gerrit.wikimedia.org/r/1049263 (https://phabricator.wikimedia.org/T336275) [20:36:47] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9919314 (10bd808) What sort of data y'all are concerned about exposing to new roots on the replica db hosting nodes themselves? These boxes already e... [20:37:16] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049263 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [20:38:06] inflatador: why are we installing packages on (rather important) production hosts by hand? [20:38:33] taavi ripgrep is a fast, brute-force tool to sift thru configs [20:38:57] yes? [20:39:33] if it's useful on those hosts (or anywhere else), it should be installed via puppet [20:39:35] (03PS1) 10Ayounsi: Netbox 4: fix breaking change around connected_endpoints [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049264 (https://phabricator.wikimedia.org/T336275) [20:41:00] taavi I'm just trying to troubleshoot alert routing without having to reason thru spaghetti puppet code. 
I do think ripgrep should be installed everywhere, but we're talking about a utility that's available thru Debian [20:42:00] I'm not trying to hide what I'm doing and if anyone has any objections, they're free to reach out [20:42:45] (03CR) 10Ayounsi: [C:03+2] "Description on https://github.com/netbox-community/netbox/releases/tag/v3.3.0" [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049264 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [20:43:39] is debmonitor going to track that? [20:44:08] pretty sure it does, but will verify when I get back in ~20 [20:44:58] debmonitor yes, since it talks to apt/dpkg for the list of installed things [20:46:31] but it (and any of its dependencies) is going to be lost via any reimages, going to be in inconsistent states between alert1001/2001, etc. just install it via puppet if you need it, it's not hard and is going to significantly reduce confusion in the future [20:46:59] (03Merged) 10jenkins-bot: Netbox 4: fix breaking change around connected_endpoints [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1049264 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [20:47:07] there's a reason https://wikitech.wikimedia.org/wiki/Puppet_coding says "all package installs and configurations should happen via Puppet", and I don't see any reason why that's not the case here too [20:52:57] I think these concerns are a bit overblown for a single, self-contained package that doesn't run a service, but I'll remove the package once I'm done with it [20:53:34] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-redacteddb1001.eqiad.wmnet [20:54:35] I agree with taavi, I think we should still have reviews before installing new packages, given that it's also the main monitoring server and not a random test host or just one of a cluster. [20:54:39] and in the future, I will start a discussion about installing it fleet-wide. 
Thank you for the feedback taavi and mutante [20:54:53] we can of course argue that the policy should be revised though [20:55:19] I think inflatador logging it helps but yeah, a shared common host such as alert1001 (and not a host that is team-specific and not critical) should go via Puppet [20:55:53] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9919385 (10ssingh) This happened today again, starting at 20:05 and re... [20:56:32] I'm not arguing against the policy, it's just the effort of "start making a case to install a package fleet wide" vs "install a package that will allow me to solve a specific problem much more quickly" [20:58:42] Anyway, feel free to share this feedback with my team lead or anyone else if you like. I just want everyone to understand that I'm not running around randomly installing packages, but installing a specific package on a specific host with a link to the task I'm working [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240624T2100) [21:00:13] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [21:00:31] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [21:00:32] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:03:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-redacteddb1001.eqiad.wmnet [21:08:19] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9919438 (10Dzahn) This took basically everything down. Wikis returned... [21:09:47] 06SRE, 10DNS, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9919440 (10Dzahn) [21:10:42] 06SRE, 10DNS, 10fundraising-tech-ops, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9919442 (10Dzahn) [21:14:19] 06SRE, 10DNS, 10fundraising-tech-ops, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9919450 (10Dzahn) Hello, fundraising-tech-ops, is https://benefactors.wikimedia.org/ still being used somehow for email campaigns? (mandrillapp.com ?) Would you agree with the sugges... [21:16:08] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9919456 (10Dzahn) 05In progress→03Resolved a:03Dzahn Also added Daphne to the [[ https://phabricator.wikimedia.org/project/members/61/ | wmf-nda ]] group for access to private tickets here in... 
[21:17:32] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9919459 (10Dzahn) >>! In T368139#9916418, @SLyngshede-WMF wrote: > I did find the username in data.yaml, see https://gerrit.wikimedia.org/r/... [21:17:59] (03PS1) 10Ahmon Dancy: buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) [21:18:11] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:18:19] (03CR) 10CI reject: [V:04-1] buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:18:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9919461 (10BCornwall) a:05Papaul→03Jhancock.wm [21:19:26] (03PS2) 10Ahmon Dancy: buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) [21:19:44] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9919463 (10Dzahn) a:03Mcastro [21:19:47] (03CR) 10CI reject: [V:04-1] buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:20:39] (03PS3) 10Ahmon Dancy: buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) [21:21:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919468 (10Dzahn) [21:22:03] (03CR) 10Ahmon 
Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:24:38] (03PS4) 10Ahmon Dancy: buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) [21:24:43] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [21:31:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919486 (10Dzahn) Hello, does this have approval from one of the owners of analytics-privatedata? ` approval: - Olja Dimitrjevic -... [21:31:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919473 (10Dzahn) - Confirmed approving party via orgchart in Betterworks. - Tagging Data-Engineering for group approval [21:34:55] !log bking@alerts1001 uninstall deb pkg `ripgrep` T368107 [21:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:00] T368107: DPE SRE: Increase visibility of Search Platform alerts - https://phabricator.wikimedia.org/T368107 [21:35:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919504 (10Dzahn) Hello @Dreamy_Jazz are you requesting membership with or without Kerberos? Here is more on the difference: https://wikitech.wiki... 
[21:36:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919505 (10Dzahn) 05Open→03In progress [21:36:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9919506 (10Dzahn) p:05Triage→03High [21:39:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9919508 (10Dzahn) Hello @cwylo are you requesting membership with or without Kerberos? Here is more on the difference: https://wik... [21:40:34] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9919509 (10Dzahn) 05Open→03In progress p:05Triage→03High [21:40:34] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9919513 (10Dzahn) [21:42:44] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9919514 (10Dzahn) a:03leila Hello Leila, does this have your approval? It's just about the web-based logins, right? [21:43:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9919526 (10Dzahn) a:03AndyRussG [21:43:07] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9919527 (10Dzahn) 05Open→03In progress [21:43:29] (03CR) 10Dzahn: [C:04-1] "We are waiting for email address, I will amend and merge once we get that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [21:45:26] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9919533 (10Dzahn) [21:45:47] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9919530 (10Dzahn) @thcipriani A request for addition to the deployment group for your consideration. [21:46:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9919535 (10Dzahn) confirmed approval chain in betterworks [21:46:19] (03CR) 10Dzahn: [C:03+1] "this has the required approvals now" [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:45] (03PS2) 10Dzahn: admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:50:29] (03CR) 10CI reject: [V:04-1] admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:50:35] (03PS3) 10Cwhite: admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) [21:51:35] (03PS4) 10Dzahn: admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 
(https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:52:19] (03CR) 10CI reject: [V:04-1] admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:53:01] (03PS5) 10Dzahn: admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:54:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9919544 (10Dzahn) 05Stalled→03In progress [21:54:13] (03CR) 10Dzahn: [C:03+2] admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) (owner: 10Cwhite) [21:56:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9919555 (10Dzahn) You have been added to the group as requested. The assumption was made that this is for "ssh login to analytics client servers (AKA... 
[21:57:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9919569 (10Dzahn) [21:57:16] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9919570 (10Dzahn) 05In progress→03Resolved [21:58:35] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9919571 (10Dzahn) a:05kamila→03Milimetric [22:02:50] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919586 (10RobH) [22:03:19] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9919587 (10Papaul) We replace PEM0 today with the new one that was send to us from Juniper same error. When we move PEM0 to the PDU where PEM1 is plugged in the error clear. So I open ticker... [22:03:49] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919588 (10RobH) [22:11:38] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash2001.codfw.wmnet [22:18:31] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [22:19:20] (03PS1) 10Cwhite: logstash: fix curator typo [puppet] - 10https://gerrit.wikimedia.org/r/1049272 (https://phabricator.wikimedia.org/T368180) [22:20:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9919609 (10thcipriani) Reason for access looks good to me: approved. 
[22:24:42] (03CR) 10Cwhite: [C:03+2] logstash: fix curator typo [puppet] - 10https://gerrit.wikimedia.org/r/1049272 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [22:24:47] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:26:13] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:26:13] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:26:14] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2001.codfw.wmnet [22:26:58] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash2002.codfw.wmnet [22:38:31] (03PS33) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [22:38:36] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [22:41:17] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:46:15] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [22:46:15] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:46:16] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2002.codfw.wmnet 
[22:46:51] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash2003.codfw.wmnet [22:48:51] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9919665 (10Scott_French) I just spot-checked some of the `runPrimaryTr... [22:53:17] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [22:54:14] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:56:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:57:16] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [23:00:04] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [23:00:04] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:00:05] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash2003.codfw.wmnet [23:00:45] (03CR) 10BCornwall: [C:03+2] depool eqsin for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1049232 
(https://phabricator.wikimedia.org/T365763) (owner: 10BCornwall) [23:01:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:02:05] !log Running authdns-update on dns1004 to depool eqsin - T365763 [23:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:16] T365763: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763 [23:02:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:02:39] oh boy [23:02:50] and we haven't even started yet ha [23:03:11] turn the radio up louder [23:03:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:03:32] fun fun [23:03:36] ugh [23:03:39] this is not related to us anyway but yeah [23:04:59] let's see how it plays along [23:07:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:07:32] :) [23:07:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:07:41] doubt this is the end of it [23:10:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:10:32] PROBLEM - PyBal backends health check on lvs1019 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:10:39] hmmm [23:13:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:14:01] no data on the RED dash? [23:14:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:14:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:17:19] brett: yeah, it should be in this https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 [23:17:25] it's flapping currently [23:17:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:17:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1437.eqiad.wmnet, mw1438.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:18:01] I had meant there was no data returned in the graphs. But it seems back now after waiting a bit... [23:18:14] oh really. 
hmm [23:18:35] I wonder if it's a symptom of some other storm :) [23:21:36] (03PS1) 10Cwhite: logstash: clean up remnants of logstash200[123] [puppet] - 10https://gerrit.wikimedia.org/r/1049274 (https://phabricator.wikimedia.org/T368327) [23:29:14] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:33:23] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919721 (10BCornwall) [23:33:52] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919722 (10BCornwall) [23:38:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049275 [23:38:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049275 (owner: 10TrainBranchBot) [23:38:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:38:36] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:41:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1446.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:41:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1445.eqiad.wmnet, mw1446.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [23:45:04] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [23:48:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:48:45] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [23:50:32] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed