[00:00:19] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:01:18] (03PS2) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:01:42] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:03:40] (03PS3) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:05:44] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:13:34] (03PS4) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:15:40] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:17:09] (03PS5) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:28:05] (03PS1) 10Andrew Bogott: renamed the misnamed control.pp hiera file, added a fake password [labs/private] - 10https://gerrit.wikimedia.org/r/905330 [00:28:26] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] renamed the misnamed control.pp hiera file, added a fake password [labs/private] - 
10https://gerrit.wikimedia.org/r/905330 (owner: 10Andrew Bogott) [00:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:47] (03PS6) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:37:01] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:37:31] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:29] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:39:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904917 [00:39:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904917 (owner: 10TrainBranchBot) [00:39:40] (03PS7) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:40:09] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:42:00] (03PS1) 10Andrew Bogott: Add fake profile::wmcs::services::maintain_dbusers::tools_replica_cnf_htpassword [labs/private] - 10https://gerrit.wikimedia.org/r/905331 [00:44:19] (03PS8) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in 
codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:44:44] (03CR) 10CI reject: [V: 04-1] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:45:52] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add fake profile::wmcs::services::maintain_dbusers::tools_replica_cnf_htpassword [labs/private] - 10https://gerrit.wikimedia.org/r/905331 (owner: 10Andrew Bogott) [00:47:26] (03PS9) 10Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) [00:51:50] (03PS1) 10Andrew Bogott: Add a missing : [labs/private] - 10https://gerrit.wikimedia.org/r/905332 [00:52:07] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add a missing : [labs/private] - 10https://gerrit.wikimedia.org/r/905332 (owner: 10Andrew Bogott) [00:53:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [00:54:47] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/905327/40514/" [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [00:56:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904917 (owner: 10TrainBranchBot) [01:07:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10phaultfinder) [01:30:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0200)
[02:07:30] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:57] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.3 [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/904919 (https://phabricator.wikimedia.org/T330209)
[02:08:03] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.3 [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/904919 (https://phabricator.wikimedia.org/T330209) (owner: TrainBranchBot)
[02:21:34] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.3 [core] (wmf/1.41.0-wmf.3) - https://gerrit.wikimedia.org/r/904919 (https://phabricator.wikimedia.org/T330209) (owner: TrainBranchBot)
[02:25:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[02:27:30] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0300)
[03:11:06] (PS20) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT)
[deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [03:12:13] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [03:18:30] (Storage /var over 50%) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [03:27:20] (03PS21) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [03:40:38] (03PS22) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [04:05:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:06:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:29:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:29:44] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [04:29:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:32:28] (03PS15) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [04:34:19] (ProbeDown) resolved: (2) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:34:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [04:34:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:35:29] (03CR) 10Slyngshede: C:httpd move htcacheclean to httpd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [04:39:46] (03PS23) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [04:40:52] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) 
(owner: KartikMistry)
[05:08:30] (PS6) Hashar: gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[05:30:24] (CR) Hashar: gerrit: replace Icinga with Prometheus monitoring (3 comments) [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[05:34:20] (CR) Hashar: [C: +1] "I feel it is odd to have to create a specific receiver, then I guess Alertmanager does not have a way to "merge" different receivers ;)" [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[05:37:06] (CR) Hashar: [C: +1] "... continuing my comment" [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[05:43:18] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ayounsi)
[05:43:31] SRE, Infrastructure Security, Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (MoritzMuehlenhoff) >> Then on second login a systemd user service is started for the user, which automatically renews they kerberos ticket up to its maximum renewable lifetime: >>...
[05:53:05] (CR) Ayounsi: Varnish: prefix 403 and 429 with a unique ID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973) (owner: Ayounsi)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0600)
[06:00:05] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0600).
[06:04:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:09:48] !log stage new Junos on asw2-c-eqiad - T331882 [06:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:52] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [06:15:35] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:31] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:18:20] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905467 [06:25:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:27:30] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:29:18] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/905467 (owner: 10Muehlenhoff) [06:35:27] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:39:57] (03PS1) 10Muehlenhoff: Failover urldownloader in eqiad for row C switch maintenance [dns] - 10https://gerrit.wikimedia.org/r/905469 
(https://phabricator.wikimedia.org/T331882) [06:59:13] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10Ladsgroup) I'll try to take a look at the grants (it's a bit unusual for me given that's behind haproxy) but while we are at it, can you increase number of CPUs? two... [07:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:00:15] cool [07:00:22] o/ [07:00:27] that was easy :] [07:06:38] (03CR) 10Muehlenhoff: [C: 03+2] Failover urldownloader in eqiad for row C switch maintenance [dns] - 10https://gerrit.wikimedia.org/r/905469 (https://phabricator.wikimedia.org/T331882) (owner: 10Muehlenhoff) [07:09:09] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MoritzMuehlenhoff) [07:10:40] I am going to restart Gerrit for some plugins updates and switching from `git-fat` to `git-lfs` [07:11:08] (03CR) 10Hashar: [C: 03+2] Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [07:11:21] (03CR) 10Hashar: [C: 03+2] wm-zuul-status: fix items having no build [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/902718 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [07:12:07] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:18] (03Merged) 10jenkins-bot: wm-zuul-status: fix items having no build [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/902718 
(https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [07:12:53] (03CR) 10Hashar: [C: 03+2] Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [07:13:01] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:13:39] (03Merged) 10jenkins-bot: Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [07:13:42] (03Merged) 10jenkins-bot: Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [07:23:30] (Storage /var over 50%) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [07:23:31] !log hashar@deploy2002 Started deploy [gerrit/gerrit@453b038]: Gerrit plugin update and switching from git-fat to git-lfs [07:23:36] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@453b038]: Gerrit plugin update and switching from git-fat to git-lfs (duration: 00m 05s) [07:24:23] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup) MW section masters: - db1100: s5 - db1131: s6 - db1181: s7 Need to downtime the whole sections for these. I'll do it a bit later. 
[07:24:48] (PS1) Gerrit maintenance bot: mariadb: Promote db1122 to s2 master [puppet] - https://gerrit.wikimedia.org/r/904925 (https://phabricator.wikimedia.org/T333918)
[07:27:09] !log hashar@deploy2002 Started deploy [gerrit/gerrit@453b038]: Gerrit plugin update and switching from git-fat to git-lfs
[07:27:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T333918
[07:27:15] T333918: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T333918
[07:27:17] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@453b038]: Gerrit plugin update and switching from git-fat to git-lfs (duration: 00m 08s)
[07:27:29] I don't have to restart Gerrit anymore ;)
[07:27:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T333918
[07:28:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1122 with weight 0 T333918', diff saved to https://phabricator.wikimedia.org/P46013 and previous config saved to /var/cache/conftool/dbconfig/20230404-072817-ladsgroup.json
[07:28:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2002.codfw.wmnet
[07:31:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2002.codfw.wmnet
[07:31:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
[07:35:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
[07:35:54] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs
[07:36:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2001.codfw.wmnet
[07:36:04] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs
[07:38:47] (03CR) 10Slyngshede: [C: 03+2] P:installserver::proxy fix typo in log message. [puppet] - 10https://gerrit.wikimedia.org/r/904784 (owner: 10Slyngshede) [07:40:35] (03CR) 10Elukey: "The chart looks good! I need to check the resources produced in CI to see if they fit a deployment in ml-staging-codfw and ml-serve, since" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [07:40:37] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40515/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [07:41:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2001.codfw.wmnet [07:41:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:42:11] (03CR) 10Jelto: [V: 03+1] "looks mostly good, but I'd like to check-in with RelEng first and create a small announcement before merging" [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [07:42:27] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:42:56] 10SRE, 10Traffic, 10serviceops-collab, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Jelto) [07:43:34] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 
(10Jelto) [07:44:04] (03CR) 10Filippo Giunchedi: "See task too, I don't think this is wanted" [puppet] - 10https://gerrit.wikimedia.org/r/905244 (https://phabricator.wikimedia.org/T333838) (owner: 10Herron) [07:45:43] (03PS2) 10Ladsgroup: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/904925 (https://phabricator.wikimedia.org/T333918) (owner: 10Gerrit maintenance bot) [07:45:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/904925 (https://phabricator.wikimedia.org/T333918) (owner: 10Gerrit maintenance bot) [07:46:32] !log Starting s2 eqiad failover from db1162 to db1122 - T333918 [07:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:36] T333918: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T333918 [07:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1122 to s2 primary T333918', diff saved to https://phabricator.wikimedia.org/P46014 and previous config saved to /var/cache/conftool/dbconfig/20230404-074656-ladsgroup.json [07:48:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1162 T333918', diff saved to https://phabricator.wikimedia.org/P46015 and previous config saved to /var/cache/conftool/dbconfig/20230404-074848-ladsgroup.json [07:49:41] (03PS1) 10Filippo Giunchedi: aptrepo: update jenkins key [puppet] - 10https://gerrit.wikimedia.org/r/905542 [07:50:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:50:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:50:30] (03CR) 10Filippo Giunchedi: "reprepro update is currently broken" [puppet] - 10https://gerrit.wikimedia.org/r/905542 (owner: 10Filippo Giunchedi) [07:51:24] (03CR) 10Filippo Giunchedi: "Key comes from 
https://pkg.jenkins.io/debian/jenkins.io-2023.key as per https://www.jenkins.io/blog/2023/03/27/repository-signing-keys-cha" [puppet] - 10https://gerrit.wikimedia.org/r/905542 (owner: 10Filippo Giunchedi) [07:57:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [08:00:05] hashar and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T0800). nyaa~ [08:00:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs [08:00:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs [08:01:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/905542 (owner: 10Filippo Giunchedi) [08:01:07] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams [08:01:12] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams [08:01:19] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Jelto) @brennen @hashar there is a [open change](https://gerrit.wikimedia.org/r/c/operations/puppet/... 
[08:03:22] (Abandoned) David Caro: cloud.yaml: pass a yaml formatter to it [puppet] - https://gerrit.wikimedia.org/r/905159 (owner: David Caro)
[08:03:34] (PS4) David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083)
[08:03:42] (PS4) David Caro: p:cloudceph::osd: enable location hook [puppet] - https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083)
[08:03:53] (CR) Filippo Giunchedi: [C: +2] aptrepo: update jenkins key [puppet] - https://gerrit.wikimedia.org/r/905542 (owner: Filippo Giunchedi)
[08:04:27] (CR) David Caro: [C: +2] maintain-dbusers: fix systemd service description [puppet] - https://gerrit.wikimedia.org/r/895814 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[08:05:32] (CR) David Caro: [C: +2] maintain-dbusers: run isort and black and use pep563 types [puppet] - https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[08:06:48] (CR) David Caro: [C: +2] maintain-dbusers: refactor [puppet] - https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[08:06:57] (PS12) David Caro: maintain-dbusers: refactor [puppet] - https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[08:09:11] (PS1) Muehlenhoff: Also add component/pybal for pybaltest hosts [puppet] - https://gerrit.wikimedia.org/r/905543
[08:10:35] (PS1) Urbanecm: ckbwiktionary: Add logo [mediawiki-config] - https://gerrit.wikimedia.org/r/905544 (https://phabricator.wikimedia.org/T331831)
[08:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P46016 and previous config saved to /var/cache/conftool/dbconfig/20230404-081039-ladsgroup.json
[08:11:51] i am there for the train
[08:12:24] so is jouncebot ! :)
[08:12:34] great, I am not alone!
[08:12:42] (PS7) David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954)
[08:13:19] there are no risky patches on T330209
[08:13:19] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209
[08:13:40] [mediawiki/core@wmf/1.41.0-wmf.3] Branch commit for wmf/1.41.0-wmf.3 got merged
[08:14:34] hashar: "there are no risky patches on " <- do you want me to fix that? :P
[08:14:47] sure!
[08:15:03] (CR) David Caro: [C: +2] maintain-dbusers: allow filtering by account type for maintain [puppet] - https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) (owner: David Caro)
[08:15:13] :D
[08:15:19] nah, I don't have anything
[08:16:58] so the train presync step failed for some reason
[08:17:57] (CR) Ladsgroup: [C: +2] Add add_af_actor_T333332.py [software/schema-changes] - https://gerrit.wikimedia.org/r/904786 (https://phabricator.wikimedia.org/T333332) (owner: Ladsgroup)
[08:18:24] (Merged) jenkins-bot: Add add_af_actor_T333332.py [software/schema-changes] - https://gerrit.wikimedia.org/r/904786 (https://phabricator.wikimedia.org/T333332) (owner: Ladsgroup)
[08:22:08] !log upgrade grafana* to grafana 9.3.11 - T333915
[08:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:54] (PS4) Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[08:23:17] (CR) CI reject: [V: -1] mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: Jcrespo)
[08:25:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams
[08:25:41] hashar: that `branch_cut_pretest` is interfering:
[08:25:41] jnuche@erebor:~ $ git ls-remote --sort=version:refname https://gerrit.wikimedia.org/r/mediawiki/core refs/heads/wmf/*
[08:25:41] f231c3b2bfef44b6f1bf49949c68b84f3c629102 refs/heads/wmf/1.40.0-wmf.24
[08:25:41] [...]
[08:25:41] 1b1a0ab85027e440b2b2ba52afcad05f5773f837 refs/heads/wmf/1.41.0-wmf.2
[08:25:42] 16e55ac99a08f8460615610404a685bf0c76a8cb refs/heads/wmf/1.41.0-wmf.3
[08:25:42] 430d25d1a1858edfa4a6199dfe1f0eb3743a219a refs/heads/wmf/branch_cut_pretest
[08:25:43] (PS5) Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[08:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P46017 and previous config saved to /var/cache/conftool/dbconfig/20230404-082543-ladsgroup.json
[08:26:42] jnuche: yup. I am going to delete that branch, run the script again and recreate the branch after with the same commit
[08:26:51] or is the script creating it?
[08:27:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams [08:27:17] no, it was created on friday or saturday, not sure why, I asked in -releng at the time and got no answer [08:28:20] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/904514/40517/backup1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [08:28:35] !log Deleting mediawiki/core branch `wmf/branch_cut_pretest` pointing at `430d25d1a1858edfa4a6199dfe1f0eb3743a219a` # T330209 [08:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:39] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [08:28:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:29:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:29:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T333332)', diff saved to https://phabricator.wikimedia.org/P46019 and previous config saved to /var/cache/conftool/dbconfig/20230404-082911-ladsgroup.json [08:29:15] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [08:29:51] trying /usr/bin/scap stage-train -Dfull_image_build:True --yes auto [08:29:51] (03PS1) 10David Caro: maintaint_dbusers: fix usage of host+port [puppet] - 10https://gerrit.wikimedia.org/r/905567 [08:30:59] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905568 (https://phabricator.wikimedia.org/T330209) [08:31:01] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905568 
(https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:31:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T333332)', diff saved to https://phabricator.wikimedia.org/P46020 and previous config saved to /var/cache/conftool/dbconfig/20230404-083120-ladsgroup.json [08:31:44] (03CR) 10CI reject: [V: 04-1] testwikis wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905568 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:31:48] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905568 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:32:12] !log hashar@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.3 refs T330209 [08:32:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [08:35:17] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [08:35:23] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [08:35:35] stage-train is running [08:36:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:38:19] (03PS6) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [08:39:49] (03PS2) 10David Caro: maintaint_dbusers: generalize usage of host+port [puppet] - 10https://gerrit.wikimedia.org/r/905567 [08:40:36] (03CR) 10Jcrespo: mediabackups: Add static console port for easier 
remote management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [08:40:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P46021 and previous config saved to /var/cache/conftool/dbconfig/20230404-084048-ladsgroup.json [08:41:18] (03PS7) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [08:41:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:42:59] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) Hi, I think slot 19 must be the missing-from-the-OS drive (`/dev/sdx`) - typically drives marked "foreign" are not visible to the host OS. I'm afraid that the RAID... [08:43:09] (03PS8) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [08:43:51] (03PS2) 10Alexandros Kosiaris: prometheus: Add proper ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/905260 (owner: 10Elukey) [08:45:19] (03CR) 10Muehlenhoff: "Seems fine for production (one comment inline), but not sure if anything in WMCS uses/needs DHCP?" 
[puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [08:45:52] (03CR) 10Majavah: [C: 03+1] maintaint_dbusers: generalize usage of host+port [puppet] - 10https://gerrit.wikimedia.org/r/905567 (owner: 10David Caro) [08:45:54] (03CR) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [08:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P46022 and previous config saved to /var/cache/conftool/dbconfig/20230404-084627-ladsgroup.json [08:46:28] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:46:33] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:47:34] (03CR) 10David Caro: [C: 03+2] maintaint_dbusers: generalize usage of host+port [puppet] - 10https://gerrit.wikimedia.org/r/905567 (owner: 10David Caro) [08:51:03] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [08:51:48] (03CR) 10Elukey: [C: 03+1] prometheus: Add proper ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/905260 (owner: 10Elukey) [08:53:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad [08:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:54:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Add proper ClusterRole [deployment-charts] - 
10https://gerrit.wikimedia.org/r/905260 (owner: 10Elukey) [08:55:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [08:55:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P46023 and previous config saved to /var/cache/conftool/dbconfig/20230404-085553-ladsgroup.json [08:57:30] (03CR) 10Santhosh: [C: 04-1] WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [08:58:05] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10hashar) Gitlab being fairly recent, I don't think it ever got advertised with `http` rather than `ht... [08:58:15] (03PS1) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [08:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:12] (03Merged) 10jenkins-bot: prometheus: Add proper ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/905260 (owner: 10Elukey) [08:59:39] (03CR) 10Hashar: [C: 03+1] "AFAIK our GitLab as always been exposed with https and I don't think there are any use case for http." 
[puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [09:00:35] (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [09:01:31] (03CR) 10Jbond: [C: 03+1] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [09:01:33] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [09:01:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P46024 and previous config saved to /var/cache/conftool/dbconfig/20230404-090133-ladsgroup.json [09:01:35] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [09:02:19] !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:02:23] !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [09:02:33] !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:02:36] !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [09:02:54] !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:02:59] !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:03:05] !log akosiaris@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:03:08] !log akosiaris@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:03:22] (03PS2) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [09:03:38] !log akosiaris@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:03:41] !log akosiaris@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:04:01] !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:04:05] !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:04:30] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [09:04:34] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:04:34] (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [09:05:02] (03CR) 10Majavah: [C: 04-1] "Cloud VPS instances do use DHCP to get IPs assigned to their interfaces, so this needs some adjustments to ensure the packages get purged " [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [09:07:02] !log installing libdatetime-timezone-perl updates [09:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:56] (03CR) 10Ayounsi: "Quick looks shows some false positives, but still plenty to fix." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [09:09:08] !log installing libmicrohttpd security updates [09:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:33] !log hashar@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.3 refs T330209 (duration: 40m 20s) [09:14:47] doing the clean step now [09:14:52] !log hashar@deploy2002 Pruned MediaWiki: 1.41.0-wmf.1 (duration: 02m 16s) [09:15:03] (03PS1) 10Filippo Giunchedi: sre: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905571 (https://phabricator.wikimedia.org/T309182) [09:16:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T333332)', diff saved to https://phabricator.wikimedia.org/P46025 and previous config saved to /var/cache/conftool/dbconfig/20230404-091639-ladsgroup.json [09:16:57] requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://train-blockers.toolforge.org/api.php [09:17:03] this is never ending [09:17:04] :) [09:17:36] looks like the train-blockers toolforge tool is not responding anymore :-( [09:17:54] well stashbot has vanished as well... [09:17:59] looks like toolforge might be dead? 
[09:18:26] (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:18:42] I can still SSH to toolforge at least [09:19:19] train-blockers has “Temporary failure in name resolution” in its error log [09:19:49] I am filing an UBN [09:20:11] (03PS1) 10David Caro: maintain_dbusers: update config with host+port [puppet] - 10https://gerrit.wikimedia.org/r/905578 [09:21:28] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905571 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:21:33] let me fix that [09:21:46] (03PS24) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [09:21:56] (03CR) 10Muehlenhoff: "I'm going to revert two of the patch hunks: lists and mw_rc_irc are not owned by SRE IF; while Jesse and myself have stepped up to upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi) [09:22:05] `webservice status` also dies with a read timeout from k8s.tools.eqiad1.wikimedia.cloud [09:22:14] I’ll leave it to t.aavi then [09:22:18] (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:22:20] (03PS1) 10KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [09:22:41] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:23:09] (03PS1) 10Filippo Giunchedi: sre: add missing 
deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905580 (https://phabricator.wikimedia.org/T309182) [09:23:11] (03PS1) 10Filippo Giunchedi: data-engineering: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905581 (https://phabricator.wikimedia.org/T309182) [09:24:21] filed as https://phabricator.wikimedia.org/T333922 [09:25:00] :) [09:25:18] stashbot is back, maybe it got past whatever timeout is occurring [09:25:20] (03CR) 10David Caro: [C: 03+2] maintain_dbusers: update config with host+port [puppet] - 10https://gerrit.wikimedia.org/r/905578 (owner: 10David Caro) [09:25:23] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [09:26:09] (03PS1) 10Filippo Giunchedi: data-persistence: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905582 (https://phabricator.wikimedia.org/T309182) [09:28:14] (03PS1) 10Muehlenhoff: Remove ownership annotations for two roles [puppet] - 10https://gerrit.wikimedia.org/r/905584 [09:28:58] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [09:29:30] hashar: toolforge k8s hiccup [09:29:39] (03CR) 10Ayounsi: [C: 03+2] [k8s ml/dse/wiki] Add policy to export prefixes to nodes [homer/public] - 10https://gerrit.wikimedia.org/r/905171 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:29:53] (03PS1) 10David Caro: maintain-dbusers: fix bad key [puppet] - 10https://gerrit.wikimedia.org/r/905585 [09:30:33] (03Merged) 10jenkins-bot: [k8s ml/dse/wiki] Add policy to export prefixes to nodes [homer/public] - 10https://gerrit.wikimedia.org/r/905171 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:31:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove ownership annotations for two roles [puppet] - 10https://gerrit.wikimedia.org/r/905584 (owner: 10Muehlenhoff) [09:34:29] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905586
(https://phabricator.wikimedia.org/T330209) [09:34:31] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905586 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [09:35:21] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905586 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [09:35:26] (03PS6) 10Lucas Werkmeister (WMDE): Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) [09:35:28] (03PS5) 10Lucas Werkmeister (WMDE): DNM: Stop using $wmgExtraLanguageNames in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628774 (https://phabricator.wikimedia.org/T263441) [09:35:30] (03PS5) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628775 (https://phabricator.wikimedia.org/T263441) [09:35:32] (03PS4) 10Lucas Werkmeister (WMDE): Remove $wmgExtraLanguageNames from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628776 (https://phabricator.wikimedia.org/T263441) [09:36:17] (03CR) 10CI reject: [V: 04-1] Remove unused $wgExtraLanguageNames['qqq'] assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [09:39:12] (03PS1) 10Filippo Giunchedi: wmcs: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905587 (https://phabricator.wikimedia.org/T309182) [09:39:17] (03CR) 10Lucas Werkmeister (WMDE): "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628773 (https://phabricator.wikimedia.org/T263441) (owner: 10Lucas Werkmeister (WMDE)) [09:39:24] (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: add missing deploy-tag [alerts] - 
10https://gerrit.wikimedia.org/r/905581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:39:26] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905580 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:39:28] (03CR) 10Filippo Giunchedi: [C: 03+2] data-persistence: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905582 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:40:52] (03PS3) 10Jbond: base::standard_packages: remove isc-dhcp-client [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) [09:41:24] taavi: arturo: could you comment on https://phabricator.wikimedia.org/T333922 about the toolforge / k8s issue and eventually mark it resolved? ;) [09:41:49] or maybe it is still being investigated. At least scap got a reply so the task is no longer a blocker for the mediawiki train [09:42:01] (03Merged) 10jenkins-bot: sre: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905580 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:42:03] (03Merged) 10jenkins-bot: data-engineering: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:42:05] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.3 refs T330209 [09:42:05] (03Merged) 10jenkins-bot: data-persistence: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905582 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:42:09] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [09:42:19] hashar: that specific issue is resolved, but we're still looking into the root cause and more permanent fixes [09:42:22] 10SRE, 10Wikimedia-Planet, 10Patch-For-Review: Find a replacement
for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (10MoritzMuehlenhoff) >>! In T281219#8726273, @Legoktm wrote: > I'd also be willing to write a new planet that addresses stuff like T207244 if I'm allowed... [09:42:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40518/console" [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [09:42:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/900633 (owner: 10Slyngshede) [09:43:04] (03PS1) 10Majavah: Revert "hieradata: swap eqiad1 dns server order" [puppet] - 10https://gerrit.wikimedia.org/r/905493 [09:43:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:44:35] (03PS4) 10Jbond: base::standard_packages: remove isc-dhcp-client [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) [09:44:43] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: add missing deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/905587 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:45:21] (03CR) 10Slyngshede: [V: 03+2] Remove cleanup command [software/bitu] - 10https://gerrit.wikimedia.org/r/900633 (owner: 10Slyngshede) [09:45:23] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Remove cleanup command [software/bitu] - 10https://gerrit.wikimedia.org/r/900633 (owner: 10Slyngshede) [09:46:08] (03CR) 10David Caro: [C: 03+2] Revert "hieradata: swap eqiad1 dns server order" [puppet] - 10https://gerrit.wikimedia.org/r/905493 (owner: 10Majavah) [09:46:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "hieradata: swap eqiad1 dns server order" 
[puppet] - 10https://gerrit.wikimedia.org/r/905493 (owner: 10Majavah) [09:46:17] (03CR) 10CI reject: [V: 04-1] base::standard_packages: remove isc-dhcp-client [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [09:46:49] and I can't push to gerrit from the deployment host anymore bah [09:47:16] (03PS5) 10Jbond: base::standard_packages: remove isc-dhcp-client [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) [09:48:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:48:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40520/console" [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [09:49:26] (03PS1) 10Hashar: Revert "group0 wikis to 1.41.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905590 (https://phabricator.wikimedia.org/T330209) [09:49:28] (03CR) 10Hashar: [C: 03+2] Revert "group0 wikis to 1.41.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905590 (https://phabricator.wikimedia.org/T330209) (owner: 10Hashar) [09:49:49] (03CR) 10Jbond: [V: 03+1] base::standard_packages: remove isc-dhcp-client (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [09:50:11] (03CR) 10CI reject: [V: 04-1] Revert "group0 wikis to 1.41.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905590 (https://phabricator.wikimedia.org/T330209) (owner: 10Hashar) [09:50:16] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.41.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905590 
(https://phabricator.wikimedia.org/T330209) (owner: 10Hashar) [09:51:21] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.41.0-wmf.3" | T330209 [09:51:25] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [09:52:30] (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:04] (03PS1) 10Muehlenhoff: Update Java images to OpenJDK 11.0.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905592 [09:57:31] well the MediaWiki train is blocked on T333926 [09:57:32] T333926: PHP Deprecated: Accessing $wgHooks directly is deprecated, use HookContainer::getHandlers() or HookContainer::register() instead. [Called from {closure}] - https://phabricator.wikimedia.org/T333926 [09:59:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1000) [10:00:38] I am done with the train, I rolled back the group0 promotion [10:00:51] I am going to have lunch with the kids [10:15:29] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [10:17:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [10:18:04] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM overall though" [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [10:25:15] 
(PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:27:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:28:38] (03CR) 10Alexandros Kosiaris: "Adding Kartik, Niklas and Santhosh for their information." [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [10:29:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [10:29:08] (03PS10) 10Elukey: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:34] (03CR) 10JMeybohm: [C: 03+1] Update Java images to OpenJDK 11.0.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905592 (owner: 10Muehlenhoff) [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:32] (03CR) 10Muehlenhoff: smart_data_dump: adapt for newer ssacli (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [10:38:20] (03CR) 10David Caro: [C: 03+2] maintain-dbusers: fix bad key [puppet] - 10https://gerrit.wikimedia.org/r/905585 (owner: 10David Caro) [10:41:06] (03PS1) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [10:41:58] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10elukey) [10:42:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [10:43:32] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [10:45:17] (03PS1) 10Elukey: Stop Yarn queues and Gobblin timers [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) [10:52:08] (03PS1) 10Lucas Werkmeister (WMDE): Use HookContainer to register hooks inside hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) [10:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[10:53:32] (03PS1) 10Majavah: P:toolforge::prometheus: add probe for coredns pods [puppet] - 10https://gerrit.wikimedia.org/r/905599 [10:53:55] (03CR) 10CI reject: [V: 04-1] P:toolforge::prometheus: add probe for coredns pods [puppet] - 10https://gerrit.wikimedia.org/r/905599 (owner: 10Majavah) [10:54:20] (03PS2) 10Majavah: P:toolforge::prometheus: add probe for coredns pods [puppet] - 10https://gerrit.wikimedia.org/r/905599 [10:56:45] (03CR) 10Jbond: [C: 03+2] logrotate: add coumentations and fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/905175 (owner: 10Jbond) [10:57:07] (03PS4) 10Jbond: logrotate: add logrotate profile [puppet] - 10https://gerrit.wikimedia.org/r/905181 [10:58:04] (03CR) 10Effie Mouzeli: [C: 03+1] "just a nit on the commit message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [10:58:14] (03PS3) 10Effie Mouzeli: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [10:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40521/console" [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [11:00:50] (03CR) 10Jbond: [V: 03+1] logrotate: add logrotate profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [11:01:08] (03PS2) 10Jbond: O:aphlict: update to use profile::logrotate to configure hourly [puppet] - 10https://gerrit.wikimedia.org/r/905182 [11:02:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40522/console" [puppet] - 10https://gerrit.wikimedia.org/r/905182 (owner: 10Jbond) [11:04:18] (03CR) 10Stevemunene: [C: 03+1] Stop Yarn queues and Gobblin timers [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) (owner: 10Elukey) [11:04:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [11:11:06] (03CR) 10Volans: "This might be a bit overkill for this repo, but YMMV :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:12:26] (03CR) 10Muehlenhoff: LDAP attribute editor (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [11:12:43] (03PS1) 10Majavah: P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 [11:14:55] (03CR) 10CI reject: [V: 04-1] P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:15:03] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40523/console" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:18:33] (03CR) 10Slyngshede: LDAP attribute editor (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [11:21:29] (03PS1) 10Phuedx: VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) [11:22:38] (03CR) 10Muehlenhoff: LDAP attribute editor (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [11:24:44] !log 
installing joblib security updates [11:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:30] (Storage /var over 50%) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:29:27] (03PS2) 10Majavah: P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 [11:29:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge::prometheus: add probe for coredns pods [puppet] - 10https://gerrit.wikimedia.org/r/905599 (owner: 10Majavah) [11:29:54] (03CR) 10CI reject: [V: 04-1] P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:31:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40524/console" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:32:02] (03PS3) 10Majavah: P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 [11:32:25] (03CR) 10CI reject: [V: 04-1] P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:33:46] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 52310952 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:34:50] (03PS4) 10Majavah: P:ldap::client: split config and utils to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/905600 [11:35:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, we can tweak the options to generate the captcha when we see it being bypassed/targeted. 
One question inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [11:35:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:35:33] 10SRE, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (10Naruse_shiroha) @diskdance , I saw you added exmaple of Signal and ProtonVPN, that in China, neither works.... [11:35:52] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 48632 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:36:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40526/console" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [11:38:47] (03PS2) 10Slyngshede: Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) [11:42:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, minor comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [11:42:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [11:42:43] (03CR) 10Muehlenhoff: "Two post-merge comments/questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/905160 (owner: 10Jbond) [11:46:31] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) >>! In T330693#8701662, @gmodena wrote: [...] >> @MatthewVernon brou... [11:46:53] (03CR) 10Ayounsi: [C: 03+1] reports: exclude recycled devices from accounting (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [11:46:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "this seems like is ready to be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [11:47:10] (03PS2) 10Elukey: Stop Hadoop Yarn queues to ease network maintenance [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) [11:48:07] (03PS3) 10Slyngshede: Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) [11:48:24] (03CR) 10Slyngshede: Signup: Add captcha to signups. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [11:48:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo what Moritz pointed out" [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [11:53:01] (03PS1) 10DCausse: rdf-streaming-updater: tune mem overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/905602 (https://phabricator.wikimedia.org/T328675) [11:58:11] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10fgiunchedi) [11:58:47] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Jelto) >>! In T238720#8753483, @hashar wrote: > Gitlab being fairly recent, I don't think it ever go... [12:02:19] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Sustainability (Incident Followup), 10Wikimedia-Slow-DB-Query: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983 (10Ladsgroup) >>! In T160983#8717219, @Marostegui wrote: > @Ladsgroup thought on t... 
[12:03:30] (Storage /var over 50%) resolved: Device cloudsw1-b1-codfw.mgmt.codfw.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [12:04:04] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10fgiunchedi) We could also fix this on the karma side instead, I've asked upstream about it here: https://github.com/prymitive/karma/issues/5154 [12:06:12] 10SRE, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (10Diskdance) That list is just for everyone's reference. They may not work in China currently, but we can possib... [12:06:24] (03PS3) 10Tim Starling: Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) [12:06:46] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:00] (03CR) 10CI reject: [V: 04-1] Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [12:07:06] (03CR) 10Muehlenhoff: "Looks good, two comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [12:08:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) (owner: 10Slyngshede) [12:09:11] (03CR) 10Tim Starling: "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [12:09:26] !disable puppet and stop bird on doh1001: T331882 [12:09:27] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:09:59] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ssingh) [12:10:09] (03CR) 10Tim Starling: [C: 03+2] Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [12:10:50] (03Merged) 10jenkins-bot: Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [12:11:46] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:07] (03CR) 10MVernon: [C: 04-1] "[-1 until I'm done wrestling with linters]" [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [12:13:18] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:13:24] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:13:38] PROBLEM - Bird Internet Routing Daemon on doh1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:14:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:28] ^ expected [12:14:47] (03CR) 10Volans: [C: 03+2] reports: exclude recycled devices from accounting [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [12:15:50] (03Merged) 10jenkins-bot: reports: exclude recycled devices from accounting [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [12:17:30] !log tstarling@deploy2002 Synchronized src/Profiler.php: T331882 disable profiling for switch maintenance (duration: 05m 58s) [12:17:34] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:18:49] (03PS1) 10Tim Starling: Re-enable xenon/excimer after mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905608 [12:23:01] (03CR) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [12:24:40] !log I noticed that mw2382 was still talking to mwlog1002. It still had old php-fpm7.4 processes despite the scap. So I manually restarted php-fpm on it. 
[12:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:06] (03CR) 10Stevemunene: Stop Hadoop Yarn queues to ease network maintenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) (owner: 10Elukey) [12:26:40] !log stevemunene@puppetmaster1001 conftool action : set/pooled=no; selector: name=datahubsearch1003.eqiad.wmnet [12:27:03] (03CR) 10Elukey: Stop Hadoop Yarn queues to ease network maintenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) (owner: 10Elukey) [12:27:55] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [12:28:05] !log volans@cumin1001 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling update on A:netbox-canary [12:28:05] !log stevemunene@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1012.eqiad.wmnet [12:28:10] (03PS1) 10Ayounsi: Depool eqiad for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/905603 (https://phabricator.wikimedia.org/T331882) [12:28:11] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [12:28:16] !log stevemunene@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1013.eqiad.wmnet [12:28:18] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [12:28:25] !log stevemunene@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1018.eqiad.wmnet [12:29:29] (03PS1) 10Muehlenhoff: Also broadcast RCFeed/IRC events to irc1002/irc2002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905626 (https://phabricator.wikimedia.org/T331702) [12:30:26] (03CR) 10Ssingh: [C: 03+1] Depool eqiad for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/905603 (https://phabricator.wikimedia.org/T331882) (owner: 10Ayounsi) [12:30:46] 10SRE, 10DBA, 
10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene) [12:31:00] (03PS5) 10Jbond: logrotate: add logrotate profile [puppet] - 10https://gerrit.wikimedia.org/r/905181 [12:31:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 38 hosts with reason: Row c switch maint T331882 [12:31:04] (03CR) 10Ssingh: [C: 03+2] Depool eqiad for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/905603 (https://phabricator.wikimedia.org/T331882) (owner: 10Ayounsi) [12:31:08] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:31:08] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [12:31:09] !log run authdns-update for CR: 905603 depool eqiad [12:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:12] (03CR) 10Jbond: [C: 03+2] logrotate: add logrotate profile [puppet] - 10https://gerrit.wikimedia.org/r/905181 (owner: 10Jbond) [12:31:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 38 hosts with reason: Row c switch maint T331882 [12:32:03] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10hashar) That does not provide much information :) I say go for it, I don't think anything accesses... 
[12:32:16] !log [finished] run authdns-update for CR: 905603 depool eqiad [12:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:11] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:35:00] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:35:27] (03CR) 10Stevemunene: [C: 03+2] Stop Hadoop Yarn queues to ease network maintenance [puppet] - 10https://gerrit.wikimedia.org/r/905596 (https://phabricator.wikimedia.org/T331882) (owner: 10Elukey) [12:35:29] (03CR) 10EoghanGaffney: [C: 03+1] O:aphlict: update to use profile::logrotate to configure hourly [puppet] - 10https://gerrit.wikimedia.org/r/905182 (owner: 10Jbond) [12:36:21] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy this is now deployed and should exlude the devices list in the accounting spreadsheet, Recycled sheet. FYI from line 340 on that sheet there are a bunch... 
[12:37:43] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:38:18] !log depool ms-fe1011 re T331882 [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:23] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:39:10] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [12:40:16] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10Trizek-WMF) Just checking: is it approved or still pending my manager's approval? Thank you both! [12:40:38] (03PS1) 10Arturo Borrero Gonzalez: clouddumps: depool clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/905628 (https://phabricator.wikimedia.org/T331882) [12:41:45] (03PS1) 10Ladsgroup: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 [12:42:03] jouncebot: nowandnext [12:42:03] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [12:42:03] In 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300) [12:42:03] In 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300) [12:42:11] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905628 (https://phabricator.wikimedia.org/T331882) (owner: 10Arturo Borrero Gonzalez) [12:42:44] (03CR) 10David Caro: [C: 03+1] "I don't know if that's all that's needed, but looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/905628 (https://phabricator.wikimedia.org/T331882) (owner: 10Arturo Borrero Gonzalez) [12:43:01] !log depool 
thanos-fe1003 re T331882 [12:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddumps: depool clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/905628 (https://phabricator.wikimedia.org/T331882) (owner: 10Arturo Borrero Gonzalez) [12:44:16] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [12:44:54] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches upgrade - T331882 [12:45:01] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:45:15] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches up... 
[12:45:29] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:46:20] (03PS1) 10Arturo Borrero Gonzalez: Revert "clouddumps: depool clouddumps1002" [puppet] - 10https://gerrit.wikimedia.org/r/905610 [12:47:05] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:48:02] (03PS3) 10Jbond: O:aphlict: update to use profile::logrotate to configure hourly [puppet] - 10https://gerrit.wikimedia.org/r/905182 [12:50:34] (03CR) 10Jbond: [C: 03+2] O:aphlict: update to use profile::logrotate to configure hourly [puppet] - 10https://gerrit.wikimedia.org/r/905182 (owner: 10Jbond) [12:52:31] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero) [12:52:50] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 228 hosts with reason: eqiad row C upgrade [12:52:52] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on 228 hosts with reason: eqiad row C upgrade [12:57:04] RECOVERY - Check systemd state on aphlict2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:10] (03CR) 10Elukey: "I think that the chart looks good, added Alex and Janis (ServiceOps) to be on the same page of what the new chart will do and how it will " [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [12:57:20] !log putting pdfs into safe mode as part of T331882 [12:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:26] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [12:57:40] !log 
ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 227 hosts with reason: eqiad row C upgrade [12:59:36] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300). [13:00:06] Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300) [13:00:10] o/ [13:01:20] (03CR) 10Jbond: [C: 03+2] "now in place" [puppet] - 10https://gerrit.wikimedia.org/r/905182 (owner: 10Jbond) [13:01:29] (03PS2) 10Ladsgroup: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800) [13:01:36] i'll self-deploy [13:01:42] go ahead [13:02:04] I also have a possible fix for the train blocker, if someone can review it https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905598 [13:02:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905544 (https://phabricator.wikimedia.org/T331831) (owner: 10Urbanecm) [13:02:52] (03Merged) 10jenkins-bot: ckbwiktionary: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905544 (https://phabricator.wikimedia.org/T331831) (owner: 10Urbanecm) [13:02:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 227 hosts with reason: eqiad row 
C upgrade [13:03:15] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80a32cef-9700-4047-8185-415ffca1aaa2) set by ayounsi@cumin1001 for 2:00:00 on 227 host(s... [13:03:24] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:905544|ckbwiktionary: Add logo (T331831)]] [13:03:29] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [13:03:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:05:03] (03CR) 10Jforrester: [C: 03+1] "Yeah, this should fix it. Meh. Thanks, Lucas!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) (owner: 10Lucas Werkmeister (WMDE)) [13:05:14] ok, I can deploy that later :) [13:05:25] (I’ll test it by trying a login on mobile meta, as that’s what the relevant old task is about) [13:05:42] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row C switches upgrade - T331882 [13:05:47] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [13:06:02] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches up... [13:06:07] interesting. 
I'm curious why do we register hook handlers in a hook handler though :) [13:06:13] anyway, I'll ping you once done [13:06:46] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% [13:06:58] ok thanks [13:08:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:09:02] (03PS2) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [13:10:24] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:905544|ckbwiktionary: Add logo (T331831)]] (duration: 07m 00s) [13:10:30] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [13:10:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:36] okay... [13:10:47] scap finished, but it complains [13:10:57] https://www.irccloud.com/pastebin/LgJkf1At/ [13:10:59] what's the complaint ? 
[13:11:06] !log asw2-c-eqiad> request system reboot all-members - T331882 [13:11:07] was pasting it claime ^^ [13:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:11] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [13:11:12] meh >< [13:11:16] * WMDE-Fisch quickly adding a beta config deply patch [13:11:16] "returned non-zero exit status 1", with no details [13:11:20] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:11:29] huh [13:11:31] oh, there's something higher up [13:11:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1009.eqiad.wmnet [13:12:03] FYI eqiad row C switches going down shortly for upgrade [13:12:18] https://www.irccloud.com/pastebin/0gyRozzk/ [13:12:24] (03PS1) 10David Caro: maintain_dbusers: drop supporte for portless host in config [puppet] - 10https://gerrit.wikimedia.org/r/905629 [13:12:26] (03PS1) 10David Caro: maintain_dbusers: don't skip the whole clouddb if one user fails [puppet] - 10https://gerrit.wikimedia.org/r/905630 (https://phabricator.wikimedia.org/T332762) [13:12:29] claime: ^^ [13:12:45] urbanecm: Awesome, helm being difficult [13:12:56] I'll check it out [13:12:59] ty [13:13:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1006.eqiad.wmnet [13:13:03] hnowlan@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs. 
[13:13:16] (03PS2) 10WMDE-Fisch: [beta] remove flag for experimental mapdata geoshape expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: 10Awight) [13:13:40] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:14:17] urbanecm: capacity issues [13:14:28] I'm not sure what that means [13:14:33] * claime grumbles about needing more kube nodes [13:14:33] (Emergency syslog message) firing: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:14:36] (virtual-chassis crash) firing: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:14:43] We are lacking in kube nodes basically [13:14:49] ah, that sort of capacity issues [13:14:52] I'll redeploy once the switch upgrade is done [13:14:55] (ProbeDown) firing: (2) Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:01] With a lower replica count I guess [13:15:05] (ProbeDown) firing: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:07] so, no need to worry about it now? [13:15:10] And try to raise it again afterwards [13:15:11] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: don't skip the whole clouddb if one user fails [puppet] - 10https://gerrit.wikimedia.org/r/905630 (https://phabricator.wikimedia.org/T332762) (owner: 10David Caro) [13:15:23] and...can Lucas_WMDE continue with his deployment, or should we wait with the window until the upgrade finishes? 
[13:15:23] urbanecm: It just means test2wiki and test.wikidata are still on the old version [13:15:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:35] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10hnowlan) [13:15:50] and if I deploy, will that bump them to the new version? ^^ [13:15:50] PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:52] PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:54] PROBLEM - aqs endpoints health on aqs1020 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:54] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:15:57] oh nevermind you said you’ll redeploy [13:15:58] PROBLEM - aqs endpoints health on aqs1021 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:16:16] (I’m fine with waiting for a bit, I’m actually in a meeting anyway) [13:16:18] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:16:23] (03CR) 10David Caro: [C: 03+2] maintain_dbusers: drop supporte for portless host in config [puppet] - 10https://gerrit.wikimedia.org/r/905629 (owner: 10David Caro) [13:16:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [13:16:30] Lucas_WMDE: yeah it will, but there's quite a few unreachable k8s nodes right now because of the switch upgrade [13:16:34] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy [13:16:43] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:16:44] So if y'all can wait until that's done, it'd be better [13:16:48] PROBLEM - aqs endpoints health on aqs1016 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:16:48] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:16:49] sure [13:16:53] claime: urbanecm: Lucas_WMDE I just added a tiny beta only config patch for deployment not sure it will also run into the same issues here. [13:17:02] WMDE-Fisch: you probably will [13:17:04] But it could also wait. 
[13:17:07] probably not, if it's beta only [13:17:13] that doesn't trigger the k8s deployment [13:17:18] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:17:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 230, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:28] ah right, if it's beta only it won't matter [13:17:30] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:32] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 4 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:17:37] (03PS3) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [13:17:40] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-a-s4_3314: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s2_3312: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a- [13:17:40] Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s3_3313: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s6_3316: Servers 
dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s7_3317: Servers dbproxy1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:17:50] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [13:17:54] WMDE-Fisch: I'll deploy yours now [13:18:00] (03PS4) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [13:18:00] urbanecm: thanks [13:18:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: 10Awight) [13:18:08] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-a-s4_3314: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s2_3312: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a- [13:18:08] Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s3_3313: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s6_3316: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-a-s7_3317: Servers dbproxy1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:18:14] (03CR) 10Vgutierrez: "looks good but we're currently mostly duplicating add_upload_cors_headers, could we make sense to leverage it here as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:18:25] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:30] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:18:35] (JobUnavailable) firing: (12) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:50] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:18:52] (03Merged) 10jenkins-bot: [beta] remove flag for experimental mapdata geoshape expansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: 10Awight) [13:19:21] WMDE-Fisch: should be done [13:19:38] urbanecm: perfect [13:19:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:19:55] (ProbeDown) firing: (4) Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:05] (ProbeDown) resolved: (7) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:15] (03PS1) 10Filippo Giunchedi: sre: move prometheus/wmcs scrapefailure [alerts] - 10https://gerrit.wikimedia.org/r/905632 (https://phabricator.wikimedia.org/T309182) [13:20:17] (03PS1) 10Filippo Giunchedi: wmcs: fix deploy-tag for novafullstack [alerts] - 10https://gerrit.wikimedia.org/r/905633 (https://phabricator.wikimedia.org/T309182) [13:21:32] PROBLEM - configured eth on lvs1018 is CRITICAL: ens1f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:21:34] (03CR) 10CI reject: [V: 04-1] sre: move prometheus/wmcs scrapefailure [alerts] - 10https://gerrit.wikimedia.org/r/905632 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:21:43] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:22:02] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 33.92 ms [13:22:17] (03CR) 10CI reject: [V: 04-1] wmcs: fix deploy-tag for novafullstack [alerts] - 10https://gerrit.wikimedia.org/r/905633 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:22:32] (03PS5) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 
(https://phabricator.wikimedia.org/T327253) [13:22:44] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:22:56] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:23:25] (KafkaUnderReplicatedPartitions) firing: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:23:25] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:23:30] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:34] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:23:38] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service,netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:42] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover 
servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:23:50] (ProbeDown) firing: Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc2001.codfw.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:56] RECOVERY - aqs endpoints health on aqs1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:23:57] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:22] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:24:27] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:24:28] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:24:30] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 
(https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:24:32] (MediaWikiMemcachedHighErrorRate) resolved: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:24:38] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:24:38] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:42] (JobUnavailable) firing: (12) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:50] RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:50] RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:52] RECOVERY - aqs endpoints health on aqs1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:53] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:24:54] (03PS2) 10Filippo Giunchedi: sre: move prometheus/wmcs scrapefailure [alerts] - 
10https://gerrit.wikimedia.org/r/905632 (https://phabricator.wikimedia.org/T309182) [13:24:55] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:24:56] (03PS2) 10Filippo Giunchedi: wmcs: fix deploy-tag for novafullstack [alerts] - 10https://gerrit.wikimedia.org/r/905633 (https://phabricator.wikimedia.org/T309182) [13:24:58] RECOVERY - aqs endpoints health on aqs1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:25:04] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:25:14] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:25:24] (JobUnavailable) firing: (10) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:40] (Emergency syslog message) resolved: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:25:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:25:44] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:25:50] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:25:55] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:25:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:08] (ProbeDown) resolved: (4) Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:27:13] db2163 is row B [13:27:28] and it's codfw 🤦 [13:27:38] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:28:19] (virtual-chassis crash) resolved: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:28:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [13:28:46] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:29:22] (JobUnavailable) resolved: (12) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:29:24] (03PS1) 10Ssingh: Revert "Depool eqiad for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/905612 [13:29:28] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move prometheus/wmcs scrapefailure [alerts] - 10https://gerrit.wikimedia.org/r/905632 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:29:30] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: fix deploy-tag for novafullstack [alerts] - 10https://gerrit.wikimedia.org/r/905633 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:29:51] (virtual-chassis crash) firing: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:30:04] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:42] (03Merged) 10jenkins-bot: sre: move prometheus/wmcs scrapefailure [alerts] - 10https://gerrit.wikimedia.org/r/905632 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:30:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:30:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor1006.eqiad.wmnet [13:31:33] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [13:31:41] (03Merged) 10jenkins-bot: wmcs: fix deploy-tag for novafullstack [alerts] - 10https://gerrit.wikimedia.org/r/905633 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:33:05] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 6 failed LD(s) (Offline, Offline, Offline, Offline, Offline, Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T333960 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:33:10] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10ops-monitoring-bot) [13:33:30] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:33:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:34:20] (03CR) 10Ssingh: [C: 03+2] Revert "Depool eqiad for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/905612 (owner: 10Ssingh) [13:34:39] !log run authdns-update for CR 905612, reverting depool of eqiad [13:34:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:51] (virtual-chassis crash) resolved: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:35:03] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [13:35:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:36:26] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (10CDanis) [13:36:40] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing the task as the upgrade is done. It went extremely smoothly, thank you everybody! See you in 2 week... 
[13:36:54] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:05] (03PS1) 10Jbond: squid: add docs and small refactor [puppet] - 10https://gerrit.wikimedia.org/r/905634 [13:37:07] (03PS1) 10Jbond: squid: update squid to use logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/905635 [13:37:09] (03PS1) 10Jbond: logrotate: add support for hourly [puppet] - 10https://gerrit.wikimedia.org/r/905636 [13:37:11] (03PS1) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [13:37:13] (03PS1) 10Jbond: url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 [13:37:14] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [13:37:35] (03CR) 10CI reject: [V: 04-1] squid: add docs and small refactor [puppet] - 10https://gerrit.wikimedia.org/r/905634 (owner: 10Jbond) [13:37:41] (03CR) 10CI reject: [V: 04-1] squid: update squid to use logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/905635 (owner: 10Jbond) [13:38:03] !log reboot elastic2038 to clear soft lock [13:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:20] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:38:42] (SystemdUnitFailed) firing: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:38:45] (03CR) 10MVernon: "I think I've got as far as I'm going to with the CI myself - the transferpy Debian package is installed on the cumin nodes; 
I'm not sure h" [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:38:54] !log leave hdfs safemode T331882 [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:58] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [13:39:43] are we good to deploy again? [13:40:04] (03CR) 10CI reject: [V: 04-1] squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 (owner: 10Jbond) [13:40:05] (and congrats on the smooth upgrade!) [13:40:27] (03CR) 10CI reject: [V: 04-1] logrotate: add support for hourly [puppet] - 10https://gerrit.wikimedia.org/r/905636 (owner: 10Jbond) [13:41:12] 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) >>! In T332024#8738957, @BCornwall wrote: > Thanks for that, @ayounsi! Are you aware of https://gerrit.wikimedia.org/g/operations/software/latency-measurement ? It may or may not b... [13:41:54] !log repool ms-fe1011 re T331882 [13:41:56] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:06] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) >>! In T333863#8754335, @Trizek-WMF wrote: > Just checking: is it approved or still pending my manager's approval? > > Thank you both! Hi @Trizek-WMF: We have @Ottoma... 
[13:42:14] !log repool thanos-fe1003 re T331882 [13:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:00] I am around ;) [13:44:03] jouncebot: next [13:44:03] In 2 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1600) [13:44:09] (03PS1) 10Stevemunene: Revert "Stop Hadoop Yarn queues to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/905613 [13:44:15] jouncebot: now [13:44:15] For the next 0 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300) [13:44:15] For the next 0 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1300) [13:44:15] :) [13:44:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool es1022 T333961', diff saved to https://phabricator.wikimedia.org/P46027 and previous config saved to /var/cache/conftool/dbconfig/20230404-134415-ladsgroup.json [13:44:16] I’ll probably wait until after the end of the backport window with my change [13:44:20] T333961: Replication broke on es1022 (es4) - https://phabricator.wikimedia.org/T333961 [13:44:23] so I’m not distracted by the ongoing meeting ^^ [13:44:30] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:30] and I hope it’s okay to resume deploying by then [13:44:45] * hashar reads backlog [13:45:10] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06):
Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10ssingh) @Ottomata @odimitrijevic: This requires your approval please, thank you. @Atieno: Please read and sign the L3 "Acknowledgement of Wikim... [13:45:14] (T331882 being the main reason we paused deploying for a bit) [13:45:15] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [13:45:20] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:46:07] (03CR) 10Volans: sre.swift.remove-ghost-objects: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:48:36] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:48:43] (03CR) 10Stevemunene: [C: 03+2] Revert "Stop Hadoop Yarn queues to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/905613 (owner: 10Stevemunene) [13:49:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) @Jclark-ctr these servers are about to become a blocker for a separate project (decomming some very old different cloudvirts). Are these hosts... 
[13:49:53] (03PS6) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [13:50:25] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:50:27] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Ottomata) Approved. I believe this will require kerberos access too. [13:50:30] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:36] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CR [13:50:36] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:17] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10ssingh) [13:52:14] RECOVERY - configured eth on lvs1018 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth 
[13:52:20] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:41] (03CR) 10MVernon: sre.swift.remove-ghost-objects: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:53:56] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:25] (03PS2) 10David Caro: maintain_dbusers: don't skip the whole clouddb if one user fails [puppet] - 10https://gerrit.wikimedia.org/r/905630 (https://phabricator.wikimedia.org/T332762) [13:54:39] (03CR) 10Volans: sre.swift.remove-ghost-objects: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:58:12] RECOVERY - Bird Internet Routing Daemon on doh1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:58:16] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:36] (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:58:50] (03CR) 10MVernon: sre.swift.remove-ghost-objects: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:59:06] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:38] (03CR) 10Volans: sre.swift.remove-ghost-objects: new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [13:59:53] (03PS3) 10David Caro: maintain_dbusers: don't skip the whole clouddb if one user fails [puppet] - 10https://gerrit.wikimedia.org/r/905630 (https://phabricator.wikimedia.org/T332762) [14:01:04] hashar: I notice there’s still a fair amount of wgHooks in logspam-watch, are some wikis still on the new version? [14:01:20] I should have rolled back [14:01:26] I guess group0 is rolled back but the test wikis aren’t? [14:01:39] yes that is correct [14:01:44] ah ok [14:01:47] * Lucas_WMDE no longer in a meeting \o/ [14:01:48] so we can at least try a fix via testwikis [14:01:55] nice [14:02:00] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) [14:02:01] then there is mwdebug [14:02:14] aha, I was going to test the patch on meta, but of course meta isn’t on the new version [14:02:15] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Peachey88) [14:02:16] hrm [14:02:27] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) [14:02:31] maybe I’ll just try deploying it then [14:02:39] and hope that meta logins don’t break [14:02:51] (03PS2) 10Lucas Werkmeister (WMDE): Use HookContainer to register hooks inside hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) [14:03:08] well maybe we can cherry pick the promote commit on the deployment server then scap pull from mwdebug1001 [14:03:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) (owner: 10Lucas Werkmeister (WMDE)) [14:03:20] I’m just using scap backport now [14:03:22] at least in logstash the error is shown for a few requests made to testwikis [14:03:36] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10Trizek-WMF) @Elitre, can you approve this request? It would allow me to access Superset charts and dashboards, specifically those built for the Growth team, that require private... [14:03:36] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:03:52] (03Merged) 10jenkins-bot: Use HookContainer to register hooks inside hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905598 (https://phabricator.wikimedia.org/T333926) (owner: 10Lucas Werkmeister (WMDE)) [14:04:15] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905598|Use HookContainer to register hooks inside hooks (T333926)]] [14:04:20] T333926: PHP Deprecated: Accessing $wgHooks directly is deprecated, use HookContainer::getHandlers() or HookContainer::register() instead. 
[Called from {closure}] - https://phabricator.wikimedia.org/T333926 [14:05:17] Lucas_WMDE: I'm back, tell me if helm fails again for mw-web [14:05:24] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:05:27] ok will do, thanks [14:05:50] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:905598|Use HookContainer to register hooks inside hooks (T333926)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:06:42] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:06:52] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:07:46] ok, I logged in on mobile test wikidata on mwdebug [14:07:56] and I don’t see a wgHooks message in mwdebug logstash [14:08:11] I think that’s enough to continue with the sync, unless anyone screams :) [14:09:19] !log stevemunene@puppetmaster1001 conftool action : set/pooled=yes; selector: name=datahubsearch1003.eqiad.wmnet [14:09:27] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 33 [14:09:40] syncing [14:09:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 33 [14:09:48] !log stevemunene@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1012.eqiad.wmnet [14:09:55] !log stevemunene@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1013.eqiad.wmnet [14:10:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 128 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:10:17] !log 
stevemunene@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1018.eqiad.wmnet [14:11:02] (03CR) 10David Caro: [C: 03+2] maintain_dbusers: don't skip the whole clouddb if one user fails [puppet] - 10https://gerrit.wikimedia.org/r/905630 (https://phabricator.wikimedia.org/T332762) (owner: 10David Caro) [14:11:04] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:11:16] (03PS2) 10Jbond: logrotate: add support for hourly [puppet] - 10https://gerrit.wikimedia.org/r/905636 [14:15:06] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905598|Use HookContainer to register hooks inside hooks (T333926)]] (duration: 10m 50s) [14:15:10] T333926: PHP Deprecated: Accessing $wgHooks directly is deprecated, use HookContainer::getHandlers() or HookContainer::register() instead. [Called from {closure}] - https://phabricator.wikimedia.org/T333926 [14:15:21] claime: no error this time (just fyi) [14:15:51] Lucas_WMDE: Great, thanks [14:15:56] !log UTC afternoon backport+config window done [14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:29] last wgHooks error in logstash was at 14:14:08 UTC afaict [14:16:39] so I think that’s looking like a success so far [14:18:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40528/console" [puppet] - 10https://gerrit.wikimedia.org/r/905636 (owner: 10Jbond) [14:19:31] (03PS1) 10Lucas Werkmeister (WMDE): Extend mobile login hack for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905642 (https://phabricator.wikimedia.org/T318138) [14:19:49] if no one else needs to deploy at the moment, I’d like to test ^ on mwdebug for a sec [14:20:10] +1 [14:20:35] (03PS2) 10Jbond: squid: add docs and small refactor 
[puppet] - 10https://gerrit.wikimedia.org/r/905634 [14:20:37] (03PS2) 10Jbond: squid: update squid to use logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/905635 [14:20:39] (03PS2) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [14:20:41] (03PS2) 10Jbond: url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 [14:20:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] logrotate: add support for hourly [puppet] - 10https://gerrit.wikimedia.org/r/905636 (owner: 10Jbond) [14:21:02] (03PS1) 10Vgutierrez: hiera: Use a single socket on haproxy/varnish on cp60[08,16] [puppet] - 10https://gerrit.wikimedia.org/r/905643 (https://phabricator.wikimedia.org/T333965) [14:21:15] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/905548 [14:21:47] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene) [14:21:47] ok, pulled to mwdebug2001 [14:21:48] !log stop es1022 for debugging T333961 [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] T333961: Replication broke on es1022 (es4) - https://phabricator.wikimedia.org/T333961 [14:22:38] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40530/console" [puppet] - 10https://gerrit.wikimedia.org/r/905643 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [14:22:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40529/console" [puppet] - 10https://gerrit.wikimedia.org/r/905634 (owner: 10Jbond) [14:23:07] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/905548 (owner: 10PipelineBot) [14:23:36] (03CR) 
10Jbond: [V: 03+1 C: 03+2] squid: add docs and small refactor [puppet] - 10https://gerrit.wikimedia.org/r/905634 (owner: 10Jbond) [14:24:01] nope, doesn’t work [14:24:04] ok ^^ [14:24:28] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Use a single socket on haproxy/varnish on cp60[08,16] [puppet] - 10https://gerrit.wikimedia.org/r/905643 (https://phabricator.wikimedia.org/T333965) (owner: 10Vgutierrez) [14:24:55] (03CR) 10Ayounsi: [C: 03+2] BGP: remove local-as 14907 loops 2 for anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [14:24:57] (03Abandoned) 10Lucas Werkmeister (WMDE): Extend mobile login hack for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905642 (https://phabricator.wikimedia.org/T318138) (owner: 10Lucas Werkmeister (WMDE)) [14:25:24] (03PS3) 10Jbond: squid: update squid to use logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/905635 [14:25:25] * Lucas_WMDE done [14:25:26] (03PS3) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [14:25:29] (03PS3) 10Jbond: url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 [14:25:33] (03Merged) 10jenkins-bot: BGP: remove local-as 14907 loops 2 for anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [14:25:37] (moved deploy2002 back to the main commit and pulled mwdebug2001 again) [14:25:54] still no new wgHooks errors in logstash, yay [14:26:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] squid: update squid to use logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/905635 (owner: 10Jbond) [14:26:18] (03CR) 10Ayounsi: [C: 03+2] BGP: remove local-as 14907 loops 2 for anycast peers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [14:26:27] (03PS1) 10Andrew Bogott: glance: set enforce_secure_rbac [puppet] - 10https://gerrit.wikimedia.org/r/905645 
(https://phabricator.wikimedia.org/T330759) [14:27:50] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/905548 (owner: 10PipelineBot) [14:28:23] Lucas_WMDE: congratulations :] [14:28:29] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: tune mem overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/905602 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:28:33] !log switch cp6008 (upload) and cp6016 (text) to use a single UDS socket between haproxy and varnish - T333965 [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:37] T333965: Check if it still makes sense to have 8 varnish sockets being used by HAProxy - https://phabricator.wikimedia.org/T333965 [14:29:09] (03PS2) 10Ayounsi: Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 [14:29:57] (03CR) 10Andrew Bogott: [C: 03+2] glance: set enforce_secure_rbac [puppet] - 10https://gerrit.wikimedia.org/r/905645 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [14:30:41] (03PS1) 10Jbond: squid: add compress [puppet] - 10https://gerrit.wikimedia.org/r/905646 [14:30:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] squid: add compress [puppet] - 10https://gerrit.wikimedia.org/r/905646 (owner: 10Jbond) [14:32:13] (03CR) 10Ssingh: [C: 03+1] "Very much welcomed and looks good, the parts I could review :) [layout, IPs, hosts]." [homer/public] - 10https://gerrit.wikimedia.org/r/905257 (owner: 10Ayounsi) [14:33:16] (03PS4) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [14:33:48] (03Merged) 10jenkins-bot: rdf-streaming-updater: tune mem overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/905602 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:35:07] Heads up, we need to do an urgent deployment on wikifeeds. 
[14:35:34] ack [14:36:31] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:36:48] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [14:36:56] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:37:40] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:38:01] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:38:10] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:38:25] (03PS5) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [14:38:27] (03PS4) 10Jbond: url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) [14:38:55] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:38:59] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) @MoritzMuehlenhoff -- hi checking in on this! I just tried to log in and was not recognized. So want to make sure its not an IT issue! 
[14:39:18] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:39:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40531/console" [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [14:39:49] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:40:14] (03PS6) 10Jbond: squid: Add support for hourly log rotation [puppet] - 10https://gerrit.wikimedia.org/r/905637 [14:40:21] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [14:41:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40532/console" [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [14:42:05] (03PS5) 10Jbond: url_downloader: switch squid logs to hourly rotation [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) [14:43:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40533/console" [puppet] - 10https://gerrit.wikimedia.org/r/905638 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [14:43:25] !log jiji@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade - T331882 [14:43:29] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [14:43:42] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade -... 
[14:50:25] (03PS1) 10Jcrespo: external store: Depool es4 (cluster26) from writes for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905648 (https://phabricator.wikimedia.org/T333961) [14:52:11] (03CR) 10Ssingh: [C: 03+1] Also add component/pybal for pybaltest hosts [puppet] - 10https://gerrit.wikimedia.org/r/905543 (owner: 10Muehlenhoff) [14:53:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jhancock.wm) I'm not seeing any disk errors or failures on the tsr report. We're going to fix this foreign drive and see if it happens again. I'll leave this ticket open to the en... [14:54:25] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=metawiki -u 'Translation Notification Bot (T255246)' --auto # T255246 [14:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] T255246: Rename (and lock) Translation Notification Bot@Translation Notification Bot - https://phabricator.wikimedia.org/T255246 [14:54:46] (03CR) 10Volans: [C: 03+1] "Python wise LGTM, I'll leave it to your team for the logic." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: 10Ryan Kemper) [14:57:42] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:57:48] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:58] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:58:02] PROBLEM - MariaDB Replica Lag: es4 on es1022 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:58:10] PROBLEM - confd service on an-worker1132 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:58:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: apt-daily-upgrade.service,apt-daily.service,clean_puppet_client_bucket.service,confd_prometheus_metrics.service,export_smart_data_dump.service,hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service,ipmiseld.service,lldpd.service,logrotate.service,man-db.service,prometheus-debian-version-textfile.service,prometheus-ipmi-expor [14:58:16] 
ce,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-node-exporter.service,prometheus_intel_microcode.service,prometheus_puppet_agent_stats.service,rsyslog.service,syslog.socket,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@116.service,wmf_auto_restart_cron.servi [14:58:16] to_restart_exim4.service,wmf_auto_restart_lldpd.service,wmf_auto_restart_nagios-nrpe-server.service,wmf_auto_restart_nic-saturation-exporter.service,wmf_auto_restart_prometheus-ipmi-e https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:26] PROBLEM - puppet last run on an-worker1132 is CRITICAL: CRITICAL: Puppet last ran 9 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:58:26] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:58:28] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy [14:58:32] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.timer,wmf_auto_restart_prometheus-mysqld-exporter@s8.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:32] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:58:40] PROBLEM - mysqld processes on es1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:58:50] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with 
PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:58:54] PROBLEM - MariaDB Replica SQL: es4 on es1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:59:08] PROBLEM - MariaDB Replica IO: es4 on es1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:59:08] PROBLEM - MariaDB read only es4 on es1022 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:59:10] PROBLEM - Hadoop DataNode on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [15:00:25] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade -... 
[15:01:28] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 803280 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [15:01:54] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:02:10] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:02:23] (03CR) 10Ladsgroup: [C: 03+1] external store: Depool es4 (cluster26) from writes for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905648 (https://phabricator.wikimedia.org/T333961) (owner: 10Jcrespo) [15:05:36] (03CR) 10Ladsgroup: [C: 03+2] external store: Depool es4 (cluster26) from writes for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905648 (https://phabricator.wikimedia.org/T333961) (owner: 10Jcrespo) [15:06:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905648 (https://phabricator.wikimedia.org/T333961) (owner: 10Jcrespo) [15:06:22] (03Merged) 10jenkins-bot: external store: Depool es4 (cluster26) from writes for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905648 (https://phabricator.wikimedia.org/T333961) (owner: 10Jcrespo) [15:06:45] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]] [15:06:50] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - 
https://phabricator.wikimedia.org/T333961 [15:07:55] (03PS1) 10Ssingh: admin: add krb: present for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/905652 (https://phabricator.wikimedia.org/T332063) [15:08:10] !log ladsgroup@deploy2002 ladsgroup and jynus: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:11:25] (03CR) 10Ssingh: [C: 03+2] admin: add krb: present for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/905652 (https://phabricator.wikimedia.org/T332063) (owner: 10Ssingh) [15:12:53] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:12:59] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10jhathaway) >>! In T331706#8753210, @Ladsgroup wrote: > I'll try to take a look at the grants (it's a bit unusual for me given that's behind haproxy) but while we are... 
[15:16:29] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1150.eqiad.wmnet with reason: pending s3 reprovisioning [15:16:44] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1150.eqiad.wmnet with reason: pending s3 reprovisioning [15:18:16] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]] (duration: 11m 30s) [15:18:20] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - https://phabricator.wikimedia.org/T333961 [15:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:38] !log jiji@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row C switches upgrade - T331882 [15:19:42] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [15:23:05] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1001:~$ sudo manage_principals.py create sfaci --email_address=sfaci@wikimedia.org Principal successful... 
[15:23:10] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:23:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:54] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Here are the [[ https://docs.google.com/document/d/1T9vcUvbyWSDOFlj... [15:25:02] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool es1021 reads', diff saved to https://phabricator.wikimedia.org/P46029 and previous config saved to /var/cache/conftool/dbconfig/20230404-152501-jynus.json [15:25:22] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) My apologies, I should have done this last week and mistakenly thought it's data engineering. [15:26:48] (03PS1) 10Ladsgroup: Revert "mergeMessageFileList.php: move code out of file scope." [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905617 (https://phabricator.wikimedia.org/T333966) [15:28:20] (03PS7) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [15:29:50] jouncebot: nowandnext [15:29:50] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [15:29:50] In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1600) [15:30:09] (03CR) 10Ladsgroup: [C: 03+2] Revert "mergeMessageFileList.php: move code out of file scope." 
[core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905617 (https://phabricator.wikimedia.org/T333966) (owner: 10Ladsgroup) [15:30:27] (03CR) 10CI reject: [V: 04-1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [15:31:14] !log restart es1021, several connections in a "stuck" state T333961 [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - https://phabricator.wikimedia.org/T333961 [15:31:37] (03PS1) 10BCornwall: gitlab: Fix listen_https typo [puppet] - 10https://gerrit.wikimedia.org/r/905653 [15:32:37] (03PS8) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [15:33:29] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40534/console" [puppet] - 10https://gerrit.wikimedia.org/r/905653 (owner: 10BCornwall) [15:33:34] RECOVERY - mysqld processes on es1022 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:33:48] RECOVERY - MariaDB Replica SQL: es4 on es1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:02] RECOVERY - MariaDB Replica IO: es4 on es1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:08] RECOVERY - MariaDB read only es4 on es1022 is OK: Version 10.6.12-MariaDB-log, Uptime 77s, read_only: True, event_scheduler: True, 11.63 QPS, connection latency: 0.005605s, query latency: 0.000657s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:34:31] (03PS1) 10Hnowlan: thumbor: increase memory quota, per-container memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/905654 (https://phabricator.wikimedia.org/T333445) [15:39:27] (03CR) 10BCornwall: gitlab: Fix listen_https typo [puppet] - 10https://gerrit.wikimedia.org/r/905653 (owner: 10BCornwall) [15:39:52] (03CR) 10BCornwall: [V: 03+1] gitlab: Disable listening on port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [15:40:43] (03PS25) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [15:41:52] (03PS1) 10MVernon: spicerack: add the transferpy package [puppet] - 10https://gerrit.wikimedia.org/r/905657 (https://phabricator.wikimedia.org/T327253) [15:42:16] (03CR) 10CI reject: [V: 04-1] Revert "mergeMessageFileList.php: move code out of file scope." [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905617 (https://phabricator.wikimedia.org/T333966) (owner: 10Ladsgroup) [15:42:37] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/905657 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [15:44:30] (03CR) 10MVernon: [C: 03+2] spicerack: add the transferpy package [puppet] - 10https://gerrit.wikimedia.org/r/905657 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [15:45:26] (03PS1) 10Jgreen: Remove frdata1001 from DNS for decommissioning. [dns] - 10https://gerrit.wikimedia.org/r/905658 (https://phabricator.wikimedia.org/T333971) [15:47:17] (03CR) 10Jgreen: [C: 03+2] Remove frdata1001 from DNS for decommissioning. 
[dns] - 10https://gerrit.wikimedia.org/r/905658 (https://phabricator.wikimedia.org/T333971) (owner: 10Jgreen) [15:48:08] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:19] (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [15:49:32] (03PS2) 10KartikMistry: Remove akwiki from CX config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904952 [15:49:39] !log dancy@deploy2002 Installing scap version "4.48.0" for 592 hosts [15:49:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) There are hundreds of kernel log entries relating to failures of these drives and the filesystems are unusable. It's not previously been a problem to replace bad dr... 
[15:50:35] !log dancy@deploy2002 Installation of scap version "4.48.0" completed for 592 hosts [15:52:21] (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [15:55:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [15:55:42] PROBLEM - MariaDB Replica IO: es4 on es1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:58:09] !log restart es1021, several connections in a "stuck" state T333961 [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:13] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - https://phabricator.wikimedia.org/T333961 [15:58:30] PROBLEM - mysqld processes on es1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:58:44] PROBLEM - MariaDB Replica SQL: es4 on es1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:04] PROBLEM - MariaDB read only es4 on es1022 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:59:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on es1022.eqiad.wmnet with reason: T333961 [15:59:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 3 days, 0:00:00 on es1022.eqiad.wmnet with reason: T333961 [15:59:45] (03CR) 10Jbond: [C: 03+2] "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/905160 (owner: 10Jbond) [16:00:05] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:47] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool es1021 for reads (only 10%)', diff saved to https://phabricator.wikimedia.org/P46030 and previous config saved to /var/cache/conftool/dbconfig/20230404-160146-jynus.json [16:04:00] (03CR) 10Dzahn: [C: 03+1] "lgtm, thanks for compiling, I like the campsite analogy" [puppet] - 10https://gerrit.wikimedia.org/r/905653 (owner: 10BCornwall) [16:05:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "clouddumps: depool clouddumps1002" [puppet] - 10https://gerrit.wikimedia.org/r/905610 (owner: 10Arturo Borrero Gonzalez) [16:06:04] (03CR) 10Jbond: [C: 03+1] "lgtm adding simon and moritz who have been working on ldap production recently" [puppet] - 10https://gerrit.wikimedia.org/r/905600 (owner: 10Majavah) [16:06:37] (03CR) 10Brennen Bearnes: [C: 03+1] "As far as I know, this only exists to serve the redirect. Seems ok by me at this point." 
[puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [16:06:39] (03PS1) 10Ladsgroup: Revert "external store: Depool es4 (cluster26) from writes for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905623 (https://phabricator.wikimedia.org/T333961) [16:07:02] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool es1021 for reads', diff saved to https://phabricator.wikimedia.org/P46031 and previous config saved to /var/cache/conftool/dbconfig/20230404-160702-jynus.json [16:07:04] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:53] (03CR) 10Ladsgroup: [C: 03+2] Revert "external store: Depool es4 (cluster26) from writes for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905623 (https://phabricator.wikimedia.org/T333961) (owner: 10Ladsgroup) [16:09:37] (03Merged) 10jenkins-bot: Revert "external store: Depool es4 (cluster26) from writes for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905623 (https://phabricator.wikimedia.org/T333961) (owner: 10Ladsgroup) [16:10:05] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]] [16:10:09] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - https://phabricator.wikimedia.org/T333961 [16:10:28] (03CR) 10Dzahn: "While this seems ok to me and I think you know what you are doing - this is kind of the wrong team for this kind of thing nowadays" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [16:11:30] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [16:14:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:58] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) While working on this, I found a container where the `codfw` containers are consistent, but the listing is wrong. The objects that exist are all the sam... [16:17:37] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]] (duration: 07m 31s) [16:17:42] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/904502/40536/ - lgtm, I am not around though to deploy right this moment" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [16:17:42] T333961: es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)] - https://phabricator.wikimedia.org/T333961 [16:17:45] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) These look to be relatively old objects - I looked one up in a container DB and it was created in 2014. 
[16:18:02] RECOVERY - IPMI Sensor Status on db2163 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:18:43] (03CR) 10Dzahn: [C: 03+1] gitlab: Disable listening on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [16:18:57] (03Abandoned) 10Dzahn: planet: if on bullseye, install package contents via puppet [puppet] - 10https://gerrit.wikimedia.org/r/898982 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [16:29:39] (03CR) 10Dzahn: gerrit: replace Icinga with Prometheus monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:34:50] (03CR) 10Dzahn: [C: 04-1] gerrit: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:36:24] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:36:51] (03CR) 10Dzahn: [C: 03+2] alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:37:41] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:37:53] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:38:04] (03CR) 10Dzahn: "@hashar while we still discuss the HTTPS check details, maybe we can start with the SSH port check? 
what do you think" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:38:57] (03CR) 10Dzahn: [C: 03+2] beta: Enable /srv/mediawiki symlink on deployment-deploy03 [puppet] - 10https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [16:39:22] (03CR) 10Dzahn: [C: 03+2] "seems like it has a dependency on another change, but not sure if intentional or not" [puppet] - 10https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [16:40:47] (03CR) 10Dzahn: "I don't think I am the right reviewer for mediawiki deployment, tbh." [puppet] - 10https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [16:42:50] (03CR) 10Ahmon Dancy: beta: Enable /srv/mediawiki symlink on deployment-deploy03 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [16:43:26] jouncebot: nowandnext [16:43:26] For the next 0 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1600) [16:43:26] In 0 hour(s) and 16 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1700) [16:43:36] (03PS1) 10DCausse: rdf-streaming-updater: fix jvm-overhead.fraction [deployment-charts] - 10https://gerrit.wikimedia.org/r/905686 [16:43:40] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "mergeMessageFileList.php: move code out of file scope." [core] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905617 (https://phabricator.wikimedia.org/T333966) (owner: 10Ladsgroup) [16:44:39] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905617|Revert "mergeMessageFileList.php: move code out of file scope." (T333966)]] [16:44:43] T333966: message keys shown on beta and group0 wikis (test Wikidata, etc.) 
- https://phabricator.wikimedia.org/T333966 [16:45:29] Hey all - (jbond rvl) - would it be ok if I tried to get a PrivateSettings.php update out here in a bit? I know the puppet and infra windows are happening now, but this should be fairly quick> [16:47:51] sbassett: puppet window is empty, not actually happening [16:47:53] (03CR) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [16:48:33] so I can't speak for infra window, but puppet isnt an issue that should block you [16:50:18] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: fix jvm-overhead.fraction [deployment-charts] - 10https://gerrit.wikimedia.org/r/905686 (owner: 10DCausse) [16:50:50] (03CR) 10Dzahn: [C: 03+1] "the serviceops team would be best for that" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [16:54:37] Tx, mutante [16:54:39] Infra window should be clear sbassett [16:54:50] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix jvm-overhead.fraction [deployment-charts] - 10https://gerrit.wikimedia.org/r/905686 (owner: 10DCausse) [16:55:40] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:55:44] (03CR) 10Volans: "I don't have enough swift context to comment on the whole logic, left some general comments cookbook/python wise. 
Feel free to ping me if " [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [16:55:46] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:55:58] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:56:08] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1700) [17:02:45] tx claims. Looks like Amir1 still has scap locked rn... [17:03:19] sorry, claime ^ [17:04:07] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:905617|Revert "mergeMessageFileList.php: move code out of file scope." (T333966)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [17:04:11] T333966: message keys shown on beta and group0 wikis (test Wikidata, etc.) - https://phabricator.wikimedia.org/T333966 [17:04:26] (03CR) 10Dzahn: "Hello, this is the same thing you kindly reviewed yesterday and we did without issues for the "preview" site. 
Just now it's the actual htt" [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [17:06:25] (03PS2) 10Dzahn: wdqs/wcqs: switch query.wikidata.org and wcqs to bullseye backends [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) [17:06:56] sbassett: yes just to confirm dont let the puppet window block [17:07:22] yup [17:07:23] sorry [17:07:49] sbassett: my apologies [17:07:57] it should be over soon [17:09:05] (03CR) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [17:10:19] (03CR) 10BCornwall: [V: 03+1 C: 03+2] gitlab: Disable listening on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [17:13:16] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Accounting Spreadsheet that are still repor... 
[17:13:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:14:34] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:19:37] (03PS1) 10DCausse: rdf-streaming-updater: fix mem limits unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/905690 [17:20:30] (03Abandoned) 10Dzahn: drop people.eqiad.wmnet service alias [dns] - 10https://gerrit.wikimedia.org/r/891732 (owner: 10Dzahn) [17:21:28] (03CR) 10Dzahn: "what my intention was should be resolved by https://gerrit.wikimedia.org/r/c/operations/puppet/+/905317 So I will abandon this." 
[puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [17:21:32] (03Abandoned) 10Dzahn: remove commons-query virtual host from httpd on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/893086 (https://phabricator.wikimedia.org/T330090) (owner: 10Dzahn) [17:22:08] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10BCornwall) [17:22:24] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:22:58] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905617|Revert "mergeMessageFileList.php: move code out of file scope." (T333966)]] (duration: 38m 18s) [17:23:03] T333966: message keys shown on beta and group0 wikis (test Wikidata, etc.) 
- https://phabricator.wikimedia.org/T333966 [17:25:44] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:27:26] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:34:17] (03Abandoned) 10Dzahn: doc.wikimedia.org: switch active host from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) (owner: 10Dzahn) [17:37:18] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1003.wikimedia.org (A3) - https://phabricator.wikimedia.org/T333996 (10Jelto) [17:37:27] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) - https://phabricator.wikimedia.org/T333997 (10Jelto) [17:37:54] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1003.wikimedia.org (A3) - https://phabricator.wikimedia.org/T333996 (10Jelto) [17:41:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jclark-ctr) @MatthewVernon I Talked to Jen about this server it still had foreign state for drive 19 that we corrected this morning. Could this server be reimaged again Possibl... [17:42:00] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MoritzMuehlenhoff) @FNavas-foundation: We still need feedback by @Ottomata in the kind of access you'll need to access these dashboards. The access requests are proces... 
[17:43:21] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) >>! In T320955#8755710, @wiki_willy wrote: > Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Ac... [17:43:55] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: fix mem limits unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/905690 (owner: 10DCausse) [17:47:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) @Jclark-ctr you'll see from the history of this ticket that the foreign drive state was fixed by Papaul first [[ https://phabricator.wikimedia.org/T332983#8738546 |... [17:48:31] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix mem limits unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/905690 (owner: 10DCausse) [17:48:33] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) Hiya, I believe task description is not accurate, but in T331482#8735917 it looks like what is needed is ssh-less membership in analytics-privatedata-users.... [17:50:59] 10SRE, 10Traffic, 10serviceops-collab, 10GitLab (Infrastructure), 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Jelto) I send a short message on wikitech-l, in case something breaks on GitLab so users are aware o... 
[17:53:28] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Jelto) [17:53:40] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jclark-ctr) @MatthewVernon i did review the history and did see papaul mentioned he corrected foreign state. I am new to this ticket and i always hope everyone presses the butto... [17:59:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jclark-ctr) Service Request 16561410 [18:00:04] hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1800). [18:05:10] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:05:20] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:08:39] (03CR) 10Dzahn: [C: 03+1] scap: block Scap execution on inactive deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [18:11:30] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:25] jouncebot: now [18:13:25] For the next 1 hour(s) and 46 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1800) [18:15:36] (03CR) 10Dzahn: [C: 03+1] "best I have right now is the "profile::contacts::role_contacts" in the actual puppet repo.. it maps roles to teams. 
so for example see the" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [18:15:42] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:17:56] (03PS2) 10Jameel Kaisar: Add timing headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) [18:19:53] (03CR) 10Jameel Kaisar: Add timing headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [18:27:12] (03PS1) 10Andrew Bogott: Pass enforce_policy_scope and enforce_new_policy_defaults to cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/905700 (https://phabricator.wikimedia.org/T330759) [18:34:08] (03PS2) 10Andrew Bogott: Pass enforce_policy_scope and enforce_new_policy_defaults to cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/905700 (https://phabricator.wikimedia.org/T330759) [18:36:41] (03CR) 10Andrew Bogott: [C: 03+2] Pass enforce_policy_scope and enforce_new_policy_defaults to cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/905700 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [18:38:06] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Answering some specific questions from Eric: > Will disparate WMF... 
[18:54:30] PROBLEM - Host mw2267 is DOWN: PING CRITICAL - Packet loss = 100% [18:54:54] RECOVERY - Host mw2267 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:59:40] jouncebot: next [18:59:40] In 1 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T2000) [19:02:07] jouncebot: now [19:02:07] For the next 0 hour(s) and 57 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1800) [19:02:09] ah [19:02:24] dduvall: I am doing the group0 promotion to 1.41.0-wmf.3 [19:02:43] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905704 (https://phabricator.wikimedia.org/T330209) [19:02:47] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905704 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [19:03:32] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905704 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [19:04:02] hashar: you’re primary this week, right? I’m available to do it if you need me [19:04:10] yes sir [19:04:39] hopefully we have caught all the issues earlier today :-] [19:05:34] :) [19:06:20] hashar: oh, sorry. i read "I am doing" as "am I doing?" :D [19:06:32] transpositional brain [19:06:33] clearly we need to adopt french as an official language sorry [19:06:40] (03PS1) 10Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus3002 [puppet] - 10https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) [19:06:41] haha, {{done}} [19:06:42] I AM doing the train right now [19:07:05] w00t! 
[19:07:24] trains are mostly boring these days [19:07:50] i'll take my trains boring, thank you [19:10:09] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.3 refs T330209 [19:10:13] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209 [19:10:27] 📺 🚆 🍿 [19:18:02] Reedy: you are too fast, by the time I file the paperwork a patch is already present :] [19:31:17] (03PS1) 10Bartosz Dziewoński: EditCheck: catch errors from TransactionSquasher [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/905685 (https://phabricator.wikimedia.org/T324733) [19:31:28] (03PS1) 10Hashar: Replace usages of Hooks::register() [extensions/LdapAuthentication] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905726 (https://phabricator.wikimedia.org/T334005) [19:31:56] (03PS1) 10Bartosz Dziewoński: Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905727 (https://phabricator.wikimedia.org/T324733) [19:31:58] Reedy: bd808: and there is the cherry pick for wmf branch, if one of you can +2 it. 
I will deploy when it is merged [19:32:23] (03CR) 10BryanDavis: [C: 03+2] Replace usages of Hooks::register() [extensions/LdapAuthentication] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905726 (https://phabricator.wikimedia.org/T334005) (owner: 10Hashar) [19:33:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:11] self-merge of cherry-picks is generally considered to be ok, but happy to throw +2 around wildy ;) [19:33:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:15] (03Merged) 10jenkins-bot: Replace usages of Hooks::register() [extensions/LdapAuthentication] (wmf/1.41.0-wmf.3) - 10https://gerrit.wikimedia.org/r/905726 (https://phabricator.wikimedia.org/T334005) (owner: 10Hashar) [19:35:32] (03PS2) 10Bartosz Dziewoński: Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905727 (https://phabricator.wikimedia.org/T324733) [19:35:44] (03PS4) 10Bartosz Dziewoński: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) [19:37:53] (03PS1) 10Hashar: wm-zuul-status: change pending jobs SUCCESS > INFO [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/905708 (https://phabricator.wikimedia.org/T214068) [19:38:25] jouncebot: now [19:38:25] For the next 0 hour(s) and 21 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T1800) [19:38:36] bd808: thanks :) [19:38:58] if everything breaks, blame 
Reedy! ;) [19:39:03] for sure! [19:39:19] !log hashar@deploy2002 Started scap: Backport for [[gerrit:905726|Replace usages of Hooks::register() (T334005)]] [19:39:23] T334005: PHP Deprecated: Use of Hooks::register was deprecated in MediaWiki 1.35. [Called from LdapPrimaryAuthenticationProvider::__construct] - https://phabricator.wikimedia.org/T334005 [19:40:50] !log hashar@deploy2002 hashar: Backport for [[gerrit:905726|Replace usages of Hooks::register() (T334005)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [19:41:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:14] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:905726|Replace usages of Hooks::register() (T334005)]] (duration: 06m 55s) [19:46:19] T334005: PHP Deprecated: Use of Hooks::register was deprecated in MediaWiki 1.35. 
[Called from LdapPrimaryAuthenticationProvider::__construct] - https://phabricator.wikimedia.org/T334005 [19:46:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:50:38] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite) [19:51:03] (03PS3) 10Dzahn: wdqs/wcqs: switch query.wikidata.org and wcqs to bullseye backends [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) [19:51:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:51:38] train looks good, there is an issue with GrowthExperiments which I have filed as T334012 but that does not seem too bad [19:51:39] T334012: Wikimedia\NormalizedException\NormalizedException: {parameter1} - https://phabricator.wikimedia.org/T334012 [19:51:56] (03CR) 10Cwhite: [C: 03+2] logstash: restore logstash index patch level [puppet] - 10https://gerrit.wikimedia.org/r/904265 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite) [19:51:59] (03PS4) 10Dzahn: wdqs/wcqs: switch query.wikidata.org and wcqs to bullseye backends [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) [19:52:16] (03CR) 10Ryan Kemper: [C: 03+1] wdqs/wcqs: switch query.wikidata.org and wcqs to bullseye backends [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [19:52:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:52:36] (03CR) 10Dzahn: [C: 03+2] wdqs/wcqs: switch query.wikidata.org and wcqs to bullseye backends [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [19:52:49] dduvall: train looks good [19:53:08] there is something with GrowthExperiments but I am not sure how disruptive it is to the app [19:53:32] and some deprecation notice for wikitech but that got hotfixed faster than I wrote this message [19:54:23] (03PS1) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905709 [19:54:25] (03PS1) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905710 [19:54:27] (03PS1) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) [19:54:29] (03PS1) 10Bartosz Dziewoński: Remove weird VisualEditor config hack from 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905712 [19:55:15] !log https://query.wikidata.org and WCQS GUIs are switching to new backend VMs on bullseye in codfw T330090 T331896 [19:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:21] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896 [19:55:21] T330090: Switchover static miscweb services to codfw - https://phabricator.wikimedia.org/T330090 [19:56:07] hashar: yay! 
ty [19:56:27] (03CR) 10Hashar: [C: 03+2] wm-zuul-status: change pending jobs SUCCESS > INFO [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/905708 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [19:56:36] jouncebot: next [19:56:36] In 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T2000) [19:56:48] next thing, write doc for that gerrit javascript so I can get proper reviews :) [19:56:59] (03Merged) 10jenkins-bot: wm-zuul-status: change pending jobs SUCCESS > INFO [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/905708 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [19:57:45] i have 2 patches in the next window, but if any deployers are feeling bored, i have a couple config cleanup patches that we could also do [19:58:16] !log hashar@deploy2002 Started deploy [gerrit/gerrit@dbaaa7a]: wm-zuul-status: change pending jobs SUCCESS > INFO | T214068 [19:58:20] https://gerrit.wikimedia.org/r/q/project:operations/mediawiki-config+owner:dziewonski+is:open [19:58:21] T214068: Display Zuul status of jobs for a change on Gerrit UI - https://phabricator.wikimedia.org/T214068 [19:58:24] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@dbaaa7a]: wm-zuul-status: change pending jobs SUCCESS > INFO | T214068 (duration: 00m 07s) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T2000). [20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:12] wotcha [20:00:15] !log T331896 Running puppet on wdqs fleet to pickup new miscweb gui_url: `ryankemper@cumin1001:~$ sudo -E cumin -b 6 'wdqs*' 'run-puppet-agent'` [20:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:20] * TheresNoTime can deploy [20:01:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/905685 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:01:49] eh we'll do 905727 while we wait on ^ [20:02:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905727 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:03:10] !log running puppet on cp5*, cp4*... [20:03:11] (03Merged) 10jenkins-bot: Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905727 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:36] !log samtar@deploy2002 Started scap: Backport for [[gerrit:905727|Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" (T324733)]] [20:03:40] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:04:02] (03CR) 10Cwhite: [C: 03+2] logstash: add thanos-query ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902334 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:05:02] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:905727|Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" (T324733)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug1001.eqiad.wmnet [20:05:23] MatmaRex: can you test 905727 ? [20:05:25] fyi, i can't really test this, we haven't managed to reproduce the errors this catches [20:05:31] but i will watch the error logs afterwards [20:05:32] ah, okay, syncing [20:05:33] if that's okay [20:05:54] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:27] if anything is wrong, errors would show up in this logstash view: https://logstash.wikimedia.org/goto/250706a6d2eba3b5f62559b1d35bb31c [20:06:36] ack, will keep an eye [20:06:49] !log T331896 Running puppet on wcqs fleet to pickup new miscweb gui_url: `ryankemper@cumin1001:~$ sudo -E cumin -b 2 'wcqs*' 'run-puppet-agent'` [20:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:53] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896 [20:07:39] (03PS1) 10Hashar: admin: hashar: some more git aliases [puppet] - 10https://gerrit.wikimedia.org/r/905715 [20:10:05] !log deploying ATS config change on cp2* for query.wikidata.org [20:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:07] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:905727|Revert "Revert "Enable hidden tag for "Edit Check" project on Wikipedias"" (T324733)]] (duration: 07m 30s) [20:11:10] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:11:12] dduvall: there are a few more errors but I don't think any of them warrant a rollback. 
I will file them tomorrow morning [20:13:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:14:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/905685 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:15:52] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:16:46] (03PS1) 10Ottomata: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/905716 [20:16:58] (03CR) 10Dzahn: [C: 03+2] "deployed together carefully, batches of cp servers at a time while we looked at logs on miscweb* apaches, checked / and /querybuilder, cou" [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [20:18:23] (03CR) 10Dzahn: [C: 03+2] "thank you Ryan Kemper for doing this together" [puppet] - 10https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [20:18:29] (03Merged) 10jenkins-bot: EditCheck: catch errors from TransactionSquasher [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/905685 
(https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:18:57] !log samtar@deploy2002 Started scap: Backport for [[gerrit:905685|EditCheck: catch errors from TransactionSquasher (T324733)]] [20:19:01] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:20:21] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:905685|EditCheck: catch errors from TransactionSquasher (T324733)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:20:26] the train looks quiet. I filed a few more error log related tasks but nothing that sounds worrying [20:20:33] happy backport window. I am off to bed [20:20:37] MatmaRex: live on mwdebug, can you test? ^ [20:20:40] hashar: o/ [20:20:48] :-] [20:20:50] yeah [20:21:50] TheresNoTime: looks good [20:21:57] syncin' [20:23:47] !log bking@cumin1001 unban elastic nodes post switch maintenance T331882 [20:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:52] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 [20:24:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:24:18] MatmaRex: FYI, couple of errors on https://logstash.wikimedia.org/app/discover#/?_g=h@1177e1c&_a=h@c79fcfe [20:25:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:25:49] TheresNoTime: hmm, i just realized we should have synced the changes in the opposite order :/ i think that's the cause of the errors [20:26:05] ah [20:26:20] it should be fine after the wmf.2 patch is deployed
[20:27:20] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:905685|EditCheck: catch errors from TransactionSquasher (T324733)]] (duration: 08m 23s) [20:27:26] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:28:32] TheresNoTime: it may take a while for the errors to fully stop occurring, because of people who have already opened the editor with the faulty version of the code [20:28:38] so don't be alarmed if you see a few more [20:29:04] (it's the same reason why there are a few occurrences during the previous week, even days after we reverted the code) [20:29:24] Okay :) was going to deploy `903781: Clean up history page visual diffs beta feature config | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/903781` as one of the misc. ones you mentioned, sound okay? [20:29:26] (people just keep the editor open in a browser tab for days routinely) [20:29:41] TheresNoTime: absolutely, that should be fine to deploy [20:29:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) (owner: 10Bartosz Dziewoński) [20:31:16] (03Merged) 10jenkins-bot: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 (https://phabricator.wikimedia.org/T333448) (owner: 10Bartosz Dziewoński) [20:31:41] !log samtar@deploy2002 Started scap: Backport for [[gerrit:903781|Clean up history page visual diffs beta feature config (T333448)]] [20:31:46] T333448: Remove the BetaFeatures integration for historical visual diffs - https://phabricator.wikimedia.org/T333448 [20:33:04] !log samtar@deploy2002 matmarex and samtar: Backport for [[gerrit:903781|Clean up history page visual diffs beta feature config (T333448)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:33:07] * TheresNoTime will sync [20:34:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:35:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:38:16] (03PS1) 10Xcollazo: Add metric alert for section image suggestions. [alerts] - 10https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) [20:38:23] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:903781|Clean up history page visual diffs beta feature config (T333448)]] (duration: 06m 42s) [20:38:28] T333448: Remove the BetaFeatures integration for historical visual diffs - https://phabricator.wikimedia.org/T333448 [20:39:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:39:25] Unrelated to current deploy, I'm seeing a greater than normal number of `DBTransactionStateError` / `DBQueryError`s [20:39:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:40:17] (*looking at the last ~6 hours) [20:41:06] (03PS2) 10Xcollazo: structured-data: Add metric alert for section image suggestions. 
[alerts] - 10https://gerrit.wikimedia.org/r/905719 (https://phabricator.wikimedia.org/T328789) [20:43:00] thanks for deploying [20:43:07] jouncebot: now [20:43:07] For the next 0 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230404T2000) [20:44:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:44:09] !log closing UTC late backport window [20:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on miscweb2002.codfw.wmnet with reason: decom [20:44:32] thanks for that log :) [20:44:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on miscweb2002.codfw.wmnet with reason: decom [20:44:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:45:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:46:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:47:45] (03PS1) 10AOkoth: exim: fix hard-coded vrts hostname [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) [20:47:48] 10SRE, 10DBA, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - 
https://phabricator.wikimedia.org/T334023 (10TheresNoTime) [20:49:27] Hey all - finally deploying the security mitigation for T333140 [20:55:13] 10SRE, 10DBA, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) Looking at the latest errors, it seems related to ruwiki & discussion tools. [20:56:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:41] !log Deployed mitigation for T333140 [20:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:51] (03CR) 10Dzahn: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [21:01:18] (03CR) 10Dzahn: "I think it's a very good catch by Arnold to point out these hardcoded host names in global mx template. So adding more people to agree how" [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [21:01:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:44] (03CR) 10Dzahn: "see my inline comments. 
but also I am thinking maybe VRTS should have its exim4.conf.erb inside its own module and not use the one from ro" [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [21:03:58] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) [21:04:29] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) >>! In T334023#8756844, @jcrespo wrote: > Looking at the latest errors, it seems related to ~~ruwiki~~ & discussion too... [21:15:35] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) If this is caused by a deploy, I would revert, as it is overloading ruwiki dbs (s6). If this is caused by traffic, I wo... [21:16:05] (03PS1) 10Eevans: sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce} [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) [21:16:28] (03CR) 10CI reject: [V: 04-1] sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce} [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [21:18:48] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) It stopped now, but for some time writes got 40x the normal rate in s6: https://grafana.wikimedia.org/goto/qCzK_vLVz?or... 
[21:22:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts miscweb2002.codfw.wmnet [21:22:37] (03PS2) 10Eevans: sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce} [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) [21:25:51] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/905746 (https://phabricator.wikimedia.org/T327954) (owner: 10Eevans) [21:26:45] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [21:27:36] (03PS1) 10Dzahn: remove miscweb2002, was commented out fail-over machine [dns] - 10https://gerrit.wikimedia.org/r/905748 (https://phabricator.wikimedia.org/T334024) [21:27:46] (03CR) 10CI reject: [V: 04-1] remove miscweb2002, was commented out fail-over machine [dns] - 10https://gerrit.wikimedia.org/r/905748 (https://phabricator.wikimedia.org/T334024) (owner: 10Dzahn) [21:28:14] (03PS2) 10Dzahn: remove miscweb2002, was commented out fail-over machine [dns] - 10https://gerrit.wikimedia.org/r/905748 (https://phabricator.wikimedia.org/T334024) [21:37:29] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [21:39:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: miscweb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [21:39:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:39:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts miscweb2002.codfw.wmnet [21:40:22] (03PS1) 10Bartosz Dziewoński: Stop using redundant $wmg variable for MobileFrontend extension (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905750 [21:40:24] (03PS1) 
10Bartosz Dziewoński: Stop using redundant $wmg variable for MobileFrontend extension (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905751 (https://phabricator.wikimedia.org/T119117) [21:42:26] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:49:14] (03PS1) 10Dzahn: delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) [21:49:21] (03CR) 10CI reject: [V: 04-1] delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:51:14] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10matmarex) (summarizing from IRC:) The DiscussionTools errors are a known issue (T323077), but their rate so far has been low. We... 
[21:55:34] (03PS2) 10JHathaway: Add an in place Debian upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) [21:56:11] (03CR) 10JHathaway: "Thanks for reviewing @jbond" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [22:00:14] PROBLEM - Check systemd state on cp3060 is CRITICAL: CRITICAL - degraded: The following units failed: varnishkafka-webrequest.service,varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:36] PROBLEM - Webrequests Varnishkafka log producer on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:00:38] PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:01:42] RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:02:04] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) Can it be related to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905727 ? If not, it can be simpl... [22:02:16] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Greater than average number of DBTransactionStateError/DBQueryErrors - https://phabricator.wikimedia.org/T334023 (10jcrespo) After looking at the binlogs, there is a lot of inserts happening in a very short period of time. While T323077 itself... 
[22:02:24] (03PS2) 10Dzahn: delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) [22:03:21] (03CR) 10Dzahn: [C: 03+2] remove miscweb2002, was commented out fail-over machine [dns] - 10https://gerrit.wikimedia.org/r/905748 (https://phabricator.wikimedia.org/T334024) (owner: 10Dzahn) [22:06:10] (03PS5) 10Dzahn: miscweb/site: remove miscweb2002 from site [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896) [22:06:29] (03CR) 10Dzahn: [C: 03+2] "decom cookbook has finished and shut it down" [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [22:08:00] RECOVERY - Check systemd state on cp3060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:00] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) T323077 is a deadlock and my guess is...
[22:08:08] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10jcrespo) [22:08:24] RECOVERY - Webrequests Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:09:41] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [22:10:52] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10jcrespo) I've updated the ticket to make explic... [22:12:56] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) 05Resolved→03Open Hi, Sorry, but it happened again while uploading https://commons.wikimedia.o... [22:13:51] (03CR) 10JHathaway: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [22:17:10] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Hi @Volans - yup, that's correct. They've all been recycled. For any equipment that we've already sent out for recycling, we've been adding the asset tags and... 
[22:18:51] (03CR) 10Dzahn: exim: fix hard-coded vrts hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [22:21:32] (03CR) 10Dzahn: "also, I see the "route" itself in exim is called "otrs". Arnold has already renamed almost everything else except this and the db name or " [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [22:23:47] jouncebot: now [22:23:47] No deployments scheduled for the next 7 hour(s) and 36 minute(s) [22:25:36] (03CR) 10Dzahn: [C: 03+2] "Deploying this regardless. I checked additionally on mwdebug etc: https://puppet-compiler.wmflabs.org/output/904502/40542/mwdebug1002.eqia" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [22:26:14] (03CR) 10Dzahn: [C: 03+2] "there is also finally no deployment window now" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [22:26:58] !log deploying change to block scap execution on inactive deployment server via gerrit:904502 T330756 [22:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:02] T330756: Improve behavior around global Scap lock + communicate changes - https://phabricator.wikimedia.org/T330756 [22:35:59] 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10Ladsgroup) it subsided drastically now, I tried... 
[22:36:44] (PS1) Andrew Bogott: wikireplica_dns.yaml: remove osm entry [puppet] - https://gerrit.wikimedia.org/r/905759
[22:36:46] (PS1) Andrew Bogott: wikireplica_dns.yaml: make legacy tools-db names cnames for the wmcloud domain [puppet] - https://gerrit.wikimedia.org/r/905760 (https://phabricator.wikimedia.org/T333471)
[22:40:26] (CR) Dzahn: [C: +2] "checked: mwdebug1002. mw1451: added "block_execution: false" to /etc/scap.cfg - deploy1002: block_execution: true deploy2002: lock_execu" [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[22:40:37] (CR) Andrew Bogott: [C: -2] "This is bad, tools.db.svc.wikimedia.cloud is a cname pointing to tools.db.svc.eqiad.wmflabs. So this patch removes the A record entirely a" [puppet] - https://gerrit.wikimedia.org/r/905760 (https://phabricator.wikimedia.org/T333471) (owner: Andrew Bogott)
[22:41:09] (CR) Dzahn: [C: +2] "execution should now be blocked on deploy1002" [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[22:45:01] (CR) Clare Ming: [C: +1] VisualEditorFeatureUse sampling rate to 1 everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/905601 (https://phabricator.wikimedia.org/T333168) (owner: Phuedx)
[22:45:37] (CR) Clare Ming: [C: +1] mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) (owner: Phuedx)
[22:45:38] SRE, ops-eqiad, Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (Jclark-ctr) Open ticket with dell Confirmed: Service Request 165628610 was successfully submitted.
[22:47:20] SRE, ops-eqiad, Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (Jclark-ctr) Open→Resolved T333091 duplicate ticket
[22:48:10] SRE, ops-eqiad, Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (Jclark-ctr) Submitted 2nd ticket Open ticket with dell Confirmed: Service Request 165628610 was successfully submitted. They have not responded to 1st ticket except for asking for address a...
[22:55:49] !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[22:57:46] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudvirtlocal - jclark@cumin1001"
[22:58:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudvirtlocal - jclark@cumin1001"
[22:58:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:59:31] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Jclark-ctr)
[23:00:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[23:12:01] (CR) Aaron Schulz: [C: +1] Re-enable xenon/excimer after mwlog1002 switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/905608 (owner: Tim Starling)
[23:13:20] (CR) Tim Starling: [C: +2] Re-enable xenon/excimer after mwlog1002 switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/905608 (owner: Tim Starling)
[23:14:07] (Merged) jenkins-bot: Re-enable xenon/excimer after mwlog1002 switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/905608 (owner: Tim Starling)
[23:21:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[23:21:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[23:22:50] (CR) EoghanGaffney: "This change is ready for review." [cookbooks] - https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: EoghanGaffney)
[23:25:13] !log tstarling@deploy2002 Synchronized src/Profiler.php: re-enable excimer T331882 (duration: 06m 25s)
[23:25:18] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[23:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:28:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[23:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:34:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[23:40:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1002.mgmt.eqiad.wmnet with reboot policy FORCED
[23:53:21] (PS3) Ladsgroup: Revert "Revert "Revert "Revert "mwscript: Switch to use run.php"""" [mediawiki-config] - https://gerrit.wikimedia.org/r/905609 (https://phabricator.wikimedia.org/T326800)