[00:08:13] PROBLEM - Host db1262 #page is DOWN: PING CRITICAL - Packet loss = 100% [00:08:25] o/ [00:10:12] RECOVERY - Host db1262 #page is UP: PING WARNING - Packet loss = 80%, RTA = 0.31 ms [00:10:26] PROBLEM - MariaDB Replica IO: s4 #page on db1262 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:10:47] * swfrench-wmf is around if you need more hands for anything [00:10:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... [00:10:51] via PacketFabric) {#12243_12334-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [00:10:56] PROBLEM - MariaDB Replica SQL: s4 #page on db1262 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:19] PROBLEM - MariaDB read only s4 on db1262 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [00:11:33] I'm DMing with ChrisDobbins901_ - I thought we just had to depool a bad DB host but that TransitPeeringOutboundSaturation alert changes the story [00:11:55] PROBLEM - mysqld processes on db1262 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:13:03] looking to see if there's anything traffic-driven here [00:13:57] db1262 is s4 so something commons-y and load-driven would definitely fit [00:14:41] I'm around [00:14:42] gotcha. is depooling it still advised? 
[00:14:42] yay [00:14:42] odd that it's a codfw peering, but an eqiad read replica [00:15:02] let me check [00:15:09] thanks Amir1 <3 [00:15:32] ChrisDobbins901_: let's hold off for now -- if it does turn out to be driven by traffic, that would just shift the load to another replica and knock that one over instead [00:15:37] thank you, Amir1! [00:15:44] don't depool it for now, mw automatically depools hosts that are lagged or unresponsive for certain period [00:15:46] ack and thanks [00:15:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... [00:15:51] via PacketFabric) {#12243_12334-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [00:17:10] okay, I don't think the s4 replica going down is related. Compare these two [00:17:13] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s4&var-role=$__all&from=now-6h&to=now&timezone=utc [00:17:17] and https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all&from=now-6h&to=now&timezone=utc [00:17:29] nothing stands out in the rows read/written graphs on the appserver-red-k8s dashboard either [00:17:29] the rest of s4 in eqiad are doing fine [00:17:43] huh okay! 
[00:17:52] rzl: I see ParserCache got hammered though [00:17:56] PROBLEM - MariaDB Replica Lag: s4 #page on db1262 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:17:59] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=$__all&var-role=$__all&from=now-6h&to=now&timezone=utc&viewPanel=panel-7 [00:18:22] so I'm going to depool and downtime this [00:18:36] Amir1: do you mind if ChrisDobbins901_ depools it for the practice? :) [00:18:47] ChrisDobbins901_: I'll walk you through it [00:18:51] sure sure [00:19:05] ok, thanks y'all [00:19:14] let me know when you're done, I'm going to start investigating the host afterwards [00:19:20] (switching back to DMs for that bit, in here we should still figure out that transit spike/pc situation) [00:19:21] ack [00:19:32] *peering spike [00:22:29] (03PS3) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [00:26:22] (03PS1) 10Catrope: Enable Special:AccountRecovery everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202346 (https://phabricator.wikimedia.org/T399742) [00:27:37] (03PS4) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [00:27:38] !log cdobbins@cumin2002 dbctl commit (dc=all): 'Depool db1262', diff saved to https://phabricator.wikimedia.org/P84962 and previous config saved to /var/cache/conftool/dbconfig/20251106-002737-cdobbins.json [00:29:22] !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1262.eqiad.wmnet with reason: HW issues, T409374 [00:29:25] T409374: db1262 is down - https://phabricator.wikimedia.org/T409374 
[00:29:41] ChrisDobbins901_: lgtm [00:29:52] Amir1: all yours, and we got you a tracking task too [00:30:16] thanks, rzl and Amir1 [00:30:21] Thanks! [00:36:42] (03CR) 10CI reject: [V:04-1] Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) (owner: 10Cathal Mooney) [00:38:16] (03PS5) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [00:38:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1202348 [00:38:49] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1202348 (owner: 10TrainBranchBot) [00:41:59] (03PS6) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [00:45:33] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11347591 (10Dzahn) The ssh key is already in the repo since the user has existing shell access with other groups. Only need to add to deployment group. 
[00:48:37] (03PS1) 10Dzahn: admin: make user itamar a deployer [puppet] - 10https://gerrit.wikimedia.org/r/1202350 (https://phabricator.wikimedia.org/T408924) [00:49:24] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [00:49:27] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [00:54:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1202348 (owner: 10TrainBranchBot) [01:00:51] PROBLEM - Host wikikube-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% [01:01:01] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:03:53] RECOVERY - Host wikikube-worker1086 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [01:08:46] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11347622 (10Dzahn) 05Open→03In progress [01:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:09:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1202357 [01:09:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1202357 (owner: 10TrainBranchBot) [01:12:58] (03CR) 10Aaron Schulz: "Changing rb-mw-mangling might be a 
less janky way to do this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202323 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [01:14:33] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 32s) [01:24:05] (03PS1) 10Novem Linguae: let temporary accounts edit enwiki draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) [01:24:32] (03CR) 10Novem Linguae: "Untested. Please review carefully." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [01:33:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:56] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1202357 (owner: 10TrainBranchBot) [01:38:34] (03CR) 10Bartosz Dziewoński: [C:03+1] "Looks right, but make sure to test all 3 cases (anon/temp/named) during deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [01:42:40] FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:43:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:47:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:48:00] :| [01:48:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... [01:48:51] via PacketFabric) {#12243_12334-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [01:52:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:53:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... 
[01:53:51] via PacketFabric) {#12243_12334-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [02:02:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [02:09:05] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:09] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11347743 (10Dzahn) We had some strange results when trying to debug this together. So I ended up testing every combination betw... [02:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11347761 (10Dzahn) @MoritzMuehlenhoff Could you take a look one more time? After debugging some strange issu... 
[02:40:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11347766 (10Dzahn) I tried reimaging 7001 but it stayed the same. DNS is ok both ways (2a02:ec80:700:103:10... [04:06:42] (03PS1) 10Tim Starling: Add English translations to namespaces that lack them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) [04:06:44] (03PS1) 10Tim Starling: Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) [04:07:28] (03CR) 10CI reject: [V:04-1] Add English translations to namespaces that lack them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) (owner: 10Tim Starling) [04:07:37] (03CR) 10CI reject: [V:04-1] Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) (owner: 10Tim Starling) [04:32:05] Sorry for the late handoff [04:32:12] Handoff: tonight was not quiet. db1262 could not be pinged for reasons still unknown; it's been depooled. A.mir1 is going to investigate. There was also a spike in uploads traffic due to a single client that impersonated multiple user-agents; a requestctl rule based on the JA3N header was created and enabled. Also, there will be an email from t.opranks later, but note that LibreNMS alerts are disabled [04:32:12] (see: https://phabricator.wikimedia.org/T409330#11346176; thanks s.ukhe for pointing this out). [04:32:12] Thanks to rzl, A.mir1, s.ukhe, and sw.french-wmf for helping me out! 
[05:01:33] (03PS2) 10Tim Starling: Add English translations to namespaces that lack them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) [05:01:33] (03PS2) 10Tim Starling: Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) [05:08:21] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:25:02] (03PS1) 10MusikAnimal: Hide the WikiEditor search button [extensions/CodeEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202374 [05:32:45] (03CR) 10Tim Starling: [C:03+2] admin: Add FIDO key for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/1201850 (owner: 10Tim Starling) [05:33:21] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202374 (owner: 10MusikAnimal) [05:39:31] 
(03Merged) 10jenkins-bot: Hide the WikiEditor search button [extensions/CodeEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202374 (owner: 10MusikAnimal) [05:40:11] (03PS1) 10Tim Starling: "hide logged in users" is no longer working with "non-JavaScript interface" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202375 (https://phabricator.wikimedia.org/T409157) [05:40:29] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1202374|Hide the WikiEditor search button]] [05:42:40] FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:56] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1202374|Hide the WikiEditor search button]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[05:44:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:57:36] (03PS1) 10KartikMistry: Update Recommnedation API to 2025-11-05-230545-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202376 (https://phabricator.wikimedia.org/T405000) [06:06:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202375 (https://phabricator.wikimedia.org/T409157) (owner: 10Tim Starling) [06:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:42:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:43:21] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:48:59] (03PS1) 10Marostegui: report_users.sh Exclude PUBLIC role [software] - 10https://gerrit.wikimedia.org/r/1202380 [06:55:39] 10ops-eqiad, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): 
db1262 is down - https://phabricator.wikimedia.org/T409374#11348009 (10Marostegui) The host went down, so it is not really a mariadb bug in that sense. This is more likely to be a memory error: ` 2025-11-06T00:05:21.598940+00:00 db1262 ker... [06:55:57] (03CR) 10Marostegui: "This has been tested" [software] - 10https://gerrit.wikimedia.org/r/1202380 (owner: 10Marostegui) [06:55:59] (03CR) 10Marostegui: [C:03+2] report_users.sh Exclude PUBLIC role [software] - 10https://gerrit.wikimedia.org/r/1202380 (owner: 10Marostegui) [06:56:28] (03Merged) 10jenkins-bot: report_users.sh Exclude PUBLIC role [software] - 10https://gerrit.wikimedia.org/r/1202380 (owner: 10Marostegui) [06:58:35] 10ops-eqiad, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11348014 (10Marostegui) For what it's worth this is a super new host T400214 - it's been in production for just over a month [06:58:50] (03PS1) 10Marostegui: db1262: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1202381 (https://phabricator.wikimedia.org/T409374) [06:59:02] 10ops-eqiad, 06DBA, 06DC-Ops, 13Patch-For-Review, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11348017 (10Marostegui) p:05High→03Medium [06:59:28] (03CR) 10Marostegui: [C:03+2] db1262: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1202381 (https://phabricator.wikimedia.org/T409374) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T0700). 
[07:01:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:01:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84964 and previous config saved to /var/cache/conftool/dbconfig/20251106-070128-marostegui.json [07:01:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:01:35] (03PS1) 10Filippo Giunchedi: cloudceph: adjust mtu on cluster interface for single-nic [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) [07:03:53] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) (owner: 10Filippo Giunchedi) [07:10:16] (03PS2) 10Filippo Giunchedi: pontoon: improve UX during create-hosts errors [puppet] - 10https://gerrit.wikimedia.org/r/1202174 [07:10:16] (03PS1) 10Filippo Giunchedi: pontoon: add development instructions and fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1202450 [07:10:35] (03Abandoned) 10Filippo Giunchedi: pontoon: add rolegroup bootstrap to demo [puppet] - 10https://gerrit.wikimedia.org/r/1201551 (owner: 10Filippo Giunchedi) [07:10:40] (03Abandoned) 10Filippo Giunchedi: pontoon: new stack demo [puppet] - 10https://gerrit.wikimedia.org/r/1201550 (owner: 10Filippo Giunchedi) [07:11:09] (03CR) 10CI reject: [V:04-1] pontoon: add development instructions and fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1202450 (owner: 10Filippo Giunchedi) [07:15:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1209 with weight 0 T409299', diff saved to https://phabricator.wikimedia.org/P84965 and previous config saved to 
/var/cache/conftool/dbconfig/20251106-071506-marostegui.json [07:15:10] T409299: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T409299 [07:15:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T409299 [07:15:35] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1202157 (https://phabricator.wikimedia.org/T409299) (owner: 10Gerrit maintenance bot) [07:18:53] !log Starting s8 eqiad failover from db1193 to db1209 - T409299 [07:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1209 to s8 primary T409299', diff saved to https://phabricator.wikimedia.org/P84966 and previous config saved to /var/cache/conftool/dbconfig/20251106-071911-marostegui.json [07:19:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1193 T409299', diff saved to https://phabricator.wikimedia.org/P84967 and previous config saved to /var/cache/conftool/dbconfig/20251106-071949-marostegui.json [07:19:50] (03PS2) 10Filippo Giunchedi: pontoon: add development instructions and fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1202450 [07:21:30] (03PS1) 10Marostegui: db1193: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202564 (https://phabricator.wikimedia.org/T409299) [07:21:33] (03PS2) 10Kosta Harlan: Allow temporary accounts to create in fawiki/enwiki Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [07:22:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84968 and previous config saved to /var/cache/conftool/dbconfig/20251106-072200-marostegui.json [07:22:00] (03CR) 10Marostegui: 
[C:03+2] db1193: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202564 (https://phabricator.wikimedia.org/T409299) (owner: 10Marostegui) [07:22:03] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:22:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1193.eqiad.wmnet with reason: Maintenance [07:22:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1193 - Depool db1193 for migration to mariadb 10.11 [07:22:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1193 - Depool db1193 for migration to mariadb 10.11 [07:22:37] (03CR) 10Filippo Giunchedi: [C:03+2] "non-production, self merging" [puppet] - 10https://gerrit.wikimedia.org/r/1202174 (owner: 10Filippo Giunchedi) [07:22:48] (03CR) 10Filippo Giunchedi: [C:03+2] "non-production, self merging" [puppet] - 10https://gerrit.wikimedia.org/r/1202450 (owner: 10Filippo Giunchedi) [07:23:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11348060 (10Krd) All provided examples appear correct to me. The number for the main queue is always including the numbers for it's sub-queues. [07:23:29] (03CR) 10Kosta Harlan: "I've verified it locally, but yes, would be good to validate during deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [07:23:33] (03CR) 10Kosta Harlan: [C:03+1] Allow temporary accounts to create in fawiki/enwiki Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [07:23:41] !log musikanimal@deploy2002 musikanimal: Continuing with sync [07:28:02] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202374|Hide the WikiEditor search button]] (duration: 107m 34s) [07:29:52] 10ops-codfw, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390 (10phaultfinder) 03NEW [07:30:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1193 gradually with 4 steps - Repooling after upgrade [07:37:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P84970 and previous config saved to /var/cache/conftool/dbconfig/20251106-073707-marostegui.json [07:38:21] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202086 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:42:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 
(https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:42:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:52:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P84972 and previous config saved to /var/cache/conftool/dbconfig/20251106-075215-marostegui.json [07:54:52] 10ops-codfw, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348198 (10phaultfinder) [07:55:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11348199 (10Marostegui) I've started mariadb, but once the memory has been replaced we should just simply reclone this host. [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T0800). [08:00:05] robertsky, TimStarling, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] o/ [08:00:44] hi [08:00:59] robertsky: I'll deploy your patch [08:01:02] hi. am here to unbreak the draft issue. :) [08:01:03] thanks! [08:01:24] on standby to test in a fresh browser. 
[08:01:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [08:01:58] robertsky: thanks, will let you know when it's on mwdebug [08:02:24] I verified it locally but we may as well check it before it syncs out [08:02:43] (03Merged) 10jenkins-bot: Allow temporary accounts to create in fawiki/enwiki Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202359 (https://phabricator.wikimedia.org/T409366) (owner: 10Novem Linguae) [08:03:13] yeah. I have checked on fawiki as well. no dice there too. am amazed that it wasn't caught there earlier? [08:03:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1202359|Allow temporary accounts to create in fawiki/enwiki Draft namespace (T409366)]] [08:03:37] T409366: Temp accounts can't create pages in the 'Draft:' namespace on English Wikipedia - https://phabricator.wikimedia.org/T409366 [08:03:59] it's difficult to get feedback on issues like this from new editors [08:04:13] as they often don't know where to report, or even to know that there's a problem [08:04:32] true true. [08:05:59] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2009.codfw.wmnet [08:06:02] !log kharlan@deploy2002 kharlan, novemlinguae: Backport for [[gerrit:1202359|Allow temporary accounts to create in fawiki/enwiki Draft namespace (T409366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:06:31] robertsky: ok, we can test on mwdebug [08:07:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84974 and previous config saved to /var/cache/conftool/dbconfig/20251106-080723-marostegui.json [08:07:27] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:07:28] tested. [08:07:28] robertsky: https://en.wikipedia.org/wiki/Draft:T409366 that works [08:07:29] https://en.wikipedia.org/wiki/Draft:Test_draft_creation_issue [08:07:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [08:07:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84975 and previous config saved to /var/cache/conftool/dbconfig/20251106-080746-marostegui.json [08:08:10] fawiki side looks good too. [08:08:28] didn't test submit, but i see the submit button is now enabled. [08:08:34] ok, I'll sync it [08:08:50] I confirmed we didn't break things for named accounts either [08:09:12] okie. [08:09:19] https://en.wikipedia.org/wiki/Draft:T409366-named [08:09:19] T409366: Temp accounts can't create pages in the 'Draft:' namespace on English Wikipedia - https://phabricator.wikimedia.org/T409366 [08:09:25] !log kharlan@deploy2002 kharlan, novemlinguae: Continuing with sync [08:10:10] shall i clean up/CSD the test pages? [08:10:32] robertsky: please, thank you [08:12:26] jmm@cumin2002 decommission (PID 4123380) is awaiting input [08:12:30] done. have deleted the test drafts as G2 (test pages). Suppressed talk page notification as well. 
[08:13:41] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202359|Allow temporary accounts to create in fawiki/enwiki Draft namespace (T409366)]] (duration: 10m 07s) [08:14:51] robertsky: thank you! [08:15:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1193 gradually with 4 steps - Repooling after upgrade [08:15:56] no issue in production for temp account. have verified. https://en.wikipedia.org/wiki/Draft:Test_draft_creation [08:16:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:16:45] this is through incognito without the debug extension. [08:20:48] kostajh: o/, are you done with your deploy? [08:20:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:21:01] Yes [08:21:06] ok thanks [08:21:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:21:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:21:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts maps2009.codfw.wmnet [08:21:28] TimStarling: o/ are you around? do you plan to self-deploy your change? [08:21:32] 06SRE, 10decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11348298 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2009.codfw.wmnet` - maps2009.codfw.wmnet (**FAIL**) - Downti... 
[08:21:36] I'm here [08:21:39] (03CR) 10Brouberol: [C:03+2] Define the growthbook-backend domain [dns] - 10https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:21:51] you can roll both out together if you like, I'm ready to test mine [08:22:07] !log brouberol@dns1004 START - running authdns-update [08:22:12] !log brouberol@dns1004 START - running authdns-update [08:22:18] TimStarling: sure doing this now [08:23:02] !log brouberol@dns1004 END - running authdns-update [08:23:24] (03CR) 10Brouberol: trafficserver: rediredct growthbook-backend from public to private domains [puppet] - 10https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:23:46] 06SRE, 10decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11348307 (10MoritzMuehlenhoff) [08:24:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202375 (https://phabricator.wikimedia.org/T409157) (owner: 10Tim Starling) [08:24:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202086 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:24:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:25:25] (03Merged) 10jenkins-bot: cirrus: enable default_sort on en, fr and he wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202086 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:25:28] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348338 
(10MoritzMuehlenhoff) [08:25:31] (03Merged) 10jenkins-bot: cirrus: enable alt index with default_sort on a set of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:25:53] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348350 (10MoritzMuehlenhoff) [08:26:01] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11348351 (10MoritzMuehlenhoff) [08:26:45] (03PS2) 10Brouberol: dse-k8s-eqiad: add the backend domain to the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) [08:28:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84977 and previous config saved to /var/cache/conftool/dbconfig/20251106-082814-marostegui.json [08:28:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:29:08] (03Merged) 10jenkins-bot: "hide logged in users" is no longer working with "non-JavaScript interface" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202375 (https://phabricator.wikimedia.org/T409157) (owner: 10Tim Starling) [08:29:44] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1202375|"hide logged in users" is no longer working with "non-JavaScript interface" (T409157)]], [[gerrit:1202086|cirrus: enable default_sort on en, fr and he wikipedias (T404858)]], [[gerrit:1202094|cirrus: enable alt index with default_sort on a set of wikis (T404858)]] [08:29:49] T409157: "hide logged in users" is no longer working with "non-JavaScript interface" - https://phabricator.wikimedia.org/T409157 [08:29:49] T404858: A/B test using defaultsort with the 
completion suggester - https://phabricator.wikimedia.org/T404858 [08:30:06] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1005.eqiad.wmnet [08:32:11] (03CR) 10Btullis: [C:03+1] trafficserver: rediredct growthbook-backend from public to private domains [puppet] - 10https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:32:14] !log dcausse@deploy2002 dcausse, tstarling: Backport for [[gerrit:1202375|"hide logged in users" is no longer working with "non-JavaScript interface" (T409157)]], [[gerrit:1202086|cirrus: enable default_sort on en, fr and he wikipedias (T404858)]], [[gerrit:1202094|cirrus: enable alt index with default_sort on a set of wikis (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:32:33] (03CR) 10Btullis: [C:03+1] growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:32:53] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: add the backend domain to the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:32:59] testing [08:34:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348392 (10phaultfinder) [08:34:57] didn't realise we're not fully through the train, was testing the wrong wiki [08:35:28] np! 
[08:36:29] ok verified now [08:36:41] ok thanks, shipping [08:36:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:37:02] !log dcausse@deploy2002 dcausse, tstarling: Continuing with sync [08:39:04] (03PS1) 10KartikMistry: machinetranslation: Increase replica and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) [08:39:37] (03PS2) 10KartikMistry: machinetranslation: Increase replica and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) [08:40:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:41:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:41:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1005.eqiad.wmnet [08:41:23] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348397 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1005.eqiad.wmnet` - maps1005.eqiad.wmnet (**PASS**) - Downti... 
[08:41:55] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1006.eqiad.wmnet [08:42:33] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202375|"hide logged in users" is no longer working with "non-JavaScript interface" (T409157)]], [[gerrit:1202086|cirrus: enable default_sort on en, fr and he wikipedias (T404858)]], [[gerrit:1202094|cirrus: enable alt index with default_sort on a set of wikis (T404858)]] (duration: 12m 49s) [08:42:38] T409157: "hide logged in users" is no longer working with "non-JavaScript interface" - https://phabricator.wikimedia.org/T409157 [08:42:38] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:43:06] ok I guess we can now close this deploy window [08:43:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P84978 and previous config saved to /var/cache/conftool/dbconfig/20251106-084322-marostegui.json [08:44:08] !log UTC morning backport window done [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:54] jmm@cumin2002 decommission (PID 4130598) is awaiting input [08:48:12] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: add the backend domain to the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:48:16] (03CR) 10Brouberol: [C:03+2] growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:48:24] (03PS3) 10Brouberol: growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) [08:48:28] (03CR) 10Brouberol: [C:03+2] trafficserver: rediredct 
growthbook-backend from public to private domains [puppet] - 10https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:49:15] (03CR) 10Brouberol: [V:03+2 C:03+2] growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [08:49:32] (03PS2) 10Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 [08:50:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:50:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:51:18] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:51:28] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:51:53] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[08:54:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:55:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:55:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1006.eqiad.wmnet [08:55:45] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348446 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1006.eqiad.wmnet` - maps1006.eqiad.wmnet (**PASS**) - Downti... [08:55:50] (03PS1) 10Brouberol: growthbook: set the right SAN in the backend tls certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202643 (https://phabricator.wikimedia.org/T408903) [08:56:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1007.eqiad.wmnet [08:58:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P84979 and previous config saved to /var/cache/conftool/dbconfig/20251106-085830-marostegui.json [09:03:18] (03CR) 10Brouberol: [C:03+2] growthbook: set the right SAN in the backend tls certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202643 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:04:20] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:04:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:05:05] (03PS1) 10Mszwarc: Revert "Use OutputPageBeforeHTML instead of BeforePageDisplay to add modules" [extensions/Gadgets] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202645 (https://phabricator.wikimedia.org/T409367) [09:05:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:07:16] I'm going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Gadgets/+/1202645 [09:07:18] (03PS1) 10Brouberol: growthbook: avoid having the https:// scheme in the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202646 (https://phabricator.wikimedia.org/T408903) [09:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:46] (03CR) 10Brouberol: [C:03+2] growthbook: avoid having the https:// scheme in the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202646 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:10:05] jmm@cumin2002 decommission (PID 4132652) is awaiting input [09:10:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[09:11:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1007.eqiad.wmnet [09:11:46] (03Merged) 10jenkins-bot: growthbook: avoid having the https:// scheme in the certificate SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202646 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:11:56] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348511 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1007.eqiad.wmnet` - maps1007.eqiad.wmnet (**PASS**) - Downti... [09:12:03] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1008.eqiad.wmnet [09:12:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:12:27] (03PS1) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) [09:12:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:13:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84980 and previous config saved to /var/cache/conftool/dbconfig/20251106-091337-marostegui.json [09:13:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:13:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [09:14:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T407997)', diff saved to https://phabricator.wikimedia.org/P84981 and previous config saved to 
/var/cache/conftool/dbconfig/20251106-091401-marostegui.json [09:14:24] (03CR) 10CI reject: [V:04-1] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:14:59] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202648 (https://phabricator.wikimedia.org/T402389) [09:15:59] jmm@cumin2002 decommission (PID 4137102) is awaiting input [09:16:42] (03PS1) 10Brouberol: Rename the growthbook-backend discovery domain into growthbook-api [dns] - 10https://gerrit.wikimedia.org/r/1202650 (https://phabricator.wikimedia.org/T408903) [09:17:48] (03CR) 10Elukey: "Adding folks from ServiceOps to validate the change, since allocating 64G of memory for a single pod is not usual and it may lead to some " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [09:18:05] (03CR) 10Btullis: [C:03+1] Rename the growthbook-backend discovery domain into growthbook-api [dns] - 10https://gerrit.wikimedia.org/r/1202650 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:18:50] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [09:19:24] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[09:19:40] (03CR) 10Brouberol: [C:03+2] Rename the growthbook-backend discovery domain into growthbook-api [dns] - 10https://gerrit.wikimedia.org/r/1202650 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:19:55] !log brouberol@dns1004 START - running authdns-update [09:20:34] !log upgrade python3-conftool and spicerack on cumin1003 and cumin2002 hosts [09:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:46] !log brouberol@dns1004 END - running authdns-update [09:20:56] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:22:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/Gadgets] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202645 (https://phabricator.wikimedia.org/T409367) (owner: 10Mszwarc) [09:22:30] (03PS1) 10Brouberol: Rename the growthbook-backend discovery domain into growthbook-api [dns] - 10https://gerrit.wikimedia.org/r/1202652 (https://phabricator.wikimedia.org/T408903) [09:23:22] (03CR) 10Brouberol: [C:03+2] Rename the growthbook-backend discovery domain into growthbook-api [dns] - 10https://gerrit.wikimedia.org/r/1202652 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [09:23:30] !log brouberol@dns1004 START - running authdns-update [09:23:32] (03Merged) 10jenkins-bot: Revert "Use OutputPageBeforeHTML instead of BeforePageDisplay to add modules" [extensions/Gadgets] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202645 (https://phabricator.wikimedia.org/T409367) (owner: 10Mszwarc) [09:24:04] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1202645|Revert "Use OutputPageBeforeHTML instead of BeforePageDisplay to add modules" (T409367)]] [09:24:07] T409367: Gadgets not loaded when wikitext editing - https://phabricator.wikimedia.org/T409367 [09:24:23] !log brouberol@dns1004 END - running authdns-update [09:24:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:26:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:26:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:26:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1008.eqiad.wmnet [09:26:26] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348567 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1008.eqiad.wmnet` - maps1008.eqiad.wmnet (**PASS**) - Downti... [09:26:46] (03PS2) 10Dpogorzelski: knative-serving: add podspec features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) [09:26:59] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1010.eqiad.wmnet [09:27:06] (03CR) 10STran: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202648 (https://phabricator.wikimedia.org/T402389) (owner: 10STran) [09:27:27] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1202645|Revert "Use OutputPageBeforeHTML instead of BeforePageDisplay to add modules" (T409367)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:28:36] !log mszwarc@deploy2002 mszwarc: Continuing with sync [09:29:17] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202648 (https://phabricator.wikimedia.org/T402389) (owner: 10STran) [09:32:56] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202645|Revert "Use OutputPageBeforeHTML instead of BeforePageDisplay to add modules" (T409367)]] (duration: 08m 52s) [09:32:59] T409367: Gadgets not loaded when wikitext editing - https://phabricator.wikimedia.org/T409367 [09:33:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:34:05] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T407997)', diff saved to https://phabricator.wikimedia.org/P84982 and previous config saved to /var/cache/conftool/dbconfig/20251106-093406-marostegui.json [09:34:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:37:01] (03PS1) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [09:37:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:38:54] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Kanban Board), and 2 others: Record api-user-agent in metrics; filter by MediaWikiJs - https://phabricator.wikimedia.org/T402385#11348630 (10Mvolz) [09:39:26] (03CR) 
10CI reject: [V:04-1] rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:39:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:39:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348636 (10phaultfinder) [09:39:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:39:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1010.eqiad.wmnet [09:39:57] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348637 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1010.eqiad.wmnet` - maps1010.eqiad.wmnet (**PASS**) - Downti... 
[09:40:41] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps1009.eqiad.wmnet [09:46:12] (03PS2) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [09:47:24] 06SRE, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348673 (10MoritzMuehlenhoff) [09:48:08] (03CR) 10CI reject: [V:04-1] rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:48:18] (03PS1) 10Daniel Kinzler: rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) [09:49:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P84983 and previous config saved to /var/cache/conftool/dbconfig/20251106-094914-marostegui.json [09:49:38] (03PS1) 10Zabe: Update for new WikimediaMaintenance script locations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202659 [09:50:07] (03PS23) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [09:50:24] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:50:25] (03CR) 10CI reject: [V:04-1] rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:50:30] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:50:33] !log jmm@cumin2002 START - Cookbook 
sre.dns.netbox [09:50:43] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:51:13] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:51:45] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:51:49] (03CR) 10Daniel Kinzler: api-gateway: Make x-ratelimit response header configurable. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [09:52:17] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:52:28] (03CR) 10Daniel Kinzler: [C:04-1] "CR-1 per Claime's comment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:21] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [09:53:49] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [09:54:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:54:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:54:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:23] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts maps1009.eqiad.wmnet [09:54:27] SRE, decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348678 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps1009.eqiad.wmnet` - maps1009.eqiad.wmnet (**PASS**) - Downti... [09:56:20] (CR) Hnowlan: [C:+1] admin: make user itamar a deployer [puppet] - https://gerrit.wikimedia.org/r/1202350 (https://phabricator.wikimedia.org/T408924) (owner: Dzahn) [09:56:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db1176.eqiad.wmnet onto db2230.codfw.wmnet [09:56:52] /win 24 [09:57:12] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1176.eqiad.wmnet onto db2230.codfw.wmnet [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:17] (CR) Muehlenhoff: [C:+2] Remove old maps nodes from site.pp and Hiera [puppet] - https://gerrit.wikimedia.org/r/1202176 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [09:59:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db1176.eqiad.wmnet onto db2230.codfw.wmnet [09:59:55] (CR) Hnowlan: [C:+2] admin: make user itamar a deployer [puppet] - https://gerrit.wikimedia.org/r/1202350 (https://phabricator.wikimedia.org/T408924) (owner: Dzahn) [10:00:44] (PS4) Blake: admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) [10:01:23] (PS1) Btullis: Register growthbook for OIDC authentication [puppet] - 
https://gerrit.wikimedia.org/r/1202660 (https://phabricator.wikimedia.org/T409183) [10:01:39] (PS2) Muehlenhoff: osm_master: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) [10:02:27] (PS3) Muehlenhoff: osm_master: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) [10:04:00] (PS1) Phuedx: EventStreamConfig: Remove mediawiki.wikistories_* streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) [10:04:18] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409 (Arian_Bozorg) NEW [10:04:18] !log brouberol@dns1004 START - running authdns-update [10:04:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P84984 and previous config saved to /var/cache/conftool/dbconfig/20251106-100421-marostegui.json [10:05:07] !log brouberol@dns1004 END - running authdns-update [10:05:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [10:05:34] (PS1) Btullis: Add a summy secret for growthbook OIDC [labs/private] - https://gerrit.wikimedia.org/r/1202662 (https://phabricator.wikimedia.org/T409183) [10:05:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [10:05:51] (CR) 
Btullis: [V:+2 C:+2] Add a summy secret for growthbook OIDC [labs/private] - https://gerrit.wikimedia.org/r/1202662 (https://phabricator.wikimedia.org/T409183) (owner: Btullis) [10:06:38] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7567/console" [puppet] - https://gerrit.wikimedia.org/r/1202660 (https://phabricator.wikimedia.org/T409183) (owner: Btullis) [10:06:44] (CR) Elukey: [C:+1] "I am very ignorant about the logic bits but overall it looks good." [software/netbox-extras] - https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) (owner: Cathal Mooney) [10:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:40] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [10:12:25] (PS1) Marostegui: report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 [10:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348780 (phaultfinder) [10:15:45] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1176.eqiad.wmnet onto db2230.codfw.wmnet [10:16:26] (CR) Brouberol: [C:+1] Register growthbook for OIDC authentication 
[puppet] - https://gerrit.wikimedia.org/r/1202660 (https://phabricator.wikimedia.org/T409183) (owner: Btullis) [10:18:23] (CR) Btullis: [V:+1 C:+2] Register growthbook for OIDC authentication [puppet] - https://gerrit.wikimedia.org/r/1202660 (https://phabricator.wikimedia.org/T409183) (owner: Btullis) [10:18:25] (CR) Jcrespo: [C:+1] "Looks good to me, although consider adding MAILTO=... to avoid forgetting changing the to address if it changes in the future." [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [10:19:13] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11348797 (hnowlan) Merged! I see `deployment` in the user groups for itamar now. [10:19:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11348798 (hnowlan) In progress→Resolved [10:19:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T407997)', diff saved to https://phabricator.wikimedia.org/P84985 and previous config saved to /var/cache/conftool/dbconfig/20251106-101929-marostegui.json [10:19:33] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:19:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance [10:19:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84986 and previous config saved to /var/cache/conftool/dbconfig/20251106-101954-marostegui.json [10:20:29] (PS2) Marostegui: report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 [10:20:36] (CR) Marostegui: "Good idea!" 
[software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [10:20:39] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:21:21] (PS3) Marostegui: report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 [10:22:31] (PS1) Muehlenhoff: osm_replica: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) [10:23:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:54] (CR) Jcrespo: [C:+1] "it would be nice to test it once after deploy, to check it works" [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [10:26:23] (CR) Fabfur: [C:+1] "great work!" 
[puppet] - https://gerrit.wikimedia.org/r/1202306 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [10:26:53] SRE, decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11348822 (MoritzMuehlenhoff) [10:27:23] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11348823 (MoritzMuehlenhoff) [10:28:19] SRE, decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348828 (MoritzMuehlenhoff) [10:28:49] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11348829 (MoritzMuehlenhoff) [10:29:21] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11348831 (MoritzMuehlenhoff) [10:29:49] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [10:30:39] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:31:14] (CR) Dpogorzelski: knative-serving: add podspec features (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [10:35:17] (CR) Marostegui: "Yeah, I am going to test before merge" [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [10:39:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348857 (phaultfinder) [10:43:05] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84988 and previous config saved to /var/cache/conftool/dbconfig/20251106-104304-marostegui.json [10:43:08] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:43:38] (PS3) Muehlenhoff: osm_replica: Fix Hiera variable [puppet] - https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) [10:45:22] (PS1) Dpogorzelski: ml-services: add aya-llm [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) [10:47:17] (PS4) Muehlenhoff: osm_replica: Fix Hiera variable [puppet] - https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) [10:54:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11348910 (phaultfinder) [10:58:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P84989 and previous config saved to /var/cache/conftool/dbconfig/20251106-105812-marostegui.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1100) [11:06:38] (CR) Hnowlan: [C:+1] admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) (owner: Blake) [11:07:05] (CR) Elukey: ml-services: add aya-llm (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [11:10:29] PROBLEM - Confd vcl based reload on cp7008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:10:29] PROBLEM - Confd vcl based reload on cp7005 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:29] PROBLEM - Confd vcl based reload on cp7002 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:31] PROBLEM - Confd vcl based reload on cp2033 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:31] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:31] PROBLEM - Confd vcl based reload on cp2027 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:31] PROBLEM - Confd vcl based reload on cp2031 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:31] PROBLEM - Confd vcl based reload on cp2041 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:37] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5021 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5017 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5023 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5024 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5020 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:41] PROBLEM - Confd vcl based reload on cp5022 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:42] PROBLEM - Confd vcl based reload on cp5018 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:42] PROBLEM - Confd vcl based reload on cp5019 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:51] PROBLEM - Confd vcl based reload on cp1100 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:51] PROBLEM - Confd vcl based reload on cp1102 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:51] PROBLEM - Confd vcl based reload on cp1104 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:53] PROBLEM - Confd vcl based reload on cp1112 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:10:53] PROBLEM - Confd vcl based reload on cp1114 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:11] PROBLEM - Confd vcl based reload on cp1106 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:11] PROBLEM - Confd vcl based reload on cp1108 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:11] PROBLEM - Confd vcl based reload on cp1110 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4044 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4043 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4037 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4040 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:15] PROBLEM - Confd vcl based reload on cp4038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:16] PROBLEM - Confd vcl based reload on cp4041 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:16] PROBLEM - Confd vcl based reload on cp4042 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:17] PROBLEM - Confd vcl based reload on cp3067 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:21] PROBLEM - Confd vcl based reload on cp3066 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:21] PROBLEM - Confd vcl based reload on cp3069 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:21] PROBLEM - Confd vcl based reload on cp3068 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:11:23] PROBLEM - Confd vcl based reload on cp3070 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:23] PROBLEM - Confd vcl based reload on cp3073 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp3072 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp3071 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:25] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:26] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:27] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:27] PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [11:11:28] PROBLEM - Confd vcl based reload on cp2039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:28] PROBLEM - Confd vcl based reload on cp2029 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:29] PROBLEM - Confd vcl based reload on cp7004 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:29] PROBLEM - Confd vcl based reload on cp7007 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:29] PROBLEM - Confd vcl based reload on cp7001 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:30] PROBLEM - Confd vcl based reload on cp7006 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:11:31] PROBLEM - Confd vcl based reload on cp7003 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [11:13:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P84990 and previous config saved to /var/cache/conftool/dbconfig/20251106-111319-marostegui.json [11:20:11] RECOVERY - Confd vcl based reload on cp1106 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:11] RECOVERY - Confd vcl based reload on cp1108 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:11] RECOVERY - Confd vcl based reload on cp1110 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:15] RECOVERY - Confd vcl based reload on cp4037 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:20:15] RECOVERY - Confd vcl based reload on cp4044 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:15] RECOVERY - Confd vcl based reload on cp4043 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:15] RECOVERY - Confd vcl based reload on cp4040 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:15] RECOVERY - Confd vcl based reload on cp4041 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:16] RECOVERY - Confd vcl based reload on cp4038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:16] RECOVERY - Confd vcl based reload on cp4042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:16] RECOVERY - Confd vcl based reload on cp4039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:17] RECOVERY - Confd vcl based reload on cp3067 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:21] RECOVERY - Confd vcl based reload on cp3069 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:21] RECOVERY - Confd vcl based reload on cp3066 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:21] RECOVERY - Confd vcl based reload on cp3068 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:23] RECOVERY - Confd vcl based reload on cp3070 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:23] RECOVERY - Confd vcl based reload on cp3073 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:20:23] RECOVERY - Confd vcl based reload on cp3072 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp3071 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:25] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:27] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:27] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:27] RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:27] RECOVERY - Confd vcl based reload on cp2039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:28] RECOVERY - Confd vcl based reload on cp2029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:29] RECOVERY - Confd vcl based reload on cp7005 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:20:29] RECOVERY - Confd vcl based reload on cp7007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:29] RECOVERY - Confd vcl based reload on cp7004 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:30] RECOVERY - Confd vcl based reload on cp7008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:30] RECOVERY - Confd vcl based reload on cp7006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:30] RECOVERY - Confd vcl based reload on cp7002 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:31] RECOVERY - Confd vcl based reload on cp7001 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:32] RECOVERY - Confd vcl based reload on cp7003 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:33] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:33] RECOVERY - Confd vcl based reload on cp2033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:33] RECOVERY - Confd vcl based reload on cp2027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:34] RECOVERY - Confd vcl based reload on cp2031 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:34] RECOVERY - Confd vcl based reload on cp2041 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:37] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5021 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5017 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5023 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5020 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5018 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5024 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:41] RECOVERY - Confd vcl based reload on cp5019 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:42] RECOVERY - Confd vcl based reload on cp5022 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:51] RECOVERY - Confd vcl based reload on cp1102 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:51] RECOVERY - Confd vcl based reload on cp1104 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:51] RECOVERY - Confd vcl based reload on cp1100 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:53] RECOVERY - Confd vcl based reload on cp1112 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:20:53] RECOVERY - Confd vcl based reload on cp1114 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:24:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349012 (phaultfinder) [11:28:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84992 and previous config saved to /var/cache/conftool/dbconfig/20251106-112827-marostegui.json [11:28:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:28:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance [11:29:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:29:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T407997)', diff saved to https://phabricator.wikimedia.org/P84993 and previous config saved to /var/cache/conftool/dbconfig/20251106-112910-marostegui.json [11:30:10] (CR) Cathal Mooney: [C:+2] Refactor of move_server and script to move selective hosts (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) (owner: Cathal Mooney) [11:30:35] (PS5) Blake: admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) [11:32:58] (Merged) jenkins-bot: Refactor of move_server and script to move selective hosts [software/netbox-extras] - https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) (owner: Cathal Mooney) [11:33:02] (PS6) Blake: admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 
(https://phabricator.wikimedia.org/T409166) [11:34:08] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar), WMF-NDA: Change Gitiles caching config - https://phabricator.wikimedia.org/T409422 (LSobanski) NEW [11:34:51] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349081 (phaultfinder) [11:36:18] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1006.eqiad.wmnet [11:36:27] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:37:00] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:37:22] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:37:54] cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists vewikimedia; (T297297) [11:37:55] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [11:38:13] (PS4) Marostegui: report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 [11:38:31] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:38:45] (CR) Marostegui: "Final tested version" [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [11:39:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:34] !log cmooney@cumin1003 START - Cookbook sre.hosts.dhcp for host sretest1005.eqiad.wmnet [11:40:52] (PS1) Giuseppe Lavagetto: wikimedia-frontend: add variable to use for rate-limiting
[puppet] - https://gerrit.wikimedia.org/r/1202677 (https://phabricator.wikimedia.org/T406555) [11:42:38] cmooney@cumin1003 dhcp (PID 144583) is awaiting input [11:45:58] (PS1) Federico Ceratto: sre.mysql.clone: Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202679 [11:46:10] (CR) Hnowlan: [C:+1] admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) (owner: Blake) [11:49:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T407997)', diff saved to https://phabricator.wikimedia.org/P84994 and previous config saved to /var/cache/conftool/dbconfig/20251106-114921-marostegui.json [11:49:25] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:53:33] (CR) Jcrespo: [C:+1] report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [11:53:41] (CR) Marostegui: [C:+1] "As I already +1 and the change was merged, removing myself as it keeps popping into the "your turn" list in Gerrit!"
[cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto) [11:53:56] (CR) Marostegui: [C:+2] report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [11:54:26] (Merged) jenkins-bot: report_users.sh: Add check for public role [software] - https://gerrit.wikimedia.org/r/1202663 (owner: Marostegui) [11:55:28] (CR) CI reject: [V:-1] sre.mysql.clone: Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202679 (owner: Federico Ceratto) [11:56:43] (CR) Vgutierrez: [V:+2 C:+1] "varnishtests are happy for both text & upload" [puppet] - https://gerrit.wikimedia.org/r/1202677 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto) [12:04:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P84995 and previous config saved to /var/cache/conftool/dbconfig/20251106-120429-marostegui.json [12:06:35] (CR) Clément Goubert: [C:+2] admin: Adding blake to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1201596 (https://phabricator.wikimedia.org/T409166) (owner: Blake) [12:09:43] (PS2) Dpogorzelski: ml-services: add aya-llm [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) [12:09:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349325 (phaultfinder) [12:11:40] (PS2) Giuseppe Lavagetto: wikimedia-frontend: add variable to use for rate-limiting [puppet] - https://gerrit.wikimedia.org/r/1202677 (https://phabricator.wikimedia.org/T406555) [12:11:42] (CR) Giuseppe Lavagetto: wikimedia-frontend: add variable to use for rate-limiting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1202677
(https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto) [12:18:19] (CR) Bartosz Wójtowicz: [C:+2] inference-services: Add revise-tone-task-generator experimental deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz) [12:19:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P84996 and previous config saved to /var/cache/conftool/dbconfig/20251106-121937-marostegui.json [12:20:01] (Merged) jenkins-bot: inference-services: Add revise-tone-task-generator experimental deployment. [deployment-charts] - https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz) [12:21:21] (PS1) Muehlenhoff: Remove legacy maps roles [puppet] - https://gerrit.wikimedia.org/r/1202686 (https://phabricator.wikimedia.org/T381565) [12:22:38] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to ops-limited for blake - https://phabricator.wikimedia.org/T409166#11349353 (hnowlan) Open→Resolved a:hnowlan [12:25:46] (PS1) Muehlenhoff: Remove kartotherian-admin group [puppet] - https://gerrit.wikimedia.org/r/1202687 (https://phabricator.wikimedia.org/T381565) [12:26:40] (PS2) Federico Ceratto: sre.mysql.clone: Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202679 [12:30:04] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net.
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1202688 (owner: L10n-bot) [12:30:31] (PS1) Muehlenhoff: Remove maps-admins group [puppet] - https://gerrit.wikimedia.org/r/1202689 (https://phabricator.wikimedia.org/T381565) [12:33:22] (CR) CI reject: [V:-1] sre.mysql.clone: Pool in source host ASAP [cookbooks] - https://gerrit.wikimedia.org/r/1202679 (owner: Federico Ceratto) [12:34:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T407997)', diff saved to https://phabricator.wikimedia.org/P84998 and previous config saved to /var/cache/conftool/dbconfig/20251106-123444-marostegui.json [12:34:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:35:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance [12:35:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T407997)', diff saved to https://phabricator.wikimedia.org/P84999 and previous config saved to /var/cache/conftool/dbconfig/20251106-123507-marostegui.json [12:36:19] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:40:35] (CR) Dpogorzelski: ml-services: add aya-llm (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [12:41:09] (PS1) Muehlenhoff: ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) [12:42:43] (CR) CI reject: [V:-1] ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [12:44:54] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP groups for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11349424 (Virginie.caplet) It seems to be working fine! Thank you so much! [12:44:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349425 (phaultfinder) [12:51:12] (CR) Mark Bergsma: [C:+1] Remove otto from ops group [puppet] - https://gerrit.wikimedia.org/r/1202114 (owner: Muehlenhoff) [12:54:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T407997)', diff saved to https://phabricator.wikimedia.org/P85000 and previous config saved to /var/cache/conftool/dbconfig/20251106-125427-marostegui.json [12:54:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1300) [13:04:03] (PS1) Muehlenhoff: Add Kavitha as second approver for ops and ops-limited [puppet] - https://gerrit.wikimedia.org/r/1202700 [13:04:32] (PS1) Superpes15: [tcywiki] Add Portal and Draft namespaces and its talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) [13:08:48] !log
fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:07] (CR) Muehlenhoff: [C:+2] Add Kavitha as second approver for ops and ops-limited [puppet] - https://gerrit.wikimedia.org/r/1202700 (owner: Muehlenhoff) [13:09:11] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2152 - Upgrading db2152.codfw.wmnet [13:09:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P85001 and previous config saved to /var/cache/conftool/dbconfig/20251106-130934-marostegui.json [13:09:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349498 (phaultfinder) [13:11:47] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11349499 (hnowlan) [13:13:26] (CR) Elukey: "Everything looks good, one last nit - using values.yaml forces the new InferenceService resource to be available for deployments on all cl" [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [13:13:52] (PS1) Federico Ceratto: db2152: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202703 (https://phabricator.wikimedia.org/T406008) [13:14:12] (PS2) Federico Ceratto: db2152: Migration to MariaDB 10.11 [puppet] -
https://gerrit.wikimedia.org/r/1202703 (https://phabricator.wikimedia.org/T406008) [13:14:13] fceratto@cumin1003 major-upgrade (PID 236765) is awaiting input [13:15:30] (PS2) Muehlenhoff: Remove otto from ops group [puppet] - https://gerrit.wikimedia.org/r/1202114 [13:16:31] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11349507 (hnowlan) Hi Arian, thanks for the ticket - could you let us know what username you would like for your account? Usually we'd go with somethi... [13:16:49] (CR) Marostegui: [C:+1] db2152: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202703 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto) [13:17:28] (PS1) Hnowlan: admin: add abozorg-wmde to analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) [13:18:09] (PS3) Muehlenhoff: Remove otto from ops group [puppet] - https://gerrit.wikimedia.org/r/1202114 [13:19:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:19:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:20:00] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11349514 (hnowlan) Open→In progress [13:20:20] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:20:52] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:20:54] (CR) Sbisson: [C:+1] "Removal of language pairs config looks fine" [deployment-charts] - https://gerrit.wikimedia.org/r/1202376 (https://phabricator.wikimedia.org/T405000) (owner: KartikMistry) [13:22:49] (PS3) Dpogorzelski: ml-services: add aya-llm [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) [13:23:13] (PS2) KartikMistry: Update Recommnedation API to 2025-11-05-230545-production [deployment-charts] - https://gerrit.wikimedia.org/r/1202376 (https://phabricator.wikimedia.org/T405000) [13:23:59] (CR) Dpogorzelski: ml-services: add aya-llm (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [13:24:20] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11349525 (taavi) >>! In T409409#11349506, @hnowlan wrote: > Hi Arian, thanks for the ticket - could you let us know what usernam...
[13:24:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P85002 and previous config saved to /var/cache/conftool/dbconfig/20251106-132442-marostegui.json [13:24:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349530 (phaultfinder) [13:25:07] (PS2) Muehlenhoff: ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) [13:26:36] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2152 - Upgrading db2152.codfw.wmnet [13:26:47] (CR) CI reject: [V:-1] ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [13:27:23] (CR) A smart kitten: "I believe the CI failure is because of needing to run `composer manage-dblist update`."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [13:28:18] (CR) Federico Ceratto: [C:+2] db2152: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202703 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto) [13:29:17] (CR) KartikMistry: [C:+2] Update Recommnedation API to 2025-11-05-230545-production [deployment-charts] - https://gerrit.wikimedia.org/r/1202376 (https://phabricator.wikimedia.org/T405000) (owner: KartikMistry) [13:30:05] (CR) Lucas Werkmeister (WMDE): EventStreamConfig: Remove mediawiki.wikistories_* streams (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [13:31:03] (Merged) jenkins-bot: Update Recommnedation API to 2025-11-05-230545-production [deployment-charts] - https://gerrit.wikimedia.org/r/1202376 (https://phabricator.wikimedia.org/T405000) (owner: KartikMistry) [13:31:19] (PS1) Tchanders: LQT Import: Fix quadratic time explosion in finding next offset [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) [13:32:02] Starting recommendation API deployment..
[13:33:03] (PS2) Phuedx: EventStreamConfig: Remove mediawiki.wikistories_* streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) [13:33:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:35:01] (CR) Phuedx: EventStreamConfig: Remove mediawiki.wikistories_* streams (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [13:36:31] (CR) Lucas Werkmeister (WMDE): [tcywiki] Add Portal and Draft namespaces and its talk (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: Superpes15) [13:36:34] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:38:46] (CR) Lucas Werkmeister (WMDE): [tcywiki] Add Portal and Draft namespaces and its talk (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: Superpes15) [13:39:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T407997)', diff saved to https://phabricator.wikimedia.org/P85004 and previous config saved to /var/cache/conftool/dbconfig/20251106-133949-marostegui.json [13:39:53] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:39:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349584 (phaultfinder) [13:40:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [13:40:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T407997)', diff saved to https://phabricator.wikimedia.org/P85005 and previous config saved to /var/cache/conftool/dbconfig/20251106-134013-marostegui.json [13:41:46] (CR) Abijeet Patro: "the first line in the file says:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [13:41:50] (CR) CI reject: [V:-1] LQT Import: Fix quadratic time explosion in finding next offset [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:43:19] (PS3) Lucas Werkmeister (WMDE): EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [13:43:42] (CR) Vgutierrez: [C:+1] wikimedia-frontend: add variable to use for rate-limiting
[puppet] - https://gerrit.wikimedia.org/r/1202677 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto) [13:43:56] (CR) Lucas Werkmeister (WMDE): "Rebased (patch Ie346086c42 had touched the removed block in the meantime). Hopefully a straightforward conflict resolution, but still, ple" [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [13:44:29] (CR) Phuedx: [C:+1] EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [13:44:50] (CR) Giuseppe Lavagetto: [C:+2] wikimedia-frontend: add variable to use for rate-limiting [puppet] - https://gerrit.wikimedia.org/r/1202677 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto) [13:45:37] (CR) Lucas Werkmeister (WMDE): "Phan is complaining about… a file that wasn’t touched in this change?
o_O" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:46:55] (CR) Elukey: [C:+1] Remove legacy maps roles [puppet] - https://gerrit.wikimedia.org/r/1202686 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [13:47:00] (CR) Zabe: "phan failure is fixed by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1201818" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:47:02] (CR) Lucas Werkmeister (WMDE): LQT Import: Fix quadratic time explosion in finding next offset (1 comment) [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:47:14] (CR) Elukey: [C:+1] Remove kartotherian-admin group [puppet] - https://gerrit.wikimedia.org/r/1202687 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [13:47:21] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2152 gradually with 4 steps - Migration of db2152.codfw.wmnet completed [13:47:37] (CR) Lucas Werkmeister (WMDE): "Ack, then let’s backport that too?"
[extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:48:05] (CR) Elukey: [C:+1] Remove maps-admins group [puppet] - https://gerrit.wikimedia.org/r/1202689 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [13:49:21] (CR) Zabe: "imo the easiest solution and there is basically no risk" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [13:49:22] (CR) Elukey: [C:+1] ml-services: add aya-llm [deployment-charts] - https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: Dpogorzelski) [13:49:27] (PS1) Cathal Mooney: move_server: filter out spine switches when getting rack switches [software/netbox-extras] - https://gerrit.wikimedia.org/r/1202716 [13:49:50] (PS3) Muehlenhoff: ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) [13:52:05] (CR) CI reject: [V:-1] move_server: filter out spine switches when getting rack switches [software/netbox-extras] - https://gerrit.wikimedia.org/r/1202716 (owner: Cathal Mooney) [13:52:37] (CR) A smart kitten: "I guess the files want each addition/removal to be made individually via (e.g.)
`composer manage-dblist del arbcom_cswiki specialcontribut" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [13:53:10] (CR) Lucas Werkmeister (WMDE): EventStreamConfig: Remove mediawiki.wikistories_* streams (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [13:53:50] (PS1) Lucas Werkmeister (WMDE): Update types for WatchArticleHook/UnwatchArticleHook [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202717 [13:53:59] (PS2) Tchanders: LQT Import: Fix quadratic time explosion in finding next offset [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) [13:54:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202717 (owner: Lucas Werkmeister (WMDE)) [13:57:23] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1400). [14:00:05] phuedx, Superpes, and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:06] !log move public1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr2-eqiad (T405579) [14:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:10] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [14:00:19] o/ [14:00:28] I can deploy! (with Sean next to me ^^) [14:00:40] o/ here for the LQT patch with Tchanders [14:00:42] o/ [14:01:25] \o [14:01:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T407997)', diff saved to https://phabricator.wikimedia.org/P85008 and previous config saved to /var/cache/conftool/dbconfig/20251106-140127-marostegui.json [14:01:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:02:09] let’s start with the config changes by phuedx [14:02:33] Lucas_WMDE: what's up with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1202717 ? 
[14:02:47] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [14:02:47] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [14:03:09] (PS1) KartikMistry: Revert "Update Recommnedation API to 2025-11-05-230545-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1202719 [14:03:15] edsanders: that was necessary to make Phan happy [14:03:17] oh I see, a phan failure we need to backport as well [14:03:23] you can see the failed build at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1202709/ [14:03:25] yeah [14:03:33] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 2 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:03:37] can be deployed together [14:03:38] (Merged) jenkins-bot: EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) (owner: Phuedx) [14:03:47] (Merged) jenkins-bot: EventStreamConfig: Remove mediawiki.wikistories_* streams [mediawiki-config] - https://gerrit.wikimedia.org/r/1202661 (https://phabricator.wikimedia.org/T408178) (owner: Phuedx) [14:04:21] Superpes: I left a comment on your change btw, in case you didn’t see it yet [14:04:26] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197659|EventStreamConfig: Remove mediawiki.reference_previews stream (T242127)]], [[gerrit:1202661|EventStreamConfig: Remove mediawiki.wikistories_* streams (T408178)]] [14:04:32] T242127: Remove Reference Previews tracking metrics -
https://phabricator.wikimedia.org/T242127 [14:04:32] T408178: Decommission the Wikistories instruments - https://phabricator.wikimedia.org/T408178 [14:06:18] (CR) KartikMistry: [C:+2] Revert "Update Recommnedation API to 2025-11-05-230545-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1202719 (owner: KartikMistry) [14:06:19] I’ll start the gate-and-submit builds for the backports already [14:06:32] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit build ahead of deployment" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202717 (owner: Lucas Werkmeister (WMDE)) [14:06:35] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit build ahead of deployment" [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [14:07:04] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, phuedx: Backport for [[gerrit:1197659|EventStreamConfig: Remove mediawiki.reference_previews stream (T242127)]], [[gerrit:1202661|EventStreamConfig: Remove mediawiki.wikistories_* streams (T408178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:07:15] (CR) Muehlenhoff: CAS version 7.2.7 (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede) [14:07:25] phuedx: anything to test here?
[14:07:40] Lucas_WMDE: One sec [14:07:51] (CR) Reedy: [C:+1] Update for new WikimediaMaintenance script locations [mediawiki-config] - https://gerrit.wikimedia.org/r/1202659 (owner: Zabe) [14:07:51] !log move public1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr1-eqiad (T405579) [14:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [14:08:08] (Merged) jenkins-bot: Revert "Update Recommnedation API to 2025-11-05-230545-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1202719 (owner: KartikMistry) [14:08:24] (CR) Muehlenhoff: [C:+2] Remove legacy maps roles [puppet] - https://gerrit.wikimedia.org/r/1202686 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [14:08:44] (PS5) Slyngshede: CAS version 7.2.7 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) [14:08:56] (CR) Slyngshede: CAS version 7.2.7 (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede) [14:08:56] (Merged) jenkins-bot: Update types for WatchArticleHook/UnwatchArticleHook [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202717 (owner: Lucas Werkmeister (WMDE)) [14:09:02] huh. that was fast [14:09:07] (Merged) jenkins-bot: LQT Import: Fix quadratic time explosion in finding next offset [extensions/Flow] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1202709 (https://phabricator.wikimedia.org/T405080) (owner: Tchanders) [14:09:07] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:09:16] Lucas_WMDE: LGTM [14:09:18] ok!
[14:09:23] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, phuedx: Continuing with sync [14:09:26] Lucas_WMDE Uhm that's weird! Looking [14:09:33] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:09:39] No errors in the console. Streams have disappeared from action=streamconfigs on metawiki [14:09:41] and apparently those backports already went through. I forgot we don’t run selenium in gate-and-submit on wmf branches [14:09:51] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:09:52] (bit confusing that we still do it in the test build, so test is slower than gate-and-submit…) [14:10:03] I may have found more streams to tidy up though. There might be another patch at the end of the window ;) [14:10:09] cool ^^ [14:11:24] (03CR) 10Muehlenhoff: [C:03+2] Remove kartotherian-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1202687 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:12:09] (03CR) 10Elukey: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202716 (owner: 10Cathal Mooney) [14:13:24] (03PS2) 10Superpes15: [tcywiki] Add Portal and Draft namespaces and its talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) [14:13:39] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197659|EventStreamConfig: Remove mediawiki.reference_previews stream (T242127)]], [[gerrit:1202661|EventStreamConfig: Remove mediawiki.wikistories_* streams (T408178)]] (duration: 09m 12s) [14:13:42] (03PS1) 10Sergio Gimeno: [beta] GrowthExperiments: add revise-tone experiment setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202721 (https://phabricator.wikimedia.org/T402707) [14:13:43] T242127: Remove Reference Previews 
tracking metrics - https://phabricator.wikimedia.org/T242127 [14:13:44] T408178: Decommission the Wikistories instruments - https://phabricator.wikimedia.org/T408178 [14:13:59] (03CR) 10Superpes15: "Seems that Atom added this character! Thanks should be solved now :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: 10Superpes15) [14:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:22] (03PS3) 10TChin: [eventgate] Split alerts into global and per-site alerts [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) [14:14:57] (03CR) 10Superpes15: [tcywiki] Add Portal and Draft namespaces and its talk (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: 10Superpes15) [14:14:59] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [tcywiki] Add Portal and Draft namespaces and its talk (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: 10Superpes15) [14:15:02] (03PS2) 10Muehlenhoff: Remove maps-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1202689 (https://phabricator.wikimedia.org/T381565) [14:15:20] (03CR) 10Elukey: [C:03+1] move_server: filter out spine switches when getting rack switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202716 (owner: 10Cathal Mooney) [14:15:51] (03CR) 10Cathal Mooney: [C:03+2] move_server: filter out spine switches when getting rack switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202716 (owner: 10Cathal Mooney) [14:15:59] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1202717|Update types 
for WatchArticleHook/UnwatchArticleHook]], [[gerrit:1202709|LQT Import: Fix quadratic time explosion in finding next offset (T405080)]] [14:16:02] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [14:16:13] edsanders / Tchanders: is the Flow backport testable on WikimediaDebug? or is the code only reachable from a maintenance script? [14:16:20] (it’s not on WikimediaDebug yet, just asking in advance ^^) [14:16:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P85010 and previous config saved to /var/cache/conftool/dbconfig/20251106-141635-marostegui.json [14:16:47] Maintenance script [14:16:55] ack [14:16:59] then I’ll just sync it directly [14:17:05] Thanks [14:17:37] (03CR) 10Muehlenhoff: [C:03+2] Remove maps-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1202689 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:17:51] (03CR) 10TChin: [eventgate] Split alerts into global and per-site alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [14:17:53] (03Merged) 10jenkins-bot: move_server: filter out spine switches when getting rack switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202716 (owner: 10Cathal Mooney) [14:18:00] (03CR) 10TChin: [C:03+2] [eventgate] Split alerts into global and per-site alerts [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [14:18:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, tchanders: Backport for [[gerrit:1202717|Update types for WatchArticleHook/UnwatchArticleHook]], [[gerrit:1202709|LQT Import: Fix quadratic time explosion in finding next offset (T405080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:19:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, tchanders: Continuing with sync [14:19:18] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:19:20] (03Merged) 10jenkins-bot: [eventgate] Split alerts into global and per-site alerts [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [14:19:32] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:20:01] jouncebot: nowandnext [14:20:01] For the next 0 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1400) [14:20:01] In 1 hour(s) and 9 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1530) [14:20:01] !log move private1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr2-eqiad (T405579) [14:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:04] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [14:20:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [14:20:52] (03CR) 10Slyngshede: [V:03+2 C:03+2] CAS version 7.2.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [14:20:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11349773 (10phaultfinder) [14:23:06] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:23:25] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: 
Backport for [[gerrit:1202717|Update types for WatchArticleHook/UnwatchArticleHook]], [[gerrit:1202709|LQT Import: Fix quadratic time explosion in finding next offset (T405080)]] (duration: 07m 26s) [14:23:29] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [14:23:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:23:42] (03CR) 10Muehlenhoff: [C:04-1] admin: add abozorg-wmde to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan) [14:23:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: 10Superpes15) [14:24:05] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:33] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 2 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:24:45] (03Merged) 10jenkins-bot: [tcywiki] Add Portal and Draft namespaces and its talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202701 (https://phabricator.wikimedia.org/T409329) (owner: 10Superpes15) [14:24:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:24:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae3 (asw2-c-eqiad:ae2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:25:14] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1202701|[tcywiki] Add Portal and Draft namespaces and its talk (T409329)]] [14:25:17] T409329: Enable Portal and Draft namespaces on Tulu Wikipedia (tcywiki) - https://phabricator.wikimedia.org/T409329 [14:27:41] !log lucaswerkmeister-wmde@deploy2002 superpes, lucaswerkmeister-wmde: Backport for [[gerrit:1202701|[tcywiki] Add Portal and Draft namespaces and its talk (T409329)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:27:45] !log move private1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr1-eqiad (T405579) [14:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:48] Testing [14:27:49] T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 [14:28:33] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:29:06] Looks fine Lucas_WMDE :) [14:29:11] !log lucaswerkmeister-wmde@deploy2002 superpes, lucaswerkmeister-wmde: Continuing with sync [14:29:12] \o/ [14:29:26] Thanks for noticing the issue [14:29:45] I'm travelling rn so I didn't notice it at all :D [14:31:20] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device ssw1-a8-codfw [14:31:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a8-codfw [14:31:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P85012 and previous config saved to 
/var/cache/conftool/dbconfig/20251106-143142-marostegui.json [14:32:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2152 gradually with 4 steps - Migration of db2152.codfw.wmnet completed [14:32:51] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:33:21] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:33:26] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202701|[tcywiki] Add Portal and Draft namespaces and its talk (T409329)]] (duration: 08m 12s) [14:33:29] T409329: Enable Portal and Draft namespaces on Tulu Wikipedia (tcywiki) - https://phabricator.wikimedia.org/T409329 [14:34:05] Lucas_WMDE Thanks for your assistance as always! Don't forget to run NamespaceDupes.php even if the result should be negative :3 [14:34:14] yeah, I was about to do that ^^ [14:34:18] :P [14:34:24] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device ssw1-a1-codfw [14:34:33] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a1-codfw [14:34:42] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: namespaceDupes tcywiki --fix # T328207 [14:34:42] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b7-codfw [14:34:45] T328207: Change Namespace Aliases on diq.wikipedia - https://phabricator.wikimedia.org/T328207 [14:34:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae3 (asw2-c-eqiad:ae1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:34:51] !log elukey@cumin1003 END (PASS) - Cookbook 
sre.network.tls (exit_code=0) for network device lsw1-b7-codfw [14:35:21] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b8-codfw [14:35:30] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b8-codfw [14:35:41] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b6-codfw [14:35:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b6-codfw [14:36:12] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b4-codfw [14:36:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b4-codfw [14:37:04] (03PS1) 10Federico Ceratto: db2163: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202724 (https://phabricator.wikimedia.org/T406008) [14:37:38] !log installing bind security updates (client-side tools/libs only) [14:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:38:26] I’ll deploy a security fix while the window is still open [14:39:41] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b3-codfw [14:39:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b3-codfw [14:39:55] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b5-codfw [14:40:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b5-codfw [14:40:10] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-b2-codfw [14:40:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-b2-codfw [14:41:45] (03CR) 10Bking: [C:03+1] 
airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [14:42:08] (03CR) 10Bking: [C:03+1] airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [14:43:21] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:43:52] (03CR) 10Marostegui: [C:03+1] db2163: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202724 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [14:46:25] !log lucaswerkmeister-wmde Deployed security patch for T409423 [14:46:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T407997)', diff saved to https://phabricator.wikimedia.org/P85014 and previous config saved to /var/cache/conftool/dbconfig/20251106-144650-marostegui.json [14:46:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:47:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [14:47:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85015 and previous config saved to /var/cache/conftool/dbconfig/20251106-144714-marostegui.json [14:47:31] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:49:36] (03PS4) 10Muehlenhoff: osm_master: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) [14:49:44] !log fceratto@cumin1003 START 
- Cookbook sre.mysql.major-upgrade [14:49:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/0 (Core: asw2-c-eqiad:et-2/0/53 {#G2204190495000069}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:50:06] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2163 - Upgrading db2163.codfw.wmnet [14:50:42] “You can also use if you are not the owner of the offending file(s).” [14:50:43] wat [14:50:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2163 - Upgrading db2163.codfw.wmnet [14:50:49] (that’s two spaces between “use” and “if” there) [14:50:55] what is /srv/patches telling me to do [14:51:37] oh dear [14:51:41] lmaooo [14:51:42] > You can also use `sudo /usr/local/sbin/fix-staging-perms` if you are not [14:51:47] it’s actually running the command in backticks [14:51:50] is the problem fixed now by any chance? [14:51:52] .. yeah [14:52:14] idk if it’s fixed, I’m done committing now [14:52:17] what did you do? 
^^ [14:52:24] the pre-commit hook still looks the same to me [14:52:54] I didn't do anything, I was asking if the hook actually correctly ran the command instead of printing it [14:53:02] (03PS1) 10Btullis: WIP: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [14:53:14] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-a7-codfw [14:53:20] cmooney@cumin1003 netbox (PID 415692) is awaiting input [14:53:23] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a7-codfw [14:53:29] !log elukey@cumin1003 START - Cookbook sre.network.tls for network device lsw1-a8-codfw [14:53:35] I see [14:53:35] (03PS1) 10Lucas Werkmeister (WMDE): Fix /srv/patches pre-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1202727 [14:53:39] well I chmod’ed the file manually anyway [14:53:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a8-codfw [14:53:43] but it sounds like I wouldn’t have needed to [14:53:44] anyway, ^ [14:53:55] (03CR) 10Majavah: [C:03+2] Fix /srv/patches pre-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1202727 (owner: 10Lucas Werkmeister (WMDE)) [14:54:02] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns for row c gateway interfaces eqiad CRs - cmooney@cumin1003" [14:54:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change dns for row c gateway interfaces eqiad CRs - cmooney@cumin1003" [14:54:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:22] (03CR) 10Lucas Werkmeister (WMDE): Fix /srv/patches pre-commit hook (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1202727 (owner: 10Lucas Werkmeister (WMDE)) [14:54:28] (03CR) 10CI reject: [V:04-1] WIP: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [14:55:27] !log Deployed security patch for T409423 [14:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:39] * Lucas_WMDE done deploying [14:55:53] !log UTC afternoon backport+config window done [14:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:46] (03CR) 10Thcipriani: [C:03+1] "Good for me as deployment owner." [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [14:57:06] (03CR) 10Federico Ceratto: [C:03+2] db2163: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202724 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [14:57:10] fceratto@cumin1003 major-upgrade (PID 419692) is awaiting input [14:57:17] !log bump space for prometheus k8s-dse in eqiad [14:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:21] FIRING: [15x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:59:33] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-a5-codfw [14:59:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a5-codfw [14:59:52] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-a6-codfw [14:59:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a6-codfw [15:00:07] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-a4-codfw [15:00:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for 
network device lsw1-a4-codfw [15:00:25] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-a3-codfw [15:00:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a3-codfw [15:02:33] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-a2-codfw [15:02:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-a2-codfw [15:02:49] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device ssw1-f1-eqiad [15:02:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-f1-eqiad [15:03:20] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device ssw1-e1-eqiad [15:03:21] FIRING: [12x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:03:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-eqiad [15:04:05] FIRING: [12x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:05:09] (03PS1) 10Cathal Mooney: Eqiad row c: move vlan gateways to ports facing the Nokia spines [homer/public] - 10https://gerrit.wikimedia.org/r/1202729 (https://phabricator.wikimedia.org/T405579) [15:06:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85017 and previous config saved to /var/cache/conftool/dbconfig/20251106-150622-marostegui.json [15:06:26] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:08:12] (03PS1) 10Marostegui: site.pp: Remove old note [puppet] - 10https://gerrit.wikimedia.org/r/1202730 [15:08:21] FIRING: [3x] 
JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:21] FIRING: [12x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:08:51] (03CR) 10Marostegui: [C:03+2] site.pp: Remove old note [puppet] - 10https://gerrit.wikimedia.org/r/1202730 (owner: 10Marostegui) [15:10:06] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device cr2-eqsin [15:10:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin [15:10:46] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device fasw2-c1a-eqiad [15:10:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c1a-eqiad [15:13:21] FIRING: [12x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:14:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11350068 (10phaultfinder) [15:15:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11350074 (10cmooney) 05Open→03Resolved This is now complete. For now we will leave things as they are and... 
[15:17:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:17:42] (03PS1) 10Bking: w[cd]qs: Log more provenance headers [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) [15:17:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [15:18:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2163 gradually with 4 steps - Migration of db2163.codfw.wmnet completed [15:21:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P85019 and previous config saved to /var/cache/conftool/dbconfig/20251106-152129-marostegui.json [15:22:55] (03PS2) 10Bking: w[cd]qs: Log more provenance headers [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) [15:23:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [15:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11350129 (10phaultfinder) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1530) [15:31:15] (03CR) 10Brouberol: [C:03+1] w[cd]qs: Log more provenance headers [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [15:32:23] (03CR) 10Majavah: "My main concern is that this causes the cluster MTU to be different on the old and the new nodes, which might (or might not!) 
be a problem" [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) (owner: 10Filippo Giunchedi) [15:33:21] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:52] (03CR) 10Filippo Giunchedi: [V:03+1] "That's a good and fair point, I'll loop in Cathal to get his input too" [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) (owner: 10Filippo Giunchedi) [15:36:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P85022 and previous config saved to /var/cache/conftool/dbconfig/20251106-153636-marostegui.json [15:39:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:09] (03PS1) 10Andrew Bogott: pdns-recursor: replace webserver address setting in yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1202738 (https://phabricator.wikimedia.org/T381608) [15:45:30] (03CR) 10Ssingh: [C:03+1] "Yes, looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1202738 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [15:49:00] !log drop grants for dbprov1003 & dbprov2003 T403166 [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] T403166: Setup dbprov1007 & dbprov2007; prepare for decommission dbprov1003 & dbprov2003 - https://phabricator.wikimedia.org/T403166 [15:49:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202738 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [15:50:31] (03CR) 10Jcrespo: [C:03+2] mariadb: Remove grants for dbprov1003 & dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/1201595 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [15:51:12] (03CR) 10Bking: [C:03+2] w[cd]qs: Log more provenance headers [puppet] - 10https://gerrit.wikimedia.org/r/1202733 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [15:51:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85024 and previous config saved to /var/cache/conftool/dbconfig/20251106-155143-marostegui.json [15:51:47] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:51:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance [15:52:07] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cp1108.eqiad.wmnet with reason: C/D Migration [15:52:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T407997)', diff saved to https://phabricator.wikimedia.org/P85025 and previous config saved to /var/cache/conftool/dbconfig/20251106-155207-marostegui.json [15:52:52] !log cp1108 moving as part of migration [15:52:53] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:56] (03CR) 10Ottomata: "TIL! This is an LDAP group request GUI tool?!?" [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [15:53:17] (03CR) 10Ottomata: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [15:53:23] !log dancy@deploy2002 Installing scap version "4.225.0" for 2 host(s) [15:55:11] !log dancy@deploy2002 Installation of scap version "4.225.0" completed for 2 hosts [15:56:42] (03PS1) 10TChin: [eventgate] Fix lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1202743 (https://phabricator.wikimedia.org/T405952) [15:58:33] (03CR) 10Xcollazo: [C:03+1] [eventgate] Fix lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1202743 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [15:59:16] (03CR) 10TChin: [C:03+2] [eventgate] Fix lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1202743 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [15:59:44] (03CR) 10AikoChou: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [16:00:04] jeena and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1600). 
[16:00:49] (03CR) 10JavierMonton: [V:03+1] [eventgate] Fix lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1202743 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [16:01:11] (03Merged) 10jenkins-bot: [eventgate] Fix lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1202743 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [16:04:18] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2163 gradually with 4 steps - Migration of db2163.codfw.wmnet completed [16:04:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [16:04:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cp1109.eqiad.wmnet with reason: C/D Migration [16:04:45] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202744 [16:04:49] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202745 [16:05:13] (03CR) 10Andrew Bogott: [C:03+2] pdns-recursor: replace webserver address setting in yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1202738 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [16:09:25] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T409390#11350361 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power balanced. 
[16:09:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1082.eqiad.wmnet with reason: C/D Migration [16:10:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T407997)', diff saved to https://phabricator.wikimedia.org/P85027 and previous config saved to /var/cache/conftool/dbconfig/20251106-161045-marostegui.json [16:10:49] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:13:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:21] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:21] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:51] cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists boardvotetest and boardvote2007_test; (T297297) [16:15:51] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [16:16:12] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists boardvotetest and boardvote2007_test; (T297297) [16:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:20] !log installing sysstat security updates [16:16:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:31] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1083.eqiad.wmnet with reason: C/D Migration [16:17:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1083.eqiad.wmnet with reason: C/D Migration [16:18:44] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1087.eqiad.wmnet with reason: C/D Migration [16:19:26] (03CR) 10Dpogorzelski: "i would need a +2 from someone otherwise wi can't merge :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [16:21:53] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1088.eqiad.wmnet with reason: C/D Migration [16:23:59] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on dbprov1003.eqiad.wmnet with reason: C/D Migration [16:25:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P85028 and previous config saved to /var/cache/conftool/dbconfig/20251106-162552-marostegui.json [16:28:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on ms-be1082.eqiad.wmnet with reason: C/D Migration [16:28:47] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11350458 (10elukey) To keep archives happy: I added the uid to the `ops` ldap group as well! 
[16:30:55] (03PS1) 10Fabfur: external_clouds_vendors: add AppleBot [puppet] - 10https://gerrit.wikimedia.org/r/1202750 [16:30:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11350487 (10Jhancock.wm) @MatthewVernon drive has been replaced. [16:31:39] (03CR) 10CDanis: [C:03+1] external_clouds_vendors: add AppleBot [puppet] - 10https://gerrit.wikimedia.org/r/1202750 (owner: 10Fabfur) [16:32:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on ms-fe1019.eqiad.wmnet with reason: C/D Migration [16:32:44] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for maps2009.mgmt:22 - https://phabricator.wikimedia.org/T390659#11350514 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm server is decommed. [16:32:45] (03CR) 10Vgutierrez: [C:03+1] external_clouds_vendors: add AppleBot [puppet] - 10https://gerrit.wikimedia.org/r/1202750 (owner: 10Fabfur) [16:32:57] (03CR) 10Fabfur: [C:03+2] external_clouds_vendors: add AppleBot [puppet] - 10https://gerrit.wikimedia.org/r/1202750 (owner: 10Fabfur) [16:33:58] (03PS1) 10Federico Ceratto: db2154: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202752 (https://phabricator.wikimedia.org/T406008) [16:34:52] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [16:35:09] (03PS1) 10Bking: wdqs: fix access log formatting, don't log monitoring traffic [puppet] - 10https://gerrit.wikimedia.org/r/1202753 (https://phabricator.wikimedia.org/T408123) [16:35:40] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [16:35:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202753 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [16:37:23] (03CR) 10Marostegui: [C:03+1] db2154: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202752 
(https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:37:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1133.eqiad.wmnet with reason: C/D Migration [16:38:27] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:38:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2154 - Upgrading db2154.codfw.wmnet [16:39:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202331 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [16:39:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202301 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [16:39:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2154 - Upgrading db2154.codfw.wmnet [16:40:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1151.eqiad.wmnet with reason: C/D Migration [16:41:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P85030 and previous config saved to /var/cache/conftool/dbconfig/20251106-164100-marostegui.json [16:42:07] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11350548 (10elukey) New version of the two queries for the quarterly sloth panel: ` 1-( sum_over_time( ( slo:sli_error:ratio_rate1h{sloth_service="${service}",sloth_slo="${slo}"} * on() gro... 
[16:43:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1180.eqiad.wmnet with reason: C/D Migration [16:43:59] (03CR) 10Federico Ceratto: [C:03+2] db2154: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202752 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:45:35] fceratto@cumin1003 major-upgrade (PID 584766) is awaiting input [16:45:58] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1224.eqiad.wmnet with reason: C/D Migration [16:47:32] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11350568 (10elukey) One thing that I cannot solve is that `vector(${__to:date:seconds})` returns a unix ts for `Mon Dec 1 12:59:59 AM CET 2025` and `12` when selecting the month, while the time picker in grafa... [16:47:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1225.eqiad.wmnet with reason: C/D Migration [16:49:46] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1226.eqiad.wmnet with reason: C/D Migration [16:51:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 218422880 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:52:27] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on prometheus1008.eqiad.wmnet with reason: C/D Migration [16:52:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4386680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:52:52] !log drop backup grants from m* section primaries T403166 [16:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:55] T403166: Setup 
dbprov1007 & dbprov2007; prepare for decommission dbprov1003 & dbprov2003 - https://phabricator.wikimedia.org/T403166 [16:54:54] (03CR) 10Jdlrobson: [C:03+1] Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 (owner: 10Jdlrobson) [16:54:57] (03PS3) 10Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 [16:55:04] (03CR) 10Dpogorzelski: [C:03+2] ml-services: add aya-llm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202665 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [16:55:13] (03CR) 10Dpogorzelski: [C:03+2] knative-serving: add podspec features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [16:55:55] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on ms-fe1011.eqiad.wmnet with reason: C/D Migration [16:56:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T407997)', diff saved to https://phabricator.wikimedia.org/P85031 and previous config saved to /var/cache/conftool/dbconfig/20251106-165607-marostegui.json [16:56:11] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:56:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance [16:56:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T407997)', diff saved to https://phabricator.wikimedia.org/P85032 and previous config saved to /var/cache/conftool/dbconfig/20251106-165631-marostegui.json [17:00:05] jhathaway and moritzm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1700). 
[17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:21] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-presto1019.eqiad.wmnet with reason: C/D Migration [17:02:05] (03Merged) 10jenkins-bot: knative-serving: add podspec features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [17:02:09] (03PS1) 10Jcrespo: dbbackups: Remove dbprov1003 & dbprov2003 role and set them "insetup" [puppet] - 10https://gerrit.wikimedia.org/r/1202754 (https://phabricator.wikimedia.org/T403166) [17:03:55] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:06:13] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2154 gradually with 4 steps - Migration of db2154.codfw.wmnet completed [17:07:25] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2090-4 to codfw - jhancock@cumin1003" [17:07:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2090-4 to codfw - jhancock@cumin1003" [17:07:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:59] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2090 [17:08:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2090 [17:08:12] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2091 [17:08:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2091 [17:08:25] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2092 [17:08:35] !log jhancock@cumin1003 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2092 [17:08:37] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2093 [17:08:48] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2093 [17:08:51] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2094 [17:09:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2094 [17:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1222.eqiad.wmnet with reason: C/D Migration [17:11:28] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2090.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:11:49] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2091.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2092.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on an-worker1223.eqiad.wmnet with reason: C/D Migration [17:12:20] (03PS1) 10Btullis: Add a local_upstream_port parameter to the mesh.configuration module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) [17:12:32] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host 
ms-be2093.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:01] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:56] (03CR) 10Btullis: Add a local_upstream_port parameter to the mesh.configuration module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [17:14:20] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on ml-cache1002.eqiad.wmnet with reason: C/D Migration [17:15:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T407997)', diff saved to https://phabricator.wikimedia.org/P85034 and previous config saved to /var/cache/conftool/dbconfig/20251106-171505-marostegui.json [17:15:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:15:34] (03CR) 10Federico Ceratto: [C:03+1] "I checked that the names are consistent across description, related task, and code change and that the hosts are not pooled in as dbs in d" [puppet] - 10https://gerrit.wikimedia.org/r/1202754 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [17:16:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1080.eqiad.wmnet with reason: C/D Migration [17:17:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1081.eqiad.wmnet with reason: C/D Migration [17:18:48] jhancock@cumin1003 provision (PID 619408) is awaiting input [17:19:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1118.eqiad.wmnet with reason: C/D Migration [17:19:52] !log robh@cumin2002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 0:15:00 on cirrussearch1119.eqiad.wmnet with reason: C/D Migration [17:21:48] !log multiple moves from C/D per T405942 [17:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:52] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [17:22:59] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [17:23:56] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:25:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on relforge1008.eqiad.wmnet with reason: C/D Migration [17:28:05] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on wdqs1014.eqiad.wmnet with reason: C/D Migration [17:28:20] jhancock@cumin1003 provision (PID 618646) is awaiting input [17:30:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P85036 and previous config saved to /var/cache/conftool/dbconfig/20251106-173013-marostegui.json [17:30:27] (03CR) 10Jcrespo: [C:03+1] "Thank you very much for the review, will merge tomorrow morning." [puppet] - 10https://gerrit.wikimedia.org/r/1202754 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [17:31:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mc1049.eqiad.wmnet with reason: C/D Migration [17:31:57] (03CR) 10Dzahn: [C:04-1] "both links redirect to the main .org page (since the redirects were added on their side).. 
so I would say you can abandon this one now" [puppet] - 10https://gerrit.wikimedia.org/r/1201689 (https://phabricator.wikimedia.org/T407579) (owner: 10Aklapper) [17:33:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mc1050.eqiad.wmnet with reason: C/D Migration [17:37:12] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cp1110.eqiad.wmnet with reason: C/D Migration [17:38:37] !log shutting down people1004 and people2003 - had already shut them down on Oct 29 but someone or something booted them again T408713 [17:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:40] T408713: decom old people VMs / finish people host upgrade - https://phabricator.wikimedia.org/T408713 [17:39:21] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on people2003.codfw.wmnet with reason: decom [17:39:36] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on people1004.eqiad.wmnet with reason: decom [17:42:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts people2003.codfw.wmnet [17:44:19] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-11-06-123623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202761 [17:45:05] jhancock@cumin1003 provision (PID 618646) is awaiting input [17:45:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P85038 and previous config saved to /var/cache/conftool/dbconfig/20251106-174521-marostegui.json [17:45:27] dzahn@cumin2002 decommission (PID 66332) is awaiting input [17:46:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2090.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:51] !log jhancock@cumin1003 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2093.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:59] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2092.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2091.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:10] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-11-06-123623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202761 (owner: 10BryanDavis) [17:48:49] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-11-06-123623-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202761 (owner: 10BryanDavis) [17:50:23] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts people2003.codfw.wmnet [17:51:41] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2154 gradually with 4 steps - Migration of db2154.codfw.wmnet completed [17:51:41] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:52:02] when the user is asked if they want to abort and they say they DO want to abort.. that should not be called a FAILURE [17:53:00] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on apus-fe1003.eqiad.wmnet with reason: C/D Migration [17:53:07] anyone here by chance who deploys changes to the machinetranslations service?
[17:54:34] (03CR) 10Bking: [C:03+2] wdqs: fix access log formatting, don't log monitoring traffic [puppet] - 10https://gerrit.wikimedia.org/r/1202753 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [17:54:47] (03CR) 10Bking: [C:03+2] "tested and confirmed working on wdqs2008" [puppet] - 10https://gerrit.wikimedia.org/r/1202753 (https://phabricator.wikimedia.org/T408123) (owner: 10Bking) [17:55:44] (03CR) 10CDanis: [C:03+1] Add a local_upstream_port parameter to the mesh.configuration module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [18:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1800). [18:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1800). 
[18:00:14] o/ [18:00:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T407997)', diff saved to https://phabricator.wikimedia.org/P85040 and previous config saved to /var/cache/conftool/dbconfig/20251106-180028-marostegui.json [18:00:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:00:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [18:00:46] o/ I will be shipping a new developer-portal build [18:00:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T407997)', diff saved to https://phabricator.wikimedia.org/P85041 and previous config saved to /var/cache/conftool/dbconfig/20251106-180052-marostegui.json [18:01:27] (03CR) 10Scott French: [C:03+2] deployment_server: migrate mw-wikifunctions to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202315 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:01:37] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1066.eqiad.wmnet with reason: C/D Migration [18:02:00] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:02:19] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 25% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202321 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:02:23] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:04:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1086.eqiad.wmnet with reason: C/D Migration [18:04:13] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 25% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202321 
(https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:05:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1092.eqiad.wmnet with reason: C/D Migration [18:06:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:06:40] !log Rack C2 C/D switch migrations in progress [18:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:50] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:06:59] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:07:17] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:09:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:09:33] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:09:37] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:09:39] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:09:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1132.eqiad.wmnet with reason: C/D Migration [18:09:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:09:58] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:10:03] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:10:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:10:21] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:10:35] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:10:46] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-wikifunctions to PHP 8.3 - T405955 
[18:10:49] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:11:18] !log swfrench@deploy2002 Stopping before sync operations [18:11:45] 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11350945 (10cmooney) So this is causing a lot of logspam on our Nokia switches right now. What I've noticed before is that our hosts tend to alternate between two LLDP neighbors co... [18:12:12] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1220.eqiad.wmnet with reason: C/D Migration [18:14:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1221.eqiad.wmnet with reason: C/D Migration [18:14:56] (03CR) 10Dzahn: "any feedback here? per https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service adding data to etcd shoud be like the first" [puppet] - 10https://gerrit.wikimedia.org/r/1197657 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:15:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1234.eqiad.wmnet with reason: C/D Migration [18:15:35] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:15:47] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:15:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11350953 (10cmooney) Regarding the an-presto move issue the link came up both sides I can see in the logs. I do notice th... 
[18:15:54] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:16:10] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:17:06] (03CR) 10Btullis: [C:03+2] Add a local_upstream_port parameter to the mesh.configuration module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [18:17:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on druid1012.eqiad.wmnet with reason: C/D Migration [18:17:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:18:10] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:18:17] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:18:32] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:18:39] (03CR) 10RLazarus: [C:03+1] Add a local_upstream_port parameter to the mesh.configuration module (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [18:18:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on maps1013.eqiad.wmnet with reason: C/D Migration [18:18:59] (03Merged) 10jenkins-bot: Add a local_upstream_port parameter to the mesh.configuration module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [18:19:25] (03CR) 10RLazarus: "Oops, we crossed in-flight! Fine to disregard both of those comments." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1202755 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [18:19:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T407997)', diff saved to https://phabricator.wikimedia.org/P85042 and previous config saved to /var/cache/conftool/dbconfig/20251106-181944-marostegui.json [18:19:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:20:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kafka-logging1002.eqiad.wmnet with reason: C/D Migration [18:22:25] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [18:23:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [18:24:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1045.eqiad.wmnet with reason: C/D Migration [18:25:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2090.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:25:43] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2091.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:26:10] (03PS1) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens on beta & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) [18:26:20] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1046.eqiad.wmnet with reason: C/D Migration [18:26:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2092.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:27:05] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [18:27:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2093.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:27:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1047.eqiad.wmnet with reason: C/D Migration [18:28:23] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1048.eqiad.wmnet with reason: C/D Migration [18:31:13] (03PS2) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) [18:31:13] (03PS1) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) [18:33:40] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [18:34:17] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [18:34:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P85043 and previous config saved to /var/cache/conftool/dbconfig/20251106-183452-marostegui.json [18:38:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on krb1002.eqiad.wmnet with reason: C/D Migration [18:39:46] * swfrench-wmf is done with changes planned for this infra window [18:41:26] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aqs1018.eqiad.wmnet with reason: C/D Migration [18:42:20] !log 
jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2091.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:43:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2092.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:43:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2090.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:43:39] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2093.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:44:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1168.eqiad.wmnet with reason: C/D Migration [18:44:18] !log C5 eqiad c/d server switch migrations in progress [18:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:44] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1169.eqiad.wmnet with reason: C/D Migration [18:47:10] (03PS1) 10Clare Ming: Test Kitchen: Deploying v1.1.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202770 (https://phabricator.wikimedia.org/T404458) [18:49:24] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1217.eqiad.wmnet with reason: C/D Migration [18:49:47] (03PS1) 10Clare Ming: Test Kitchen: Deploying v1.1.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202772 (https://phabricator.wikimedia.org/T404458) [18:49:55] (03PS1) 10Federico Ceratto: db2152: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202773 (https://phabricator.wikimedia.org/T406008) [18:49:59] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P85044 and previous config saved to /var/cache/conftool/dbconfig/20251106-184958-marostegui.json [18:49:59] (03PS1) 10Federico Ceratto: db2164: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202774 (https://phabricator.wikimedia.org/T406008) [18:50:03] (03PS1) 10Federico Ceratto: db2166: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202775 (https://phabricator.wikimedia.org/T406008) [18:50:07] (03PS1) 10Federico Ceratto: db2167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202776 (https://phabricator.wikimedia.org/T406008) [18:50:15] (03PS1) 10Federico Ceratto: db2181: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202777 (https://phabricator.wikimedia.org/T406008) [18:50:25] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1218.eqiad.wmnet with reason: C/D Migration [18:50:30] (03CR) 10Ssingh: "Thanks. The way we do the roll-out for this is essentially get all the patches ready and then push them one by one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1197657 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:50:35] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2090'] [18:50:41] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2091'] [18:50:48] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2092'] [18:50:50] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2090'] [18:50:54] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2093'] [18:51:00] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2091'] [18:51:05] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2092'] [18:51:09] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2093'] [18:51:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1262.eqiad.wmnet with reason: C/D Migration [18:51:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2090.codfw.wmnet with OS bullseye [18:52:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2090.codfw.wmnet with OS bullseye [18:52:10] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2091.codfw.wmnet with OS bullseye [18:52:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - 
https://phabricator.wikimedia.org/T405958#11351131 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2091.codfw.wmnet with OS bullseye [18:52:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2092.codfw.wmnet with OS bullseye [18:52:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351133 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2092.codfw.wmnet with OS bullseye [18:52:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2093.codfw.wmnet with OS bullseye [18:52:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2093.codfw.wmnet with OS bullseye [18:53:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351139 (10Jhancock.wm) note to self: 90 needs the mgmt connection checked for connectivity. 
[18:53:27] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1045.eqiad.wmnet with reason: C/D Migration [18:54:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351141 (10Jhancock.wm) [18:55:29] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on sessionstore1005.eqiad.wmnet with reason: C/D Migration [18:57:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-test-worker1002.eqiad.wmnet with reason: C/D Migration [19:00:04] jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T1900). Please do the needful. [19:01:54] ATTENTION: Do not run puppet for the next 10 minutes. We are migrating puppetserver1001's primary network connection from the old to new switch stacks. Less than 3 seconds of disruption to connectivity is expected. [19:02:12] will update when done [19:02:22] robh: should I wait to run the train as well? [19:02:43] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on puppetserver1001.eqiad.wmnet with reason: C/D Migration [19:02:50] nah [19:02:55] train is medwploy shouldn't matter [19:03:02] and this is literally 3 seconds of network blip [19:03:36] I'm honestly being overly paranoid echoing it at all, we've now moved a couple dozen today with no blips ; D [19:03:52] and it's moved [19:04:07] puppetserver1001 move complete [19:04:36] okay thanks! 
[19:05:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T407997)', diff saved to https://phabricator.wikimedia.org/P85045 and previous config saved to /var/cache/conftool/dbconfig/20251106-190506-marostegui.json [19:05:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:05:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:05:36] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wcqs1003.eqiad.wmnet with reason: C/D Migration [19:05:45] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202779 (https://phabricator.wikimedia.org/T408271) [19:05:48] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202779 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:06:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wdqs1013.eqiad.wmnet with reason: C/D Migration [19:07:14] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202779 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:15:02] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.1 refs T408271 [19:15:06] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [19:16:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure 
Foundations host migrations - https://phabricator.wikimedia.org/T405945#11351182 (10RobH) Day 1 of migrations update: * 56 hosts moved total * We focused on moving hosts that did not require any specific scheduling wi... [19:16:40] (03PS1) 10Catrope: AccountRecovery: Use canonical URL in confirmation email [extensions/EmailAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202780 [19:17:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/EmailAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202780 (owner: 10Catrope) [19:17:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/EmailAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202780 (owner: 10Catrope) [19:18:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202346 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [19:18:10] !log disable-puppet on A:cp hosts for haproxy config change [19:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:31] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11351189 (10Andrew) For future debug research: We can prevent the final reboot after a reimage like this: - disable puppet on apt1002 - comment out the two reboot_in_p... [19:19:31] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1202306 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:19:35] (03CR) 10Scott French: [C:03+2] hiera: enable haproxy known-client identification [puppet] - 10https://gerrit.wikimedia.org/r/1202306 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:19:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [19:21:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:21:44] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage [19:25:04] PROBLEM - Host lsw1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:25:04] PROBLEM - Host lsw1-d6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:04] PROBLEM - Host lsw1-d6-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:26:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:27:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:27:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage [19:29:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage [19:30:06] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - 
https://phabricator.wikimedia.org/T408378#11351215 (10cmooney) FWIW this will need further investigation, I've reset a bunch of these switches which will cause the scenario the alerts should fire, but I... [19:31:16] !log rolling run-puppet-agent on A:cp hosts for haproxy config change [19:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:35] (03PS1) 10Dzahn: allocate eqiad VIP for load balanced tcp-proxy service [dns] - 10https://gerrit.wikimedia.org/r/1202782 (https://phabricator.wikimedia.org/T408532) [19:33:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage [19:34:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2092.codfw.wmnet with reason: host reimage [19:34:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage [19:35:22] FIRING: GnmiTargetDown: lsw1-d6-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [19:36:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1251.eqiad.wmnet with reason: Maintenance [19:37:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T407997)', diff saved to https://phabricator.wikimedia.org/P85046 and previous config saved to /var/cache/conftool/dbconfig/20251106-193705-marostegui.json [19:37:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:37:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2092.codfw.wmnet with reason: host reimage [19:38:16] (03PS2) 10Aaron Schulz: Route 
/page/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1199033 (https://phabricator.wikimedia.org/T384216) [19:39:25] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on lsw1-d6-eqiad,lsw1-d6-eqiad IPv6,lsw1-d6-eqiad.mgmt with reason: told switch to reboot and its stuck in UEFI shell [19:39:59] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage [19:42:06] (03PS3) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1194995 (https://phabricator.wikimedia.org/T385066) [19:43:41] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [19:43:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11351297 (10wiki_willy) @Jclark-ctr - can you help out @Marostegui with getting a RMA for the DIMM? >>! In T409374#11348009, @Marostegui wrote: > The host went down, so i... 
[19:44:13] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev.eqiad.wmnet'] [19:46:10] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:47:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:47:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:50:48] 06SRE, 06Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186#11351320 (10Krinkle) >>! In T74186#9357590, @Tgr wrote: > […] Varnish [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/78... 
[19:52:50] (03PS2) 10Aaron Schulz: Route /page/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) [19:53:00] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:55:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:55:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2092.codfw.wmnet with OS bullseye [19:55:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11351336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2092.codfw.wmnet with OS bullseye complet... [19:55:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T407997)', diff saved to https://phabricator.wikimedia.org/P85047 and previous config saved to /var/cache/conftool/dbconfig/20251106-195557-marostegui.json [19:56:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:57:49] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11351345 (10cmooney) Small update, right now lsw1-d6-eqiad is broken. So this alert should be present for ssw1-d1-eqiad and ssw1-d8-eqiad. 
[20:11:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P85048 and previous config saved to /var/cache/conftool/dbconfig/20251106-201105-marostegui.json [20:14:05] FIRING: [4x] SystemdUnitFailed: docker-registry.service on registry1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:15] andrew@cumin2002 upgrade-firmware (PID 109745) is awaiting input [20:26:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P85049 and previous config saved to /var/cache/conftool/dbconfig/20251106-202612-marostegui.json [20:33:11] (03PS1) 10Ladsgroup: Revert "RestrictionStore: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 [20:33:19] (03PS1) 10Ladsgroup: Revert "BacklinkCache: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202801 [20:33:33] (03CR) 10Ladsgroup: [C:03+2] Revert "RestrictionStore: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 (owner: 10Ladsgroup) [20:33:36] (03CR) 10Ladsgroup: [C:03+2] Revert "BacklinkCache: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202801 (owner: 10Ladsgroup) [20:41:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T407997)', diff saved to https://phabricator.wikimedia.org/P85050 and previous config saved to /var/cache/conftool/dbconfig/20251106-204120-marostegui.json [20:41:25] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [20:41:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:47:49] (03PS1) 10Clare Ming: Re-run xLab MW Module Loaded experiment v2 [extensions/MetricsPlatform] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202807 (https://phabricator.wikimedia.org/T401705) [20:48:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MetricsPlatform] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202807 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [20:49:31] (03CR) 10CI reject: [V:04-1] Revert "RestrictionStore: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 (owner: 10Ladsgroup) [20:50:01] (03CR) 10Ladsgroup: [C:03+2] "hit me baby one more time" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 (owner: 10Ladsgroup) [20:52:13] (03Merged) 10jenkins-bot: Revert "BacklinkCache: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202801 (owner: 10Ladsgroup) [20:55:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 (owner: 10Ladsgroup) [20:56:29] (03CR) 10BCornwall: [C:03+1] allocate eqiad VIP for load balanced tcp-proxy service [dns] - 10https://gerrit.wikimedia.org/r/1202782 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T2100). 
nyaa~ [21:00:05] AaronSchulz, kemayo, tgr, RoanKattouw, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:07] o/ I have two that I can deploy myself together. [21:00:15] Go for it [21:00:18] o/ [21:00:30] Then I can eat lunch before deploying mine (and/or the others) [21:01:14] lunch! good idea [21:01:47] Hm, spiderpig let me start, but "backport is locked by ladsgroup (pid 490376)" [21:01:59] o/ [21:03:03] He is in the middle of deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1202800 I think [21:03:27] (03Merged) 10jenkins-bot: Revert "RestrictionStore: Switch order between pr_cascade and links queries" [core] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202800 (owner: 10Ladsgroup) [21:03:27] I am too trusting of spiderpig's ability to notice other deployments, clearly. [21:03:29] (03CR) 10Dzahn: [C:03+2] allocate eqiad VIP for load balanced tcp-proxy service [dns] - 10https://gerrit.wikimedia.org/r/1202782 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [21:03:35] !log dzahn@dns1004 START - running authdns-update [21:03:41] Well, it'll work itself out -- it says it's polling on the lock. [21:03:52] Kemayo: We'll improve that some day. 
[21:04:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1202801|Revert "BacklinkCache: Switch order between pr_cascade and links queries"]], [[gerrit:1202800|Revert "RestrictionStore: Switch order between pr_cascade and links queries"]] [21:07:03] !log dzahn@dns1004 END - running authdns-update [21:08:41] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:10:35] 06SRE, 06Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186#11351578 (10Tgr) FWIW this was fixed at some point in the past (per {T351988}). [21:11:47] 06SRE, 06Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186#11351596 (10Tgr) 05Open→03Resolved (And in any case we don't have mobile redirects anymore.) 
[21:13:20] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added VIP for tcpproxy service in eqiad - dzahn@cumin2002" [21:13:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added VIP for tcpproxy service in eqiad - dzahn@cumin2002" [21:13:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:13:44] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11351612 (10colewhite) In today's case, the alert criteria wasn't met because the metrics [[ https://grafana-rw.wikimedia.org/explore?schemaVersion=1&panes=%7B%... [21:17:16] (03PS1) 10Jgreen: nsca_frack.cfg.erb remove deprecated check_endpoints service check [puppet] - 10https://gerrit.wikimedia.org/r/1202827 (https://phabricator.wikimedia.org/T367370) [21:22:11] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: power up failure for an-worker1132.eqiad.wmnet - https://phabricator.wikimedia.org/T405221#11351677 (10RobH) Please note this server was never returned from failed to active status in netbox, and caused an issue during migration of switches earlier today... [21:26:45] Is this an uncommonly long sync-world, or has the logging run into issues? [21:27:26] I don't know why either [21:27:42] but it's moving forward [21:28:04] Fair! I just wanted to make sure it hadn't finished and just failed to announce it. [21:29:49] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1202801|Revert "BacklinkCache: Switch order between pr_cascade and links queries"]], [[gerrit:1202800|Revert "RestrictionStore: Switch order between pr_cascade and links queries"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:29:56] Looks like there was a full l10n rebuild. [21:31:17] Huh. That did not look like the kind of patches that would require that. [21:31:21] nevermind. I was looking at the wrong stuff. [21:32:18] (03PS3) 10Tim Starling: Add English translations to namespaces that lack them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) [21:32:18] (03PS3) 10Tim Starling: Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) [21:32:21] No, I was looking at the right stuff: 542 languages rebuilt out of 542 [21:32:28] dancy: yeah, no - I think you're right [21:34:38] (03PS1) 10Dzahn: allocate codfw VIP for load-balanced tcp-proxy service [dns] - 10https://gerrit.wikimedia.org/r/1202835 (https://phabricator.wikimedia.org/T408532) [21:34:56] Weird, because Amir's patch didn't touch anything i18n-related [21:35:34] (03CR) 10Dzahn: "same thing as earlier but for codfw - the sre.dns.netbox cookbook will be run after authdns-update" [dns] - 10https://gerrit.wikimedia.org/r/1202835 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [21:38:47] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [21:39:39] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice-archive: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11351698 (10Tgr) The PR looks right to me; not sure if there's an easy way to verify the header is applied to all requests, other than removing... [21:40:12] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:43:12] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:48:10] (03PS1) 10Dzahn: service: add tcpproxy service to service catalog (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) [21:54:36] Amir1: how's your sync doing? I'm a bit concerned that it's taking quite this long [21:55:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:55:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:55:15] almost there [21:55:17] can our window be pushed back an hour? if the web team isn't using their window? [21:55:18] 84% done [21:55:47] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:55:51] Well, there's no such thing as a web team now, so... ;-) [21:56:07] 87% [21:59:32] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202801|Revert "BacklinkCache: Switch order between pr_cascade and links queries"]], [[gerrit:1202800|Revert "RestrictionStore: Switch order between pr_cascade and links queries"]] (duration: 55m 26s) [21:59:45] wowee [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T2200) [22:00:22] Well, that might be a record. 
[22:00:47] RESOLVED: [4x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:00:48] (03CR) 10Krinkle: [C:03+1] Add English translations to namespaces that lack them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) (owner: 10Tim Starling) [22:00:56] (03CR) 10Krinkle: [C:03+1] Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) (owner: 10Tim Starling) [22:02:12] I suppose I can be polite and wait the remaining 4 minutes the deployments page says to for the formerly-known-as-the-web-team(s) before having spiderpig try mine again. [22:05:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202331 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [22:05:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202301 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [22:06:59] (03Merged) 10jenkins-bot: Enable editcheck addReference a/b test on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202301 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [22:15:19] Kemayo: Could you ping me when you're done so that I can go after you? [22:15:30] RoanKattouw: sure thing [22:15:36] 55 minutes? yesterday I was admiring the 10 minute scaps and thinking that someone must have fixed something [22:17:41] RoanKattouw: can you ping me when you're done? 
if others in queue are not around, i'll go after you [22:18:36] oh - maybe tgr is here -- i can go last as i'm last in the queue - just lmk [22:19:23] if you have a few to go and each takes 55 minutes you should push them all out at once, if they're low risk [22:19:56] I think we're hoping that the 55 minutes one was a strange one-off. I guess we're about to find out. [22:20:13] mine is low risk - can go out with others [22:20:15] back in my day... [22:20:28] (03Merged) 10jenkins-bot: Edit check: allow any check to be an a/b test including default ones [extensions/VisualEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202331 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [22:20:49] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1202331|Edit check: allow any check to be an a/b test including default ones (T406134)]], [[gerrit:1202301|Enable editcheck addReference a/b test on enwiki (T406134)]] [22:20:52] T406134: Deploy config change to start the Reference Check A/B Test (en.wiki) - https://phabricator.wikimedia.org/T406134 [22:20:54] And here... we... go. [22:20:57] yeah I remember complaining when sync-file took 45 seconds, I was like come on, we can do better than that [22:21:49] who didn't love a completely pointless full i18n cache rebuild for multiple versions... [22:23:03] I was wondering why apt needs to run so many times to do an image rebuild, don't we have base images? [22:23:04] Okay, it has already started sync-testservers-k8s, so I think that's a good sign. 
[22:23:18] it runs and hits security.debian.org many times every time we do a scap [22:24:16] seems like it should at least be a local process -- if security.debian.org ever goes down nobody will know how to make scap work [22:24:55] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1202331|Edit check: allow any check to be an a/b test including default ones (T406134)]], [[gerrit:1202301|Enable editcheck addReference a/b test on enwiki (T406134)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:25:46] TimStarling: There are various image types (for example, debug images, images for dumps, etc) which all use the same files from /srv/mediawiki (which is the biggest component of the images). One option is to make various base images for these types and then rsync /srv/mediawiki into them. This means that when there is a full image build, there are multiple copies of ~8GB of data. We do things the other way around where we build the [22:25:46] image with the ~8GB of data, and then install the apt packages afterward to create the differentiated images. That is a smaller operation on average. [22:26:50] my change is functionally beta only, can go whenever [22:26:50] right [22:26:58] This is all because we use simple Dockerfile builds. Newer versions of Docker have features that we can leverage to improve the efficiency in this area. [22:27:57] !log kemayo@deploy2002 kemayo: Continuing with sync [22:31:57] the really slow parts of the image rebuild seem to be the docker push and the sleep 300, accounting for 10 minutes and 5 minutes respectively in Amir's scap [22:32:41] Ah yes.. the very painful 5 minute sleep. [22:33:21] aren't we doing that twice per deploy now, with 8.1 and 8.3 images both being built? [22:33:34] We do seem to be completely back to normal speed for my deploy, at least. [22:33:36] The sleeps happen in parallel for each full image build. 
[22:33:52] each full image push, I should say [22:34:41] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202331|Edit check: allow any check to be an a/b test including default ones (T406134)]], [[gerrit:1202301|Enable editcheck addReference a/b test on enwiki (T406134)]] (duration: 13m 52s) [22:34:44] T406134: Deploy config change to start the Reference Check A/B Test (en.wiki) - https://phabricator.wikimedia.org/T406134 [22:34:52] RoanKattouw: mine's done now. [22:34:59] Great, I'll start mine now [22:35:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/EmailAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202780 (owner: 10Catrope) [22:35:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202346 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [22:36:06] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [22:36:10] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:36:27] tgr: i can do yours and mine together after Roan [22:36:27] (03Merged) 10jenkins-bot: Enable Special:AccountRecovery everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202346 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [22:37:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:37:51] (03PS2) 10Scott French: mw-(api-ext|web): tune maxUnavailable and maxSurge for main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202856 
(https://phabricator.wikimedia.org/T405955) [22:38:27] (03Merged) 10jenkins-bot: AccountRecovery: Use canonical URL in confirmation email [extensions/EmailAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202780 (owner: 10Catrope) [22:38:45] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1202780|AccountRecovery: Use canonical URL in confirmation email]], [[gerrit:1202346|Enable Special:AccountRecovery everywhere (T399742)]] [22:38:48] so, another factor with the slowness is poor reuse of the (node-local) image cache during deployments, given the particular stage we're at in the 8.3 migration. [22:38:49] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [22:39:26] i.e., 8.1 images get less reuse, and thus (slow) image pulls are likely in the event of a full image build [22:39:34] thanks cjming [22:39:42] np [22:39:52] mine is low risk too [22:40:09] !log ryankemper@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [22:40:29] AaronSchulz: i can also include yours then too [22:40:49] sounds good [22:40:56] !log catrope@deploy2002 catrope: Backport for [[gerrit:1202780|AccountRecovery: Use canonical URL in confirmation email]], [[gerrit:1202346|Enable Special:AccountRecovery everywhere (T399742)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:41:05] is it ok to do 2 config patches and a 1.46.0-wmf.1 all in one? or should i just do the config patches together? 
[22:41:20] I'm doing a config patch and a wmf.1 patch together right now [22:41:25] cool [22:41:53] I would normally have sequenced them because one kind of depends on the other (fix a bug in wmf.1, then turn the feature on in config) but that would take like 15 minutes longer [22:42:22] then after you're done RoanKattouw, I'll do the rest of the patches in our window together (3 in all) [22:42:28] !log catrope@deploy2002 catrope: Continuing with sync [22:42:35] (03CR) 10Dr0ptp4kt: [C:03+1] Test Kitchen: Deploying v1.1.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202770 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [22:42:56] (03CR) 10Dr0ptp4kt: [C:03+1] Test Kitchen: Deploying v1.1.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202772 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [22:44:04] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): tune maxUnavailable and maxSurge for main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202856 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:46:29] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [22:46:32] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:47:40] (03PS2) 10Aaron Schulz: Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201826 [22:47:44] cjming: before you start the next scap run, would you mind if I merge a deployment-charts patch that should hopefully make things a bit faster? 
[22:48:03] swfrench-wmf: sure [22:48:13] (03PS3) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) [22:49:09] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202780|AccountRecovery: Use canonical URL in confirmation email]], [[gerrit:1202346|Enable Special:AccountRecovery everywhere (T399742)]] (duration: 10m 24s) [22:49:12] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [22:49:18] great, I might merge that now [22:49:22] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): tune maxUnavailable and maxSurge for main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202856 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:49:36] cjming: All yours [22:49:56] ....after swfrench-wmf is done that is [22:50:00] RoanKattouw: thanks! [22:50:15] ... at the speed of CI [22:50:25] (should be quick on this one) [22:50:30] (03PS4) 10Bking: opensearch-cluster: create separate user for operator and admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202769 (https://phabricator.wikimedia.org/T408919) [22:50:35] FIRING: DiskSpace: Disk space install1005:9100:/ 3.536% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:51:20] (03Merged) 10jenkins-bot: mw-(api-ext|web): tune maxUnavailable and maxSurge for main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202856 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:51:29] (03CR) 10Ryan Kemper: [C:03+1] opensearch-cluster: create separate user for operator and admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202769 (https://phabricator.wikimedia.org/T408919) (owner: 
10Bking) [22:51:52] (03PS5) 10Bking: opensearch-cluster: create separate user for operator and admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202769 (https://phabricator.wikimedia.org/T408919) [22:52:14] (03PS6) 10Bking: opensearch-cluster: create separate user for operator and admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202769 (https://phabricator.wikimedia.org/T408919) [22:52:22] cjming: you should be good now. thank you! [22:52:34] great ! thanks [22:53:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201826 (owner: 10Aaron Schulz) [22:53:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [22:53:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202807 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [22:54:10] what's the opposite of a full image rebuild? how does the fast path work? 
[22:54:25] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1121:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:49] (03Merged) 10jenkins-bot: Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201826 (owner: 10Aaron Schulz) [22:55:00] (03Merged) 10jenkins-bot: Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202766 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [22:56:02] TimStarling: so, the script that builds the mediawiki images uses a heuristic to tell whether it makes sense to use the previous image as a base and commit a new layer on top with difference [22:56:07] *the difference [22:56:28] so, if you see in the logs mention of incremental vs. 
full build, that's what it's doing [22:57:49] https://gitlab.wikimedia.org/repos/releng/release/-/blob/main/make-container-image/build_image_incr.py?ref_type=heads#L212 [22:59:25] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:02:02] right, so Amir's scap hit the slow path with ...is not suitable due to rsync transfer pct 49.648484885375765 (threshold is 25) [23:02:10] (03Merged) 10jenkins-bot: Re-run xLab MW Module Loaded experiment v2 [extensions/MetricsPlatform] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202807 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [23:02:29] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]] [23:02:33] T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199 [23:02:34] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [23:04:47] !log cjming@deploy2002 cjming, tgr, aaron: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:05:08] and I guess that was because of l10n? 
not sure where the logs are for the rest of scap outside of the image rebuild [23:05:34] AaronSchulz: do you need to check test servers? i will just sync otherwise [23:06:24] let's sync [23:06:30] !log cjming@deploy2002 cjming, tgr, aaron: Continuing with sync [23:07:35] (03CR) 10Bking: [C:03+2] opensearch-cluster: create separate user for operator and admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202769 (https://phabricator.wikimedia.org/T408919) (owner: 10Bking) [23:08:43] TimStarling: that was presumably the cause, yes, though still unclear as to why the l10n updates were triggered. ah, and the logs for the rest of the scap run are in logstash. [23:11:02] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]] (duration: 08m 34s) [23:11:10] T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199 [23:11:10] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [23:11:52] thanks all [23:12:59] !log end of UTC late backport window [23:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:20:10] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
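The incremental-versus-full decision described in the messages above can be sketched roughly as below. This is only an illustration of the idea, not the logic in build_image_incr.py: the function names, the byte-count inputs, and the choice of "at or below the threshold counts as suitable" are all assumptions. Only the threshold of 25 and the observed figure of about 49.65 percent come from the log.

```python
# Hypothetical sketch of a "reuse the previous image or rebuild from scratch" heuristic.
# THRESHOLD_PCT is the value quoted in the log; everything else is illustrative.
THRESHOLD_PCT = 25.0


def transfer_pct(bytes_to_transfer: int, total_bytes: int) -> float:
    """Share of the tree (in percent) that a sync would have to copy."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * bytes_to_transfer / total_bytes


def choose_build_mode(bytes_to_transfer: int, total_bytes: int,
                      threshold_pct: float = THRESHOLD_PCT) -> str:
    """Return "incremental" when the change is small enough to add as a layer
    on top of the previous image, otherwise "full"."""
    pct = transfer_pct(bytes_to_transfer, total_bytes)
    # Assumption: a value exactly at the threshold still counts as suitable.
    return "incremental" if pct <= threshold_pct else "full"


if __name__ == "__main__":
    # Roughly the ratio reported for the 55-minute deploy (about 49.65 percent).
    print(choose_build_mode(49_648, 100_000))
    # A small config-only change.
    print(choose_build_mode(1_000, 100_000))
```

With a ratio like the one reported for Amir's scap, a heuristic of this shape falls through to the full build, which is the slow path discussed above.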
[23:24:55] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:37:58] (03CR) 10Clare Ming: [C:03+2] Test Kitchen: Deploying v1.1.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202770 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [23:39:39] (03Merged) 10jenkins-bot: Test Kitchen: Deploying v1.1.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202770 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [23:43:40] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [23:44:10] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [23:45:55] jouncebot: nowandnext [23:45:55] No deployments scheduled for the next 7 hour(s) and 14 minute(s) [23:45:56] In 7 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T0700) [23:47:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:47:39] (03CR) 10Zabe: [C:03+2] Update for new WikimediaMaintenance script locations [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1202659 (owner: 10Zabe) [23:47:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:48:28] (03Merged) 10jenkins-bot: Update for new WikimediaMaintenance script locations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202659 (owner: 10Zabe) [23:48:50] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1202659|Update for new WikimediaMaintenance script locations]] [23:49:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:51:00] !log zabe@deploy2002 zabe: Backport for [[gerrit:1202659|Update for new WikimediaMaintenance script locations]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:51:48] !log zabe@deploy2002 zabe: Continuing with sync [23:56:06] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202659|Update for new WikimediaMaintenance script locations]] (duration: 07m 15s) [23:57:17] (03PS1) 10Zabe: mediawiki: Update location of startupregistrystats script [puppet] - 10https://gerrit.wikimedia.org/r/1202872 [23:57:17] (03PS1) 10Zabe: mediawiki: Update sendVerifyEmailReminderNotification script location [puppet] - 10https://gerrit.wikimedia.org/r/1202873 [23:57:36] swfrench-wmf: rsync -n doesn't actually look at the file contents, it's just comparing the timestamps [23:58:07] the l10n files don't change very much but rsync -n sees them as completely changed files
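
The point about rsync -n can be illustrated with a small simulation. By default rsync decides whether a file needs transfer from its size and modification time, not its contents, so regenerated files with new timestamps look changed even when most of them are byte-identical. The sketch below is a self-contained model of that effect, not a call into rsync: the FileInfo type, the helper names, and the sample file paths are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FileInfo:
    size: int
    mtime: float
    content: bytes


def make(content: bytes, mtime: float) -> FileInfo:
    return FileInfo(size=len(content), mtime=mtime, content=content)


def quick_check_changed(old: FileInfo, new: FileInfo) -> bool:
    """Models a size-and-mtime comparison: contents are never read."""
    return old.size != new.size or old.mtime != new.mtime


def checksum_changed(old: FileInfo, new: FileInfo) -> bool:
    """Models a content comparison via a hash of the file bytes."""
    return hashlib.sha256(old.content).digest() != hashlib.sha256(new.content).digest()


def changed_pct(old_tree: Dict[str, FileInfo], new_tree: Dict[str, FileInfo],
                differs: Callable[[FileInfo, FileInfo], bool]) -> float:
    """Percent of bytes in new_tree that the given comparison flags as changed."""
    total = sum(f.size for f in new_tree.values())
    if total == 0:
        return 0.0
    changed = sum(
        f.size for path, f in new_tree.items()
        if path not in old_tree or differs(old_tree[path], f)
    )
    return 100.0 * changed / total


# Two cache files regenerated at a later time; only one actually differs in content.
old = {"l10n/en.cache": make(b"a" * 100, 1.0), "l10n/de.cache": make(b"b" * 100, 1.0)}
new = {"l10n/en.cache": make(b"a" * 100, 2.0), "l10n/de.cache": make(b"c" * 100, 2.0)}

if __name__ == "__main__":
    print(changed_pct(old, new, quick_check_changed))  # timestamp-based estimate
    print(changed_pct(old, new, checksum_changed))     # content-based estimate
```

In this toy case the timestamp-based estimate flags every byte while the content-based one flags half, which is the kind of gap that could push a percentage-based heuristic over its threshold after a cache rebuild.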