[00:02:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:02:53] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1058280 (owner: TrainBranchBot)
[00:03:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:03:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1248.eqiad.wmnet with OS bullseye
[00:03:26] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030299 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1248.eqiad.wmnet with OS bullseye...
[00:03:45] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10030300 (Jclark-ctr)
[02:04:22] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[02:29:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[02:29:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T367856)', diff saved to https://phabricator.wikimedia.org/P67086 and previous config saved to /var/cache/conftool/dbconfig/20240731-022920-marostegui.json
[02:29:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[02:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:39] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:35:39] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:22] RESOLVED: SystemdUnitFailed:
sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s6 T371365
[04:34:54] T371365: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T371365
[04:35:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1201 with weight 0 T371365', diff saved to https://phabricator.wikimedia.org/P67087 and previous config saved to /var/cache/conftool/dbconfig/20240731-043459-marostegui.json
[04:35:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s6 T371365
[04:35:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1201 from API/vslow/dump T371365', diff saved to https://phabricator.wikimedia.org/P67088 and previous config saved to /var/cache/conftool/dbconfig/20240731-043528-marostegui.json
[04:36:36] (CR) Marostegui: [C:+2] mariadb: Promote db1201 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1058141 (https://phabricator.wikimedia.org/T371365) (owner: Gerrit maintenance bot)
[04:37:02] (PS2) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1058142 (https://phabricator.wikimedia.org/T371365)
[04:45:32] (PS1) Marostegui: installserver: Do not reimace pc1017 [puppet] - https://gerrit.wikimedia.org/r/1058289
[04:48:55] (CR) Marostegui: [C:+2] installserver: Do not reimace pc1017 [puppet] - https://gerrit.wikimedia.org/r/1058289 (owner: Marostegui)
[04:49:37] !log Starting s6 eqiad failover from db1173 to db1201 - T371365
[04:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:48] T371365: Switchover s6 master (db1173
-> db1201) - https://phabricator.wikimedia.org/T371365
[04:49:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T371365', diff saved to https://phabricator.wikimedia.org/P67089 and previous config saved to /var/cache/conftool/dbconfig/20240731-044954-root.json
[04:50:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1201 to s6 primary and set section read-write T371365', diff saved to https://phabricator.wikimedia.org/P67090 and previous config saved to /var/cache/conftool/dbconfig/20240731-045023-root.json
[04:50:26] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[04:51:17] (CR) Marostegui: [C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1058142 (https://phabricator.wikimedia.org/T371365) (owner: Gerrit maintenance bot)
[04:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1173 T371365', diff saved to https://phabricator.wikimedia.org/P67091 and previous config saved to /var/cache/conftool/dbconfig/20240731-045158-marostegui.json
[04:55:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s8 T371368
[04:55:53] T371368: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T371368
[04:56:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T371368
[04:56:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67092 and previous config saved to /var/cache/conftool/dbconfig/20240731-045623-root.json
[04:56:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1193 with weight 0 T371368', diff saved to https://phabricator.wikimedia.org/P67093 and previous config saved to /var/cache/conftool/dbconfig/20240731-045631-root.json
[04:56:49] !log
marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1193 from API/vslow/dump T371368', diff saved to https://phabricator.wikimedia.org/P67094 and previous config saved to /var/cache/conftool/dbconfig/20240731-045649-root.json
[04:57:51] (PS2) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1058150 (https://phabricator.wikimedia.org/T371368)
[04:58:05] (CR) Marostegui: [C:+2] mariadb: Promote db1193 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1058149 (https://phabricator.wikimedia.org/T371368) (owner: Gerrit maintenance bot)
[04:58:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T367856)', diff saved to https://phabricator.wikimedia.org/P67095 and previous config saved to /var/cache/conftool/dbconfig/20240731-045832-marostegui.json
[04:58:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:11:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67096 and previous config saved to /var/cache/conftool/dbconfig/20240731-051129-root.json
[05:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P67097 and previous config saved to /var/cache/conftool/dbconfig/20240731-051339-marostegui.json
[05:20:14] !log Starting s8 eqiad failover from db1209 to db1193 - T371368
[05:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:21] T371368: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T371368
[05:20:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T371368', diff saved to https://phabricator.wikimedia.org/P67098 and previous config saved to /var/cache/conftool/dbconfig/20240731-052036-root.json
[05:21:15] !log marostegui@cumin1002 dbctl
commit (dc=all): 'Promote db1193 to s8 primary and set section read-write T371368', diff saved to https://phabricator.wikimedia.org/P67099 and previous config saved to /var/cache/conftool/dbconfig/20240731-052114-root.json
[05:21:33] (CR) Marostegui: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1058150 (https://phabricator.wikimedia.org/T371368) (owner: Gerrit maintenance bot)
[05:22:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1209 T371368', diff saved to https://phabricator.wikimedia.org/P67100 and previous config saved to /var/cache/conftool/dbconfig/20240731-052216-marostegui.json
[05:23:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67101 and previous config saved to /var/cache/conftool/dbconfig/20240731-052308-root.json
[05:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67102 and previous config saved to /var/cache/conftool/dbconfig/20240731-052634-root.json
[05:28:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P67103 and previous config saved to /var/cache/conftool/dbconfig/20240731-052845-marostegui.json
[05:38:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67104 and previous config saved to /var/cache/conftool/dbconfig/20240731-053813-root.json
[05:41:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67105 and previous config saved to /var/cache/conftool/dbconfig/20240731-054140-root.json
[05:43:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T367856)', diff saved to
https://phabricator.wikimedia.org/P67106 and previous config saved to /var/cache/conftool/dbconfig/20240731-054352-marostegui.json
[05:43:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:43:58] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:44:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:44:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T367856)', diff saved to https://phabricator.wikimedia.org/P67107 and previous config saved to /var/cache/conftool/dbconfig/20240731-054414-marostegui.json
[05:45:34] (PS1) Marostegui: db2209: Make it s3 candidate master [puppet] - https://gerrit.wikimedia.org/r/1058291 (https://phabricator.wikimedia.org/T371361)
[05:46:21] (CR) Marostegui: [C:+2] db2209: Make it s3 candidate master [puppet] - https://gerrit.wikimedia.org/r/1058291 (https://phabricator.wikimedia.org/T371361) (owner: Marostegui)
[05:46:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2209 T371361', diff saved to https://phabricator.wikimedia.org/P67108 and previous config saved to /var/cache/conftool/dbconfig/20240731-054653-root.json
[05:46:58] T371361: A6 and D3 have 3 db masters each - https://phabricator.wikimedia.org/T371361
[05:47:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2209.codfw.wmnet with reason: Change binlog format
[05:47:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2209.codfw.wmnet with reason: Change binlog format
[05:50:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Make db2127 vslow and remove it as candidate master T371361', diff saved to https://phabricator.wikimedia.org/P67109 and previous config saved to
/var/cache/conftool/dbconfig/20240731-055004-marostegui.json
[05:52:26] (PS1) Marostegui: db2127: No longer s3 master [puppet] - https://gerrit.wikimedia.org/r/1058292 (https://phabricator.wikimedia.org/T371361)
[05:52:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67110 and previous config saved to /var/cache/conftool/dbconfig/20240731-055256-root.json
[05:53:16] (PS1) Marostegui: Revert "db2203: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1058293
[05:53:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67111 and previous config saved to /var/cache/conftool/dbconfig/20240731-055319-root.json
[05:53:23] (CR) Marostegui: [C:+2] db2127: No longer s3 master [puppet] - https://gerrit.wikimedia.org/r/1058292 (https://phabricator.wikimedia.org/T371361) (owner: Marostegui)
[05:56:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67112 and previous config saved to /var/cache/conftool/dbconfig/20240731-055645-root.json
[05:56:51] (PS1) Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1058294 (https://phabricator.wikimedia.org/T371455)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:38] (PS1) Marostegui: backups: Add backup1012 [puppet] -
https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416)
[06:08:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67113 and previous config saved to /var/cache/conftool/dbconfig/20240731-060802-root.json
[06:08:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67114 and previous config saved to /var/cache/conftool/dbconfig/20240731-060824-root.json
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67115 and previous config saved to /var/cache/conftool/dbconfig/20240731-062308-root.json
[06:23:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67116 and previous config saved to /var/cache/conftool/dbconfig/20240731-062330-root.json
[06:35:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.763s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:36:01] (PS1) Abijeet Patro: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) -
https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455)
[06:37:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[06:37:18] (PS1) Abijeet Patro: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455)
[06:37:32] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.43.0-wmf.16) - https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[06:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67117 and previous config saved to /var/cache/conftool/dbconfig/20240731-063814-root.json
[06:38:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67118 and previous config saved to /var/cache/conftool/dbconfig/20240731-063835-root.json
[06:39:00] hello, I have a couple of patches for backport during the morning UTC window. The CI takes a lot of time to merge the patches. Wondering if we should +2 them right now?
[06:40:12] Amir1, urbanecm ^^ -- there is also a possibility that we might have to revert the patches if they cause a spike in Memcache traffic.
[06:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 3.448s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:44:14] (CR) Marostegui: [C:+2] Revert "db2203: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1058293 (owner: Marostegui)
[06:44:18] (CR) Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.16) - https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[06:44:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67119 and previous config saved to /var/cache/conftool/dbconfig/20240731-064449-root.json
[06:47:25] (CR) Elukey: [C:+2] Release version 0.5.0-1 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1054879 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[06:47:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179 T371132', diff saved to https://phabricator.wikimedia.org/P67120 and previous config saved to /var/cache/conftool/dbconfig/20240731-064752-root.json
[06:47:57] T371132: Provision cookbook not setting serial console and other settings - https://phabricator.wikimedia.org/T371132
[06:48:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[06:48:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[06:48:53] (PS4) Slyngshede: IDP: Switch to CAS 7.0 hosts.
[dns] - https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487)
[06:50:45] !log Upgrading CAS to version 7.0
[06:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:57] (CR) Slyngshede: [C:+2] IDP: Switch to CAS 7.0 hosts. [dns] - https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487) (owner: Slyngshede)
[06:51:12] (CR) Volans: "Possibile typo inline" [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[06:51:15] (CR) CI reject: [V:-1] TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[06:52:26] (CR) CI reject: [V:-1] TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[06:52:27] (PS2) Marostegui: backups: Add backup1012 [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416)
[06:52:46] (CR) Ayounsi: [C:+2] Replace is_private() with ip.is_ipv4_private_use() [software/netbox-extras] - https://gerrit.wikimedia.org/r/1058208 (owner: Ayounsi)
[06:52:53] (CR) Marostegui: backups: Add backup1012 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[06:53:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67121 and previous config saved to /var/cache/conftool/dbconfig/20240731-065320-root.json
[06:53:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 100%: Repooling', diff saved to
https://phabricator.wikimedia.org/P67122 and previous config saved to /var/cache/conftool/dbconfig/20240731-065341-root.json
[06:53:45] (Merged) jenkins-bot: Replace is_private() with ip.is_ipv4_private_use() [software/netbox-extras] - https://gerrit.wikimedia.org/r/1058208 (owner: Ayounsi)
[06:57:28] (CR) Filippo Giunchedi: [C:+1] site: add insetup configs for logging-sd hosts [puppet] - https://gerrit.wikimedia.org/r/1056973 (https://phabricator.wikimedia.org/T370546) (owner: Cwhite)
[06:59:53] (CR) Filippo Giunchedi: "pay-lb2001 and 2002 have the same address, is that expected ?" [puppet] - https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) (owner: Dwisehaupt)
[06:59:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67123 and previous config saved to /var/cache/conftool/dbconfig/20240731-065955-root.json
[07:00:04] Amir1 and Urbanecm: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T0700).
[07:00:04] joelyrookewmde and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] Hi :)
[07:00:55] hello
[07:01:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db1179.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[07:12:02] ah well, the CI for my backport patch is failing :|
[07:14:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1179.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[07:14:37] oh no :( good luck
[07:15:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67124 and previous config saved to /var/cache/conftool/dbconfig/20240731-071500-root.json
[07:16:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T371455
[07:16:35] T371455: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T371455
[07:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2209 with weight 0 T371455', diff saved to https://phabricator.wikimedia.org/P67125 and previous config saved to /var/cache/conftool/dbconfig/20240731-071645-root.json
[07:16:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T371455
[07:17:28] (CR) Marostegui: [C:+2] mariadb: Promote db2209 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1058294 (https://phabricator.wikimedia.org/T371455) (owner: Gerrit maintenance bot)
[07:21:40] (PS4) Alexandros Kosiaris: parsoid-php: remove discovery, conftool, dsh groups [puppet] - https://gerrit.wikimedia.org/r/1058169 (https://phabricator.wikimedia.org/T359387)
[07:21:50] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2221.mgmt.codfw.wmnet with reboot policy GRACEFUL
[07:25:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:03] (CR) Alexandros Kosiaris: [C:+2] parsoid-php: remove discovery, conftool, dsh groups [puppet] - https://gerrit.wikimedia.org/r/1058169 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[07:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67126 and previous config saved to /var/cache/conftool/dbconfig/20240731-073006-root.json
[07:30:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2221.mgmt.codfw.wmnet with reboot policy GRACEFUL
[07:31:26] do I need to do anything to help with my patch being deployed?
[07:32:23] (PS1) Jelto: phabricator: increase timeout for collab blackbox check [puppet] - https://gerrit.wikimedia.org/r/1058557 (https://phabricator.wikimedia.org/T371418)
[07:32:24] (PS1) Jelto: add byteplus to external_clouds_vendors_nets [puppet] - https://gerrit.wikimedia.org/r/1058558 (https://phabricator.wikimedia.org/T371418)
[07:33:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[07:33:50] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10030695 (ayounsi) a:Papaul
[07:34:22] (PS1) Alexandros Kosiaris: parse[12]001: Move them to wikikube workers [puppet] - https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T357392)
[07:35:54] (CR) MVernon: [C:+1] "LGTM (I'm slightly worried by the "manual setup" for most of the backup hosts, but per the task backup-format.cfg should
work)." [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[07:38:32] (CR) Marostegui: "@rcoccioli@wikimedia.org you happy with the ammend?" [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[07:38:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[07:39:10] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'clear' for AS: 64049
[07:39:51] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[07:40:38] (CR) Marostegui: [C:+2] backups: Add backup1012 [puppet] - https://gerrit.wikimedia.org/r/1058295 (https://phabricator.wikimedia.org/T371416) (owner: Marostegui)
[07:41:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 64049
[07:43:36] (PS1) Slavina Stefanova: aptrepo: upgrade k8s components for 1.26 [puppet] - https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246)
[07:45:03] (CR) DCausse: "turns out that flink explicitly report all counters as gauge, use deriv instead of silencing pint (unsure if more appropriate tho)" [alerts] - https://gerrit.wikimedia.org/r/1058176 (owner: DCausse)
[07:45:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67127 and previous config saved to /var/cache/conftool/dbconfig/20240731-074512-root.json
[07:45:27] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Q1:rack/setup/install backup1012 -
https://phabricator.wikimedia.org/T371416#10030710 (Marostegui) >>! In T371416#10029062, @RobH wrote: > @Marostegui, > > Please note there has been a slight change in the workflow for rack...
[07:46:00] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10030715 (Marostegui)
[07:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67128 and previous config saved to /var/cache/conftool/dbconfig/20240731-074643-root.json
[07:49:00] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2222.mgmt.codfw.wmnet with reboot policy GRACEFUL
[07:49:02] (CR) David Caro: [C:+1] "LGTM, let me know when you want it merged and I'll merge" [puppet] - https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246) (owner: Slavina Stefanova)
[07:51:50] (PS2) Alexandros Kosiaris: parse[12]001: Move them to wikikube workers [puppet] - https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T357392)
[07:51:50] (PS1) Alexandros Kosiaris: service: Remove old parsoid-php from catalog [puppet] - https://gerrit.wikimedia.org/r/1058562 (https://phabricator.wikimedia.org/T357392)
[07:53:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[07:53:51] (PS2) Alexandros Kosiaris: service: Remove old parsoid-php from catalog [puppet] - https://gerrit.wikimedia.org/r/1058562 (https://phabricator.wikimedia.org/T359387)
[07:53:53] (PS3) Alexandros Kosiaris: parse[12]001: Move them to wikikube workers [puppet] - https://gerrit.wikimedia.org/r/1058559
(https://phabricator.wikimedia.org/T359387)
[07:55:48] (CR) Alexandros Kosiaris: [C:+2] service: Remove old parsoid-php from catalog [puppet] - https://gerrit.wikimedia.org/r/1058562 (https://phabricator.wikimedia.org/T359387) (owner: Alexandros Kosiaris)
[07:57:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Wikibase] (wmf/1.43.0-wmf.16) - https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) (owner: Joely Rooke WMDE)
[07:57:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2222.mgmt.codfw.wmnet with reboot policy GRACEFUL
[07:58:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:00:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67129 and previous config saved to /var/cache/conftool/dbconfig/20240731-080017-root.json
[08:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67130 and previous config saved to /var/cache/conftool/dbconfig/20240731-080148-root.json
[08:03:05] (PS2) Slyngshede: Permission approval/rejection [software/bitu] - https://gerrit.wikimedia.org/r/1058112
[08:03:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:07:30] (CR) Abijeet Patro: "CI Failure appears to be
related to: https://phabricator.wikimedia.org/T371324" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [08:08:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_parsoid-php.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:09:51] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058280 (owner: 10TrainBranchBot) [08:10:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Switchover s3 [08:10:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Switchover s3 [08:16:33] !log Starting s3 codfw failover from db2205 to db2209 - T371455 [08:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:38] T371455: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T371455 [08:16:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67131 and previous config saved to /var/cache/conftool/dbconfig/20240731-081654-root.json [08:18:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205 T371455', diff saved to https://phabricator.wikimedia.org/P67132 and previous config saved to /var/cache/conftool/dbconfig/20240731-081801-root.json [08:18:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058564 [08:18:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058564 
(owner: 10TrainBranchBot) [08:21:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67133 and previous config saved to /var/cache/conftool/dbconfig/20240731-082127-root.json [08:24:04] (03PS4) 10Alexandros Kosiaris: parse[12]001: Move them to wikikube workers [puppet] - 10https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T359387) [08:25:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:34] (03CR) 10Filippo Giunchedi: [C:03+1] "Thank you for following up! Yeah I'm not sure either if deriv would be more appropriate in this case, I'm open to try though so +1" [alerts] - 10https://gerrit.wikimedia.org/r/1058176 (owner: 10DCausse) [08:26:17] (03CR) 10Filippo Giunchedi: [C:03+1] "Also consider reformatting expressions with newlines and spaces for increased readability, see examples with "expr: |" in this repo" [alerts] - 10https://gerrit.wikimedia.org/r/1058176 (owner: 10DCausse) [08:28:52] (03PS5) 10Alexandros Kosiaris: parse[12]001: Move them to wikikube workers [puppet] - 10https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T359387) [08:32:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67134 and previous config saved to /var/cache/conftool/dbconfig/20240731-083159-root.json [08:36:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67135 and previous config saved to /var/cache/conftool/dbconfig/20240731-083633-root.json [08:38:56] (03PS1) 10Tiziano Fogli: admin: promote tappof to root [puppet] - 
10https://gerrit.wikimedia.org/r/1058565 [08:43:13] (03PS1) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) [08:43:39] (03PS1) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) [08:44:12] (03PS2) 10Dreamy Jazz: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [08:44:26] (03PS2) 10Dreamy Jazz: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [08:44:53] jouncebot: nowandnext [08:44:53] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [08:44:54] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1000) [08:45:39] Intend to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/1058566 to release versions [08:46:19] (03CR) 10Dreamy Jazz: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [08:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67136 and previous config saved to /var/cache/conftool/dbconfig/20240731-084705-root.json [08:47:10] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:47:13] (03CR) 10Dreamy Jazz: [C:03+2] 
Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:50:40] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10030822 (10hashar) I think that one covers the ssh key pairs used by the Jenkins controller to the agent in prod... [08:51:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67137 and previous config saved to /var/cache/conftool/dbconfig/20240731-085138-root.json [08:51:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:51:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:56:22] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:56:26] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:57:52] (03PS2) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) [08:57:58] (03PS3) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 
(https://phabricator.wikimedia.org/T371324) [08:58:00] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [08:58:16] (03CR) 10Dreamy Jazz: [C:03+2] "The +2's are not causing gate-and-submit-wmf" [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:04:06] My +2's don't seem to be triggering gate-and-submit-wmf [09:04:13] Does anyone know why? [09:05:30] (03PS3) 10Dreamy Jazz: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [09:05:34] (03PS3) 10Dreamy Jazz: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [09:06:10] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:06:29] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67138 and previous config saved to /var/cache/conftool/dbconfig/20240731-090643-root.json [09:06:49] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:06:51] (03CR) 10Dreamy Jazz: [C:03+2] 
Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:08:03] (03CR) 10CI reject: [V:04-1] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:09:14] (03CR) 10Filippo Giunchedi: [C:03+1] admin: promote tappof to root [puppet] - 10https://gerrit.wikimedia.org/r/1058565 (owner: 10Tiziano Fogli) [09:09:50] CI is also blocked by WikiBase :( [09:10:21] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:10:24] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:10:38] (03CR) 10CI reject: [V:04-1] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:11:16] Dreamy_Jazz: I see you just a -1 ? [09:11:19] got* [09:11:29] Yes. It is because of https://phabricator.wikimedia.org/T371460 [09:11:30] Dreamy_Jazz: are you gonna report the error? [09:11:39] It is already reported as https://phabricator.wikimedia.org/T371460 [09:12:16] * Lucas_WMDE looks [09:12:36] (03PS1) 10Marostegui: db2220: Make it candidate master for s7 [puppet] - 10https://gerrit.wikimedia.org/r/1058568 (https://phabricator.wikimedia.org/T371361) [09:13:33] If that ticket needs changes to the wmf branches, we have a catch-22 situation where CI would fail for tests in different extensions [09:13:52] So might have to manually +2 on the verified. 
[09:14:11] (03CR) 10Marostegui: [C:03+2] db2220: Make it candidate master for s7 [puppet] - 10https://gerrit.wikimedia.org/r/1058568 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [09:14:25] let’s worry about that once we know what the error is and how to fix it in general [09:14:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2220 T371361', diff saved to https://phabricator.wikimedia.org/P67139 and previous config saved to /var/cache/conftool/dbconfig/20240731-091450-root.json [09:14:56] T371361: A6 and D3 have 3 db masters each - https://phabricator.wikimedia.org/T371361 [09:14:59] and in the meantime, IMHO there’s no point in wasting CI time with additional gate-and-submit attempts for a test failure that doesn’t seem to be flaky [09:15:06] Sure. [09:16:57] (03CR) 10CI reject: [V:04-1] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:17:02] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:17:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Move db2121 to vslow T371361', diff saved to https://phabricator.wikimedia.org/P67140 and previous config saved to /var/cache/conftool/dbconfig/20240731-091706-root.json [09:17:11] (03CR) 10Ayounsi: Arelion IPv6 renumbering (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) (owner: 10Ayounsi) [09:17:14] (03CR) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [09:17:34] (03PS2) 10Ayounsi: Arelion IPv6 renumbering [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) [09:17:45] 
(03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058564 (owner: 10TrainBranchBot) [09:18:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2220.codfw.wmnet with reason: Maintenance [09:18:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2220.codfw.wmnet with reason: Maintenance [09:20:17] (03PS1) 10Marostegui: db2121: Remove from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058570 (https://phabricator.wikimedia.org/T371361) [09:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67141 and previous config saved to /var/cache/conftool/dbconfig/20240731-092039-root.json [09:21:11] (03CR) 10Marostegui: [C:03+2] db2121: Remove from candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1058570 (https://phabricator.wikimedia.org/T371361) (owner: 10Marostegui) [09:21:12] (03CR) 10CI reject: [V:04-1] TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [09:21:35] (03CR) 10CI reject: [V:04-1] TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [09:21:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67142 and previous config saved to /var/cache/conftool/dbconfig/20240731-092149-root.json [09:23:13] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1058571 (https://phabricator.wikimedia.org/T371462) [09:25:28] 
MatmaRex: where in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1057315 did you see the build failure from T371460? the latest errors I see are in the selenium job and look totally different [09:25:29] T371460: Build failures from Wikibase\Repo\Tests\ChangeModification\DispatchChangeVisibilityNotificationJobTest::testHandle and Wikibase\Lib\Tests\Store\Sql\SqlChangeStoreTest::testSaveChange_insert - https://phabricator.wikimedia.org/T371460 [09:25:33] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [09:26:33] Lucas_WMDE: oh. oops [09:26:46] maybe i copied the wrong link. i saw those failures on like 6 different changes [09:27:15] and i just had to copy the one link that had a different problem [09:28:03] Lucas_WMDE: here are other examples: https://gerrit.wikimedia.org/r/q/T371460 [09:32:37] thanks [09:35:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67143 and previous config saved to /var/cache/conftool/dbconfig/20240731-093545-root.json [09:35:53] Dreamy_Jazz: we might be in luck, it looks like the fix might be in MediaModeration ^^ [09:35:58] :D [09:36:06] let me see if I can reproduce it locally now [09:36:07] Lucas_WMDE: i see something weird. i'm looking at the last failure (SqlChangeStoreTest::testSaveChange_insert). i can't tell where the expected timestamp (20230504030201 / 1683169321) is coming from [09:36:19] MatmaRex: I just figured that out, see the last comment [09:36:21] oh, maybe you've got it [09:36:29] I'm pretty sure that the ConvertibleTimestamp should be reset automatically between tests [09:36:37] But maybe something went wrong and it doesn't? 
[09:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67144 and previous config saved to /var/cache/conftool/dbconfig/20240731-093654-root.json [09:40:14] Dreamy_Jazz: yeah, looks like MediaWikiTestCaseTrait::fakeTimestampTearDown() is meant to do that [09:41:14] and the test looks like it should inherit that [09:42:25] ahaa!! [09:42:32] it’s calling that in a data provider [09:42:42] Oh. Oops. [09:42:42] and Wikibase is also using the fake time in a data provider [09:42:49] and data providers aren’t protected by any setup/teardown [09:44:04] So I guess remove `ConvertibleTimestamp::setFakeTime( '20230504030201' );` from MediaModeration first? [09:44:13] That should unblock CI for the master branch I guess [09:44:37] yeah, I just uploaded a patch [09:44:45] and will amend the Wikibase change to try it out with depends-on [09:44:49] (03CR) 10Hnowlan: [C:03+1] parse[12]001: Move them to wikikube workers [puppet] - 10https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [09:44:59] +2d [09:45:25] As I presume that this should pass (considering the master branch just seems to be failing with the Wikibase errors) [09:46:49] oh good, meanwhile the Wikibase change failed with other errors [09:47:09] eh, looks like Wikibase CI doesn’t include MediaModeration anyway [09:47:13] so we wouldn’t have seen the error there [09:50:09] o_O [09:50:11] quo vadis, jouncebot [09:50:26] wb jouncebot [09:50:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67145 and previous config saved to /var/cache/conftool/dbconfig/20240731-095050-root.json [09:51:49] (03CR) 10Alexandros Kosiaris: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [09:51:50] (03CR) 10Alexandros Kosiaris: [C:03+2] parse[12]001: Move them to wikikube workers [puppet] - 10https://gerrit.wikimedia.org/r/1058559 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [09:52:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67146 and previous config saved to /var/cache/conftool/dbconfig/20240731-095200-root.json [09:52:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) (owner: 10Ayounsi) [09:55:21] (03PS2) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) [09:55:26] (03PS1) 10Kevin Bazira: ml-services: staging config for modernized rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058574 (https://phabricator.wikimedia.org/T371465) [09:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db2220', diff saved to https://phabricator.wikimedia.org/P67147 and previous config saved to /var/cache/conftool/dbconfig/20240731-095545-marostegui.json [09:55:51] (03PS4) 10Dreamy Jazz: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) [09:55:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462 [09:55:58] T371462: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T371462 [09:56:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T371462', diff saved to https://phabricator.wikimedia.org/P67148 and previous config saved to 
/var/cache/conftool/dbconfig/20240731-095609-root.json [09:56:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462 [09:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T371462', diff saved to https://phabricator.wikimedia.org/P67149 and previous config saved to /var/cache/conftool/dbconfig/20240731-095640-root.json [09:57:11] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1058571 (https://phabricator.wikimedia.org/T371462) (owner: 10Gerrit maintenance bot) [09:58:39] (03CR) 10Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [09:59:09] (03CR) 10Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1000) [10:03:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:05:22] Lucas_WMDE: Dreamy_Jazz: the thing with setFakeTime() in a test provider seems like an easy mistake to make, and quite annoying to debug. i'm wondering how to prevent that in the future. is there some easy way to e.g. disallow calling it from test providers? or detect and fail the test suite if it was called? 
[10:05:42] ^ [10:06:02] I think that should be done, considering that we want to avoid data providers having access to the global state [10:06:22] the most general solution is probably T332865, though I’m not sure to what extent it can be enforced [10:06:22] T332865: PHPUnit data providers should be simple static functions that return plain data - https://phabricator.wikimedia.org/T332865 [10:06:29] (i haven't found a way to do anything like that in phpunit docs yet) [10:06:42] but even “just” converting the existing data providers ought to help a bit so people won’t copy bad patterns to new tests [10:06:51] We could probably modify `ConvertibleTimestamp` to have some kind of "dont allow fake time" calls setting? [10:06:54] (03PS2) 10Stevemunene: wdqs: create wdqs split pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) [10:07:09] Which is called just before the data provider construction and ended after [10:07:20] one idea that came to mind is having setFakeTime() itself check that the current fake time is not null, and throw an exception if so. since that would usually mean that something failed to clear the fake time after setting it. [10:07:24] Dreamy_Jazz: yeah, if PHPUnit has a hook for that (I don’t know if it does) [10:07:55] er, throw an exception if it is not null [10:08:13] but i have no idea if that would break any existing test code [10:08:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:08:20] Perhaps the exception could be added in the before test methods? 
[10:08:47] Because if the convertible timestamp should be cleared after the test, then it should always be null unless it was set in a place where it doesn't get reset? [10:08:58] hmm, yeah [10:09:29] wouldn't it be enough to do that before the first test only? [10:09:37] (03CR) 10Dreamy Jazz: "It's going to fail until Id616dbac3199dc48ccf2308ad062ac0e256e66f0 is merged." [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:09:47] actually, i guess i don't know in what order everything happens [10:09:59] it'd be harmless anyway [10:10:03] let me try to code that and see [10:10:22] (03PS1) 10MVernon: cluster::management - add s3client profile [puppet] - 10https://gerrit.wikimedia.org/r/1058575 (https://phabricator.wikimedia.org/T279621) [10:10:24] Thanks. [10:10:35] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:10:38] (03CR) 10Dreamy Jazz: [C:03+2] Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:11:08] I've added the fix from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/1058572 into the wmf branches backport. 
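(Note on the guard idea discussed above: checking before each test that no fake time is already set, and failing if one is, can be sketched roughly as follows. This is a language-neutral illustration in Python, not the MediaWiki or PHPUnit code. `FakeClock`, `assert_no_leaked_fake_time`, and `leaky_data_provider` are hypothetical names invented for the sketch. It only assumes what the discussion states: a fake time set inside a data provider is not cleared by any setup/teardown.)

```python
class FakeClock:
    """Hypothetical stand-in for a process-wide fake-time facility."""

    _fake_time = None  # None means "use the real clock"

    @classmethod
    def set_fake_time(cls, value):
        # Install a new fake time and return the one that was set before.
        previous, cls._fake_time = cls._fake_time, value
        return previous

    @classmethod
    def fake_time(cls):
        return cls._fake_time


class LeakedFakeTimeError(RuntimeError):
    """Raised when a fake time is found where none should be set."""


def assert_no_leaked_fake_time():
    # Intended to run before each test. The fake time is always reset, so
    # one leak is reported once instead of failing every later test.
    leaked = FakeClock.set_fake_time(None)
    if leaked is not None:
        raise LeakedFakeTimeError(
            f"fake time {leaked!r} was set outside a test, "
            "for example in a data provider"
        )


def leaky_data_provider():
    # The mistake from the discussion: global state changed while
    # building test data, where no teardown will undo it.
    FakeClock.set_fake_time("20230504030201")
    return [("some case",)]
```

(With this shape, calling `leaky_data_provider()` and then `assert_no_leaked_fake_time()` raises `LeakedFakeTimeError`, so the leak is reported by the first test that runs afterwards rather than showing up as a confusing timestamp mismatch in an unrelated test.)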
[10:11:10] (03PS2) 10MVernon: cluster::management - add s3client profile [puppet] - 10https://gerrit.wikimedia.org/r/1058575 (https://phabricator.wikimedia.org/T279621) [10:11:14] So they should now merge [10:11:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:11:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:13:18] MatmaRex, Dreamy_Jazz: I think tests/phpunit/bootstrap.php could be a place to put some ConvertibleTimestamp::disallowFakeTime() call? [10:13:46] Lucas_WMDE: that probably runs before test providers, no? (i haven't checked) [10:13:54] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@6ef5a7a]: (no justification provided) [10:13:59] I would assume so, yeah [10:14:10] and then re-allow it in… a @beforeClass in MediaWikiTestCaseTrait, I guess [10:14:24] (though doing it per-class is inefficient) [10:14:24] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@6ef5a7a]: (no justification provided) (duration: 00m 30s) [10:14:38] probably worth a separate phab task in any case ^^ [10:14:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462 [10:14:43] T371462: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T371462 [10:14:50] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1058575 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:15:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary 
switchover s7 T371462 [10:18:23] !log revoke docker-registry.discovery.wmnet old certificate from Puppet CA that would expire in a few days. It hasn't been in use since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018251 [10:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:27] Lucas_WMDE: oh, i though that by disallowFakeTime() you meant "throw an exception if fake time is currently set", but you must have meant "throw an exception if fake time is set in the future". that's why my question didn't make sense [10:23:38] MatmaRex: yes, that’s what I meant, sorry [10:25:53] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1001.eqiad.wmnet with OS bullseye [10:25:55] my version is in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1058578 [10:26:11] your version would probably be better, but we'd have to add that method to the library [10:26:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2001.codfw.wmnet with OS bullseye [10:28:03] (03PS2) 10Ayounsi: Add KPN in the list of critical BGP peers [puppet] - 10https://gerrit.wikimedia.org/r/1003367 (https://phabricator.wikimedia.org/T322630) [10:31:01] 10ops-codfw, 06SRE, 06DC-Ops: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468 (10cmooney) 03NEW p:05Triage→03Medium [10:31:31] (03CR) 10Cathal Mooney: [C:03+1] Add KPN in the list of critical BGP peers [puppet] - 10https://gerrit.wikimedia.org/r/1003367 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [10:33:23] !log Starting s7 codfw failover from db2218 to db2220 - T371462 [10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:28] T371462: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T371462 [10:35:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary T371462', diff saved to https://phabricator.wikimedia.org/P67150 and 
previous config saved to /var/cache/conftool/dbconfig/20240731-103513-root.json [10:37:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T371462', diff saved to https://phabricator.wikimedia.org/P67151 and previous config saved to /var/cache/conftool/dbconfig/20240731-103704-marostegui.json [10:37:52] (03Merged) 10jenkins-bot: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058566 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:38:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67152 and previous config saved to /var/cache/conftool/dbconfig/20240731-103811-root.json [10:38:33] (03Merged) 10jenkins-bot: Unblock CI [extensions/MediaModeration] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058567 (https://phabricator.wikimedia.org/T371324) (owner: 10Dreamy Jazz) [10:39:01] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1058566|Unblock CI (T371324)]], [[gerrit:1058567|Unblock CI (T371324)]] [10:39:06] T371324: MediaModeration PHPUnit runs fails after RawMessage code change with "Premature access to service container" or different message text - https://phabricator.wikimedia.org/T371324 [10:39:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1001.eqiad.wmnet with reason: host reimage [10:41:24] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1058566|Unblock CI (T371324)]], [[gerrit:1058567|Unblock CI (T371324)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:41:55] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [10:42:24] (03CR) 10Dreamy Jazz: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:42:29] (03CR) 10Dreamy Jazz: "recheck" 
[extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:42:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1001.eqiad.wmnet with reason: host reimage [10:43:24] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2001.codfw.wmnet with reason: host reimage [10:46:24] (03CR) 10Ayounsi: "Tested manually (and was required) on netbox-dev2003 netbox repo." [puppet] - 10https://gerrit.wikimedia.org/r/1058193 (owner: 10Ayounsi) [10:46:30] !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1058566|Unblock CI (T371324)]], [[gerrit:1058567|Unblock CI (T371324)]] (duration: 07m 29s) [10:46:35] T371324: MediaModeration PHPUnit runs fails after RawMessage code change with "Premature access to service container" or different message text - https://phabricator.wikimedia.org/T371324 [10:46:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2001.codfw.wmnet with reason: host reimage [10:47:31] (03CR) 10Volans: [C:03+1] "shouldn't hurt :)" [puppet] - 10https://gerrit.wikimedia.org/r/1058193 (owner: 10Ayounsi) [10:48:27] CI should now be unblocked :D [10:49:50] abijeet_: Your wmf branch backports should now be passing in CI [10:50:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:51:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 
(https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:51:22] (03PS1) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) [10:51:45] (03CR) 10CI reject: [V:04-1] cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [10:51:46] jouncebot: nowandnext [10:51:47] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1000) [10:51:47] In 0 hour(s) and 8 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1100) [10:52:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 (owner: 10D3r1ck01) [10:52:44] (03PS2) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) [10:53:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67153 and previous config saved to /var/cache/conftool/dbconfig/20240731-105317-root.json [10:54:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058143 (https://phabricator.wikimedia.org/T371364) (owner: 10Dreamy Jazz) [10:55:20] (03CR) 10CI reject: [V:04-1] cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 
(https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [10:56:14] (03PS3) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) [10:58:10] Dreamy_Jazz: are you deploying, please? [10:58:19] Not right now [10:58:25] ack [10:58:34] (03PS1) 10Urbanecm: EventStreamConfig: Re-enable mediawiki_eventbus on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058582 (https://phabricator.wikimedia.org/T371433) [10:58:44] so I'll ship the above then [10:59:06] (03CR) 10Elukey: "LGTM, I added some comments just to better understand, if those are not concerns I'll +1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [10:59:31] 👍 [10:59:36] (03CR) 10Urbanecm: [C:03+2] "fix UBN" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058582 (https://phabricator.wikimedia.org/T371433) (owner: 10Urbanecm) [11:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1100). 
[11:00:13] (03Merged) 10jenkins-bot: EventStreamConfig: Re-enable mediawiki_eventbus on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058582 (https://phabricator.wikimedia.org/T371433) (owner: 10Urbanecm) [11:00:41] (03PS1) 10Máté Szabó: Revert "Produce a limited set of event streams on private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058583 (https://phabricator.wikimedia.org/T371433) [11:00:43] (03PS4) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) [11:00:58] (03CR) 10Elukey: "Tiziano can you confirm that you completed the related goal to obtain root access in https://office.wikimedia.org/wiki/SRE/Training_Checkl" [puppet] - 10https://gerrit.wikimedia.org/r/1058565 (owner: 10Tiziano Fogli) [11:01:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1001.eqiad.wmnet with OS bullseye [11:01:40] (03PS1) 10GergesShamon: [arwiki] Set noindex for namespace user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058584 (https://phabricator.wikimedia.org/T371470) [11:02:27] (03Abandoned) 10Máté Szabó: Revert "Produce a limited set of event streams on private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058583 (https://phabricator.wikimedia.org/T371433) (owner: 10Máté Szabó) [11:03:09] (03CR) 10CI reject: [V:04-1] cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [11:03:39] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1058582|EventStreamConfig: Re-enable mediawiki_eventbus on private wikis (T371433)]] [11:03:40] (03PS5) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) 
[11:03:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058584 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [11:03:49] T371433: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T371433 [11:05:14] Dreamy_Jazz: would you be able to verify you can login to CU wiki at mwdebug? I don't have an unfamiliar device handy :) [11:05:21] Sure. [11:05:23] ty [11:05:51] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1058582|EventStreamConfig: Re-enable mediawiki_eventbus on private wikis (T371433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:05:55] Testing... [11:05:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2001.codfw.wmnet with OS bullseye [11:05:56] Dreamy_Jazz: please go ahead [11:05:57] (03CR) 10Ayounsi: [C:03+2] python_deploy_venv: update submodules URL in case it's needed [puppet] - 10https://gerrit.wikimedia.org/r/1058193 (owner: 10Ayounsi) [11:06:27] Successfully logged in [11:06:35] urbanecm: [11:06:40] yay! [11:07:00] thanking on officewiki works too [11:07:11] !log urbanecm@deploy1003 urbanecm: Continuing with sync [11:07:15] !log klausman@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[11:07:42] (03CR) 10Ayounsi: [C:03+2] Add KPN in the list of critical BGP peers [puppet] - 10https://gerrit.wikimedia.org/r/1003367 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [11:08:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67154 and previous config saved to /var/cache/conftool/dbconfig/20240731-110822-root.json [11:11:03] !log Removing /var/lib/puppet/server/ssl/ca/signed/docker-registry.discovery.wmnet.pem on puppetmaster1001 [11:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:42] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1058582|EventStreamConfig: Re-enable mediawiki_eventbus on private wikis (T371433)]] (duration: 08m 02s) [11:11:47] T371433: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T371433 [11:11:52] okay, hopefully things work in prod now too :) [11:19:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:22:14] the link to the alert for this ^ suggests it already resolved (there is no alert in a.w.o) [11:22:36] https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-parsoid&refresh=1m&var-release=main&var-container_name=All&var-site=&var-kubernetes_pod_name=All suggests flapping, maybe some template edit? 
[11:23:25] nothing out of the ordinary if I zoom out to 24h though [11:23:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67155 and previous config saved to /var/cache/conftool/dbconfig/20240731-112327-root.json [11:23:30] (03CR) 10Volans: "Thanks for the review and comments, replies inline. I'll change the code after your reply" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:24:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:25:45] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: name=parse1001.eqiad.wmnet [11:25:51] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: name=parse1001.eqiad.wmnet [11:27:16] (03PS3) 10Volans: confctl: add native support for RO in conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: 10Giuseppe Lavagetto) [11:27:18] (03PS1) 10Volans: dbctl: add new module to interact with dbctl [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) [11:35:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on dbstore1007.eqiad.wmnet with reason: Long schema change [11:35:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on dbstore1007.eqiad.wmnet with reason: Long schema change [11:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling', diff saved 
to https://phabricator.wikimedia.org/P67156 and previous config saved to /var/cache/conftool/dbconfig/20240731-113833-root.json [11:53:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67158 and previous config saved to /var/cache/conftool/dbconfig/20240731-115338-root.json [11:55:42] !log klausman@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:00:37] (03CR) 10FNegri: [C:04-1] cloudcumins: deploy gitlab token for tofu-infra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [12:01:23] (03PS6) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) [12:01:34] jouncebot: nowandnext [12:01:34] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [12:01:34] In 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1300) [12:01:38] (03CR) 10Arturo Borrero Gonzalez: cloudcumins: deploy gitlab token for tofu-infra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [12:01:39] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [12:02:17] (03CR) 10Dzahn: [C:03+1] phabricator: increase timeout for collab blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1058557 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [12:02:18] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058564 (owner: 
10TrainBranchBot) [12:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058143 (https://phabricator.wikimedia.org/T371364) (owner: 10Dreamy Jazz) [12:02:27] (03PS1) 10Ayounsi: Revert^4 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1058591 [12:03:05] (03Merged) 10jenkins-bot: Grant checkuser-temporary-account-no-preference to suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058143 (https://phabricator.wikimedia.org/T371364) (owner: 10Dreamy Jazz) [12:03:24] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1058143|Grant checkuser-temporary-account-no-preference to suppress group (T371364)]] [12:03:30] T371364: Assign checkuser-temporary-account-no-preference to the suppress group on all wikis - https://phabricator.wikimedia.org/T371364 [12:06:13] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: name=parse2001.codfw.wmnet [12:06:22] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: name=parse2001.codfw.wmnet [12:07:14] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1058143|Grant checkuser-temporary-account-no-preference to suppress group (T371364)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:07:47] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:08:41] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3460/console" [puppet] - 10https://gerrit.wikimedia.org/r/1058557 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [12:08:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67159 and previous config saved to /var/cache/conftool/dbconfig/20240731-120844-root.json [12:10:19] !Log 
Running `mwscript extensions/MediaModeration/maintenance/updateMetrics.php --wiki=commonswiki --verbose` [12:11:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058594 [12:11:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058594 (owner: 10TrainBranchBot) [12:11:46] !log Running `mwscript extensions/MediaModeration/maintenance/updateMetrics.php --wiki=commonswiki --verbose [12:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:21] !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1058143|Grant checkuser-temporary-account-no-preference to suppress group (T371364)]] (duration: 08m 57s) [12:12:26] T371364: Assign checkuser-temporary-account-no-preference to the suppress group on all wikis - https://phabricator.wikimedia.org/T371364 [12:13:10] (03PS2) 10Hnowlan: group0, group1: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) [12:15:20] (03CR) 10Elukey: mysql_legacy: instance improvements (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [12:19:29] Dreamy_Jazz, thanks! [12:22:05] !log klausman@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:25:31] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275 [12:25:41] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [12:27:34] (03Abandoned) 10MVernon: cluster::management - add s3client profile [puppet] - 10https://gerrit.wikimedia.org/r/1058575 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:28:07] (03CR) 10Filippo Giunchedi: [C:03+1] "+1 on the change itself, I don't have the bandwidth rn to implement ensure => absent for benthos though sth definitely needed and I'm happ" [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [12:30:29] !log temporary disabling puppet on cp-ulsfo to test remove benthos from cp4037 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057823) (T370741) [12:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:48] Hi Lucas_WMDE, I have another patch for backport today. CI takes 30 minutes to merge, should we merge the patch in for deployment now? 
[12:31:07] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [12:33:59] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [12:34:23] abijeet: I think we can still wait a bit [12:34:29] maybe ten minutes before the window or so [12:34:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275 [12:34:56] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [12:39:01] !log temporary depooling cp4037 to test remove all Benthos resources (T370741) [12:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:07] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [12:39:10] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [12:40:39] (03CR) 10Fabfur: [C:03+2] hiera:benthos: remove benthos from ulsfo cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [12:44:15] !log klausman@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[12:45:02] (03CR) 10Elukey: [C:03+2] Revert^4 "Netbox 4: point prod service to new servers" [puppet] - 10https://gerrit.wikimedia.org/r/1058591 (owner: 10Ayounsi) [12:46:39] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [12:50:32] !log repool cp4037, haproxy configuration modified to exclude benthos logging (T370741) [12:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:41] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [12:50:47] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [12:52:28] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [12:52:41] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:52:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:52:50] thanks! 
[12:53:06] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4044*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_ [12:53:06] drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [12:53:11] it looks like Translate CI is sufficiently slower than Wikibase CI that we should be able to deploy Wikibase while Translate finishes merging [12:53:50] !log upgrade cp4044 to ATS 9.2.5: T339134 [12:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:55] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [12:54:14] (03CR) 10Kamila Součková: "I'm a little suspicious due to seeing things sitting in the queue and not getting picked up until I click "reset transcode", at which poin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [12:54:22] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=1) Rolling upgrade/restart of Apache Traffic Server on P{cp4044*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp- [12:54:22] drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [12:57:35] !log update debmonitor-server and python3-debmonitor to bookworm-wikimedia - T368744 [12:57:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [reason: pooling after 
cookbook depooled as puppet was disabled] [12:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:50] T368744: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1300). Please do the needful. [13:00:05] Daimona, joelyrookewmde, abijeet, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:15] I can deploy! [13:00:29] I deployed my patch before the window, so nothing for me for this window. [13:00:33] cool! hi again! [13:00:37] hi! [13:00:37] \o/ [13:00:43] o/ [13:00:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058278 (https://phabricator.wikimedia.org/T370938) (owner: 10Daimona Eaytoy) [13:02:08] (03Merged) 10jenkins-bot: beta: Enable invitation lists for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058278 (https://phabricator.wikimedia.org/T370938) (owner: 10Daimona Eaytoy) [13:02:14] (03CR) 10Ayounsi: [C:03+2] Homer wmf-netbox: fix Netbox 4 breaking changes [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:02:45] hm, https://integration.wikimedia.org/ci/view/Beta/ looks shorter than I remember [13:02:52] wasn’t there a separate job for the config? [13:03:10] yeah, https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/ according to my history (now not found) [13:03:35] uhm, no idea [13:03:39] apparently deleted 2022-08-04 according to SAL 🤷 [13:04:13] welp. 
[13:04:23] looks like https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/506746/console includes your config change [13:04:34] so everything’s fine, I’m just out-of-date on how to track beta deployment progress [13:05:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:05:10] yup was checking that, I'm still shocked by the perfect timing of the patch getting merged just before the run began [13:05:15] hehe [13:06:32] guess I'll make some coffee while scap is scapping [13:08:17] scap done, I'm gonna test and see how I managed to break beta today @HouseOfM [13:08:38] I'm sure it's all good :) [13:10:53] let me guess, beta logstash is borked [13:11:42] * Lucas_WMDE sees a nonzero amount of messages in beta logstash [13:11:56] (now, whether they’re useful or in any way complete is a different question…) [13:12:19] yeah on a closer look, it's not broken. 
But still, something seems broken, judging from malformed/truncated log entries [13:12:26] (03Merged) 10jenkins-bot: Fix tracking parameter casing [extensions/Wikibase] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058196 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:12:47] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1058196|Fix tracking parameter casing (T370045)]] [13:13:00] (03PS1) 10Ayounsi: Release v0.7.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058600 [13:13:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10031821 (10Jhancock.wm) 05Open→03Resolved [13:13:12] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [13:13:16] !log running `sudo cumin -b 1 -s300 A:cp-ulsfo 'depool-cdn && sleep 30 && enable-puppet "T370741" && run-puppet-agent && pool-cdn'` (T370741) [13:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:34] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [13:16:01] (03CR) 10Slyngshede: Release v0.7.0 (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058600 (owner: 10Ayounsi) [13:16:33] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, joelyrookewmde: Backport for [[gerrit:1058196|Fix tracking parameter casing (T370045)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:16:53] The new beta host is deploy04, right? [13:17:07] yes [13:17:15] (03CR) 10Ayounsi: Release v0.7.0 (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058600 (owner: 10Ayounsi) [13:17:19] joelyrookewmde: can you test the change with WikimediaDebug? [13:17:21] (03CR) 10Slyngshede: [C:03+1] "Seems reasonable. See minor nit." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058600 (owner: 10Ayounsi) [13:17:34] (03CR) 10Ayounsi: [C:03+2] Release v0.7.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058600 (owner: 10Ayounsi) [13:18:02] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:18:49] not sure - I need to find a sitelink to a wikidata article but not sure if there are any on wikitech [13:19:08] pretty sure Wikitech isn’t a Wikibase client [13:19:11] but some other wiki should work [13:19:16] e.g. testwiki? [13:19:32] https://test.wikipedia.org/wiki/Main_Page has a Wikidata sitelink [13:19:34] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.7.0 - ayounsi@cumin1002 [13:19:56] I found one regarding a gadget [13:20:03] all looks as expected :) [13:20:17] ok, so good to deploy? [13:20:39] yes [13:20:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, joelyrookewmde: Continuing with sync [13:21:32] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:23:59] (03Merged) 10jenkins-bot: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1058496 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:24:06] (03Merged) 10jenkins-bot: TranslatablePage: Store source page ids as string in WAN cache [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1058495 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:24:14] Alright, I found: Error 1728: Cannot load from mysql.proc. 
The table is probably corrupted [13:24:25] (03PS1) 10Ilias Sarantopoulos: httpbb: remove ores-legacy old staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1058602 [13:24:32] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:25:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.7.0 - ayounsi@cumin1002 [13:25:17] !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1058196|Fix tracking parameter casing (T370045)]] (duration: 12m 30s) [13:25:42] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [13:26:39] ouch, that doesn’t sound good [13:26:44] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1058496|TranslatablePage: Store source page ids as string in WAN cache (T366455)]], [[gerrit:1058495|TranslatablePage: Store source page ids as string in WAN cache (T366455)]] [13:27:10] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [13:27:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10031866 (10Jclark-ctr) wikikube-worker1250 1129. # 2 wikikube-worker1251 1131. # 1 wikikube-worker1252 1132. # 5 wikikube-worker1253 1133. # 4 w... [13:27:38] Hi [13:28:30] @Daimona that was triggered by me going to Special:MyInvitationLists FYI [13:28:43] I have a change for review in the UTC late backport window. Is there a current time to work on my change? 
[13:28:55] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1058496|TranslatablePage: Store source page ids as string in WAN cache (T366455)]], [[gerrit:1058495|TranslatablePage: Store source page ids as string in WAN cache (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:29:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3461/console" [puppet] - 10https://gerrit.wikimedia.org/r/1058557 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [13:29:06] Gerges: currently I’m deploying two backports [13:29:12] abijeet: can you test the change? [13:29:31] Ah yeah, I see now. I don't see other errors though, except for a nonsensical message about a class not being found, but that's from this morning and so unrelated [13:29:54] Lucas_WMDE, ok [13:30:11] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1058594 (owner: 10TrainBranchBot) [13:30:11] (03CR) 10Jelto: [V:03+1 C:03+2] phabricator: increase timeout for collab blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1058557 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [13:31:11] Alright so, huh, I don't know what to do with the corrupted table thingy [13:31:18] Lucas_WMDE: Tell me if you have time to review my changes [13:31:36] Gerges: there’s only one change, right? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1058584 ? [13:31:54] yeah, not sure how to proceed… [13:31:56] I can probably deploy that after the current backports, looks reasonable [13:32:27] Lucas_WMDE, test on any of the debug servers? 
[13:32:34] yes [13:32:48] scap backport works again so now testing is again possible on all debug servers ^^ [13:32:54] Yes only one change :) [13:33:33] (03PS2) 10Ottomata: Remove docroot/mediawiki.org/beacon/event/index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055443 (https://phabricator.wikimedia.org/T353817) [13:33:33] (03PS1) 10Ottomata: EventStreamConfig - fix for private wiki streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058603 (https://phabricator.wikimedia.org/T346046) [13:34:00] (03PS2) 10Ilias Sarantopoulos: httpbb: remove ores-legacy old staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1058602 [13:34:20] (03PS2) 10Ottomata: EventStreamConfig - fix for private wiki streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058603 (https://phabricator.wikimedia.org/T346046) [13:34:25] Lucas_WMDE, looks good. [13:34:45] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Continuing with sync [13:34:53] ok, let’s see how it behaves then [13:37:12] (03CR) 10Ebernhardson: [C:03+1] EventStreamConfig - fix for private wiki streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058603 (https://phabricator.wikimedia.org/T346046) (owner: 10Ottomata) [13:37:29] (03CR) 10Andrew Bogott: "For more context about motivation, have a look at https://phabricator.wikimedia.org/T364492. The status quo has never worked properly or " [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [13:37:29] Any DBA around? 
[13:38:28] (03CR) 10Ottomata: [C:03+2] EventStreamConfig - fix for private wiki streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058603 (https://phabricator.wikimedia.org/T346046) (owner: 10Ottomata) [13:39:07] (03Merged) 10jenkins-bot: EventStreamConfig - fix for private wiki streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058603 (https://phabricator.wikimedia.org/T346046) (owner: 10Ottomata) [13:39:10] !log upgrade pdns-recursor to 4.8.8 from 4.8.7 on dns6001 [13:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:18] !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1058496|TranslatablePage: Store source page ids as string in WAN cache (T366455)]], [[gerrit:1058495|TranslatablePage: Store source page ids as string in WAN cache (T366455)]] (duration: 12m 34s) [13:39:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=recdns [reason: pdns-rec upgrade] [13:39:49] abijeet: grafana looks okay so far [13:40:01] Lucas_WMDE, yup, agreed. [13:40:36] Lucas_WMDE: since the window is running, can we ship https://gerrit.wikimedia.org/r/1058603 next? A UBN a few hours ago unfortunately started leaking private wiki details [13:40:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=recdns [reason: [done] pdns-rec upgrade] [13:40:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:41:00] o_O [13:41:08] I’m confused [13:41:10] well, the fix that was applied to the ubn [13:41:21] so it’s currently merged but not deployed? [13:41:33] sorry we just merged it [13:41:45] I guess I can deploy it then [13:41:47] so, yes. [13:41:47] thank you. 
[13:42:00] i was in UBN mode and only just checked to see that there was a window happening now [13:42:16] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]] [13:42:38] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [13:42:39] T371433: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T371433 [13:43:31] thanks Lucas_WMDE [13:43:39] ottomata, ebernhardson: anything to test on WikimediaDebug there? [13:43:46] or should I just enter `y` as soon as scap backport asks me? [13:43:51] Lucas_WMDE: i can test [13:43:55] is it there? [13:43:56] ok, just a moment then [13:44:01] which mwdebug? [13:44:23] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, otto: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:24] now it should be on any mwdebug [13:44:31] k 1 min... [13:45:16] abijeet: looking at https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=pagetranslation&from=now-1h&to=now, the “total hit-good latency” went down a bit, which feels vaguely plausible [13:45:35] the “cache-hit rate” also apparently went down, that one confuses me more – the number of hits should be the same [13:45:42] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:45:42] just a bit less data in each value [13:46:07] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:46:44] (also the decrease in cache-hit rate is a bit too early to be caused by the full rollout, really) [13:46:47] actually, testing is not that easy. anyone got a quick and easy way to access a privatewikis action api? 
[13:46:49] (03PS3) 10Ilias Sarantopoulos: httpbb: remove ores-legacy old staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1058602 [13:47:02] no idea tbh [13:47:07] but isn’t officewiki one of them? [13:47:08] hmm actually maybe i can do from browser [13:47:11] i was doing in curl [13:47:13] trying [13:47:13] ah [13:47:33] https://office.wikimedia.org/wiki/Special:ApiSandbox ? [13:47:46] (just guessing that it has the same $wgArticlePath as everything else – I don’t have access to it ^^) [13:48:30] ty ^ trying [13:49:09] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [13:49:16] Lucas_WMDE: great. LGTM [13:49:17] please proceed [13:49:22] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, otto: Continuing with sync [13:49:25] ok! [13:49:34] jouncebot: next [13:49:34] In 0 hour(s) and 10 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1400) [13:49:57] gonna be pretty tight for Gerges’ change then :S [13:50:10] I think that might not be enough time [13:50:25] Ok [13:50:41] (03PS1) 10Jelto: gitlab: enable throttling for all GitLab instances [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) [13:51:15] Can I test my change if I am not in the UTC late backport window? [13:51:54] I’m not sure what you mean [13:52:48] OK, never mind. 
[13:53:00] (03CR) 10Klausman: [C:03+2] httpbb: remove ores-legacy old staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1058602 (owner: 10Ilias Sarantopoulos) [13:53:48] !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]] (duration: 11m 31s) [13:54:02] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [13:54:03] T371433: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T371433 [13:54:16] Does this task T359815 need to be reviewed by Editing-team? [13:54:17] T359815: Enable Visual Editor on Wikipedia namespace on Armenian Wikipedia - https://phabricator.wikimedia.org/T359815 [13:54:25] Lucas_WMDE, thanks, I'll leave a comment on the task soon [13:54:33] sounds good, thanks [13:54:36] (03PS1) 10Ayounsi: fetch_device_interfaces: get all VC interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058609 [13:54:43] Gerges: I don’t think 6 minutes is enough time to deploy your change now, sorry [13:54:45] I’ll close the window [13:54:50] !log UTC afternoon backport+config window done [13:54:50] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3462/co" [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:25] Ok [13:55:47] Gerges: I think with Trizek’s comment, T359815 is okay to move forward [13:55:53] at least that looks similar to the other recent VE tasks [13:56:10] Ok [13:56:18] the wiki community should just be aware of it, IIUC [13:56:18] Thanks:) [13:58:11] (03PS1) 10Ssingh: hiera: lvs/interfaces: add note about updating VLANs [puppet] - 10https://gerrit.wikimedia.org/r/1058611 [13:58:50] (03CR) 10Ssingh: 
[V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3463/console" [puppet] - 10https://gerrit.wikimedia.org/r/1058611 (owner: 10Ssingh) [13:59:04] (03CR) 10Jelto: [V:03+1] gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [13:59:50] 06SRE, 06Infrastructure-Foundations, 10netops: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501 (10cmooney) 03NEW p:05Triage→03Low [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1400) [14:00:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:55] (03CR) 10Ssingh: [C:03+2] durum: switch ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057951 (owner: 10Dzahn) [14:01:17] (03PS1) 10Cathal Mooney: Example of QoS rules for cloudcephosd [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) [14:01:43] (03CR) 10CI reject: [V:04-1] Example of QoS rules for cloudcephosd [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [14:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:35] (03CR) 10Elukey: [C:03+1] fetch_device_interfaces: get all VC interfaces [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1058609 (owner: 10Ayounsi) [14:06:19] (03CR) 10Volans: "replies inline, addressed comments, ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [14:06:20] (03PS2) 10Volans: mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) [14:09:47] (03CR) 10CDanis: [C:03+1] confctl: add native support for RO in conftool [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: 10Giuseppe Lavagetto) [14:12:41] (03CR) 10CI reject: [V:04-1] mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [14:14:04] (03PS3) 10Volans: mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) [14:16:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10032068 (10Volans) @Papaul what's the timeline for deciding to not reimage anymore buster host so we can just have puppet 7 and solve all problems? 
[14:17:58] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4044*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_ [14:17:58] drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [14:19:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:19:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:20:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2148', diff saved to https://phabricator.wikimedia.org/P67160 and previous config saved to /var/cache/conftool/dbconfig/20240731-141959-marostegui.json [14:20:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4044*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo 
or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp- [14:21:10] drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [14:21:12] !log [done] upgrade cp4044 to ATS 9.2.5: T339134 [14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:49] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [14:26:04] (03CR) 10Elukey: [C:03+1] mysql_legacy: instance improvements (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [14:28:22] (03CR) 10Ayounsi: [C:03+2] Cookbooks: fix Netbox 4 breaking changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:30:04] (03PS4) 10Volans: mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) [14:33:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.83s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:33:17] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2223.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 5%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240731-143340-root.json [14:35:39] (03CR) 10Dzahn: [C:03+1] gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [14:36:11] 
FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:36] (03CR) 10Volans: [C:03+2] "Last PS changed only docstrings to improve the generated documentation, merging." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [14:38:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.85s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:07] (03CR) 10Hnowlan: "I think this is just backlogged queues - although my recent tests have all been eventually successful. 
If a video is chronically failing t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:41:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:35] (03Merged) 10jenkins-bot: mysql_legacy: instance improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/1058225 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [14:45:13] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2223.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:46:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10032189 (10Jclark-ctr) Confirmed: Service Request 195103368 was successfully submitted. 
[14:48:33] (03PS3) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [14:48:34] (03PS4) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) [14:48:34] (03PS1) 10Ayounsi: sync-netbox-hiera: fix mgmt interface list query [cookbooks] - 10https://gerrit.wikimedia.org/r/1058618 [14:48:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67162 and previous config saved to /var/cache/conftool/dbconfig/20240731-144850-root.json [14:48:57] (03CR) 10Ayounsi: "Tested on https://netbox.wikimedia.org/graphql/" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058618 (owner: 10Ayounsi) [14:51:03] (03CR) 10Elukey: [C:03+1] "Looks good but I have zero knowledge about this" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058618 (owner: 10Ayounsi) [14:51:49] (03CR) 10Ayounsi: [C:03+2] sync-netbox-hiera: fix mgmt interface list query [cookbooks] - 10https://gerrit.wikimedia.org/r/1058618 (owner: 10Ayounsi) [14:54:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:31] (03Merged) 10jenkins-bot: sync-netbox-hiera: fix mgmt interface list query [cookbooks] - 10https://gerrit.wikimedia.org/r/1058618 (owner: 10Ayounsi) [14:56:59] (03PS4) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [14:56:59] (03PS5) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) 
[14:59:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:22] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:35] (03CR) 10Elukey: [C:03+1] netbox.netbox-extra: trigger syncdatasource (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:03:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67163 and previous config saved to /var/cache/conftool/dbconfig/20240731-150356-root.json [15:04:09] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2224.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:11:02] !log jgiannelos@deploy1003 Started deploy [restbase/deploy@59a40a0]: (no justification provided) [15:11:37] (03PS1) 10Filippo Giunchedi: WIP? [puppet] - 10https://gerrit.wikimedia.org/r/1058622 [15:11:37] (03PS1) 10Filippo Giunchedi: benthos: revert output batches for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1058623 (https://phabricator.wikimedia.org/T369737) [15:12:36] (03Abandoned) 10Filippo Giunchedi: WIP? 
[puppet] - 10https://gerrit.wikimedia.org/r/1058622 (owner: 10Filippo Giunchedi) [15:12:44] (03PS2) 10Filippo Giunchedi: benthos: revert output batches for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1058623 (https://phabricator.wikimedia.org/T369737) [15:14:09] (03PS5) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [15:14:09] (03PS6) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) [15:14:09] (03PS1) 10Ayounsi: sync-netbox-hiera: VMs status are now lowercase [cookbooks] - 10https://gerrit.wikimedia.org/r/1058625 [15:17:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058625 (owner: 10Ayounsi) [15:17:39] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2224.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67164 and previous config saved to /var/cache/conftool/dbconfig/20240731-151901-root.json [15:19:06] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2225.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:19:48] (03PS2) 10Ayounsi: sync-netbox-hiera: VMs status are now lowercase [cookbooks] - 10https://gerrit.wikimedia.org/r/1058625 [15:19:48] (03PS6) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [15:19:49] (03PS7) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) [15:20:15] (03CR) 10Elukey: [C:03+1] sync-netbox-hiera: VMs status are now lowercase [cookbooks] - 
10https://gerrit.wikimedia.org/r/1058625 (owner: 10Ayounsi) [15:22:38] (03CR) 10Kamila Součková: [C:03+1] "OK, LGTM then, get the popcorn :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:24:14] (03CR) 10Ayounsi: [C:03+2] sync-netbox-hiera: VMs status are now lowercase [cookbooks] - 10https://gerrit.wikimedia.org/r/1058625 (owner: 10Ayounsi) [15:26:50] (03CR) 10Ayounsi: [C:03+2] fetch_device_interfaces: get all VC interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1058609 (owner: 10Ayounsi) [15:27:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2225.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:28:50] (03Merged) 10jenkins-bot: sync-netbox-hiera: VMs status are now lowercase [cookbooks] - 10https://gerrit.wikimedia.org/r/1058625 (owner: 10Ayounsi) [15:28:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2226.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:28:54] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: CR1058609 - ayounsi@cumin1002 [15:30:25] !log jgiannelos@deploy1003 Finished deploy [restbase/deploy@59a40a0]: (no justification provided) (duration: 19m 22s) [15:30:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: CR1058609 - ayounsi@cumin1002 [15:30:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67165 and previous config saved 
to /var/cache/conftool/dbconfig/20240731-153407-root.json [15:34:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:02] (03CR) 10Ottomata: [C:04-1] "-1 for now while we work out some stuff..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055443 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:36:14] (03CR) 10Cwhite: [C:03+1] "🚀" [puppet] - 10https://gerrit.wikimedia.org/r/1058623 (https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [15:36:28] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: revert output batches for webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1058623 (https://phabricator.wikimedia.org/T369737) (owner: 10Filippo Giunchedi) [15:36:48] sukhe: merging your patch too [15:38:51] godog: please do [15:39:20] {{done}} [15:39:29] thanks [15:40:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2226.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:43:36] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudcumins: deploy gitlab token for tofu-infra [puppet] - 10https://gerrit.wikimedia.org/r/1058581 (https://phabricator.wikimedia.org/T370414) (owner: 10Arturo Borrero Gonzalez) [15:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67166 and previous config saved to /var/cache/conftool/dbconfig/20240731-154912-root.json [15:49:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:39] RESOLVED: 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:37] (03PS1) 10CDanis: jaeger: freshen IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058634 [15:51:37] (03PS1) 10CDanis: jaeger: bump lookback window [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058635 [15:51:50] (03PS1) 10Andrew Bogott: Export scratch nfs for cloud-vps project statanalyser [puppet] - 10https://gerrit.wikimedia.org/r/1058636 (https://phabricator.wikimedia.org/T326904) [15:52:43] (03CR) 10Filippo Giunchedi: [C:03+1] jaeger: bump lookback window [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058635 (owner: 10CDanis) [15:52:51] (03CR) 10CDanis: [C:03+2] jaeger: bump lookback window [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058635 (owner: 10CDanis) [15:53:32] (03CR) 10Andrew Bogott: [C:03+2] Export scratch nfs for cloud-vps project statanalyser [puppet] - 10https://gerrit.wikimedia.org/r/1058636 (https://phabricator.wikimedia.org/T326904) (owner: 10Andrew Bogott) [15:53:36] (03CR) 10Filippo Giunchedi: [C:03+1] jaeger: freshen IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058634 (owner: 10CDanis) [15:53:45] (03CR) 10CDanis: [C:03+2] jaeger: freshen IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058634 (owner: 10CDanis) [15:54:42] (03Merged) 10jenkins-bot: jaeger: freshen IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058634 (owner: 10CDanis) [15:54:43] (03Merged) 10jenkins-bot: jaeger: bump lookback window [deployment-charts] - 10https://gerrit.wikimedia.org/r/1058635 (owner: 10CDanis) [15:55:51] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [15:56:03] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:57:24] 
(03PS1) 10DCausse: wdqs: fix monitoring_user_agents for scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1058638 [15:58:17] (03PS13) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [15:58:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2227.mgmt.codfw.wmnet with reboot policy GRACEFUL [16:04:52] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [16:07:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:08:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2227.mgmt.codfw.wmnet with reboot policy GRACEFUL [16:08:43] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host db2228.mgmt.codfw.wmnet with reboot policy GRACEFUL [16:09:14] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [16:17:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2228.mgmt.codfw.wmnet with reboot policy GRACEFUL [16:24:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:27:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T367856)', diff saved to https://phabricator.wikimedia.org/P67167 and previous config saved to /var/cache/conftool/dbconfig/20240731-162712-marostegui.json [16:27:28] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [16:28:10] (03PS2) 10Dwisehaupt: icinga: Add frqueue2003 pay-lb2001 and pay-lb2002 [puppet] - 10https://gerrit.wikimedia.org/r/1058261 
(https://phabricator.wikimedia.org/T369566) [16:29:06] (03CR) 10Dwisehaupt: "Oops, that was a typo. Fixed now." [puppet] - 10https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) (owner: 10Dwisehaupt) [16:29:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10032769 (10VRiley-WMF) a:03VRiley-WMF [16:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:22] (03PS1) 10Elukey: install_server: default to puppet 7 in late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1058641 [16:35:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10032811 (10Volans) @papaul We've debugged this live during the tooling and automation office hours and we think it's a race condition due to the fact that `late_com... 
[16:39:12] (03PS2) 10Elukey: install_server: fix late_command.sh to avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/1058641 (https://phabricator.wikimedia.org/T369654) [16:42:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67168 and previous config saved to /var/cache/conftool/dbconfig/20240731-164219-marostegui.json [16:42:33] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1058641 (https://phabricator.wikimedia.org/T369654) (owner: 10Elukey) [16:43:07] (03CR) 10Elukey: [C:03+2] install_server: fix late_command.sh to avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/1058641 (https://phabricator.wikimedia.org/T369654) (owner: 10Elukey) [16:49:32] (03CR) 10Cathal Mooney: [C:03+1] "as if we'd ever forget that anyway :P" [puppet] - 10https://gerrit.wikimedia.org/r/1058611 (owner: 10Ssingh) [16:50:23] 06SRE, 06collaboration-services, 06serviceops, 10Release-Engineering-Team (Radar): replace production buster deployment servers - https://phabricator.wikimedia.org/T364656#10032916 (10thcipriani) [16:57:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67169 and previous config saved to /var/cache/conftool/dbconfig/20240731-165726-marostegui.json [16:59:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10032997 (10VRiley-WMF) 05Open→03Resolved @Marostegui I have swapped out Disk in slot 0. It looks like it should be good to go. Please let us know if there are any issues with this. Thanks! 
[16:59:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1700) [17:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T367856)', diff saved to https://phabricator.wikimedia.org/P67170 and previous config saved to /var/cache/conftool/dbconfig/20240731-171233-marostegui.json [17:12:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [17:12:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [17:12:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T367856)', diff saved to https://phabricator.wikimedia.org/P67171 and previous config saved to /var/cache/conftool/dbconfig/20240731-171255-marostegui.json [17:12:59] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [17:22:34] (03PS14) 10DCausse: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [17:22:34] (03PS1) 10DCausse: wdqs: drop deprecated hosts hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/1058649 [17:23:33] (03CR) 10DCausse: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [17:27:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10033083 (10VRiley-WMF) a:03VRiley-WMF [17:34:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:01] (03CR) 10DCausse: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: 10Stevemunene) [17:40:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:42:45] (03PS1) 10Ssingh: sre.dns.roll-upgrade-ats: update cookbook (changes below) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 [17:43:09] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: lvs/interfaces: add note about updating VLANs [puppet] - 10https://gerrit.wikimedia.org/r/1058611 (owner: 10Ssingh) [17:46:36] (03CR) 10CI reject: [V:04-1] sre.dns.roll-upgrade-ats: update cookbook (changes below) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 (owner: 10Ssingh) [17:47:55] (03PS2) 10Ssingh: sre.dns.roll-upgrade-ats: update cookbook (changes below) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 [17:49:19] (03CR) 10Ssingh: "test-cookbook -c 1058652 --ps 2 --dry-run 
sre.cdn.roll-upgrade-ats --reason 'testing dry run' --query "P{cp4044*}" --version "9.2.5"" [cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 (owner: 10Ssingh) [17:50:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:23] (03PS1) 10Lucas Werkmeister: p:toolforge::bastion: add deprecated banner [puppet] - 10https://gerrit.wikimedia.org/r/1058654 [17:56:58] (03CR) 10Lucas Werkmeister: p:toolforge::bastion: add deprecated banner (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1058654 (owner: 10Lucas Werkmeister) [17:57:02] (03CR) 10CI reject: [V:04-1] p:toolforge::bastion: add deprecated banner [puppet] - 10https://gerrit.wikimedia.org/r/1058654 (owner: 10Lucas Werkmeister) [18:00:05] brennen and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T1800). [18:00:58] (03CR) 10Lucas Werkmeister: "*gremlin noises*" [puppet] - 10https://gerrit.wikimedia.org/r/1058654 (owner: 10Lucas Werkmeister) [18:01:24] o/ [18:01:35] i'll commence in a few minutes. 
[18:02:21] (03PS2) 10Lucas Werkmeister: p:toolforge::bastion: add deprecated banner [puppet] - 10https://gerrit.wikimedia.org/r/1058654 [18:07:16] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10033207 (10Jhancock.wm) [18:09:36] !log 1.43.0-wmf.16 train (T366961): no current blockers, logs clean, rolling to group1. [18:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:54] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:10:00] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058657 (https://phabricator.wikimedia.org/T366961) [18:10:01] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058657 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:10:43] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058657 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:17:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:17:44] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.16 refs T366961 [18:17:53] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:24:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:24:31] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:25:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - 
https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [18:27:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [18:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:49] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:40:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [18:42:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:46] I'll tweak the phabricator alert tomorrow to be less noisy. It should resolve after some minutes. 
I'm just on my phone at the moment [18:43:00] Thanks, I see that it was a brief spike [18:43:09] However, what was causing all the tcp retransmits? [18:44:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:49] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [18:48:26] db connection failure [18:48:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [18:55:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [18:56:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED [19:00:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:22] RESOLVED: 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:44] 06SRE, 10LPL Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10033367 (10Fuzzy) And yet another law – [[ https://he.wikisource.org/wiki/פקודת_הרוקחים | The Israeli... [19:11:51] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@ea93090]: deploy latest DAGS to analytics Airflow instance. [19:13:21] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@ea93090]: deploy latest DAGS to analytics Airflow instance. (duration: 01m 30s) [19:17:53] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:20:25] (03CR) 10BCornwall: [C:03+1] sre.dns.roll-upgrade-ats: update cookbook (changes below) [cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 (owner: 10Ssingh) [19:20:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:20:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:23:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:23:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED [19:30:26] (03PS1) 10JHathaway: git-sync-upstream: execute the entire script as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1058675 [19:30:39] (03CR) 10Ssingh: "Ready for review from Traffic." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1058652 (owner: 10Ssingh) [19:33:22] (03CR) 10CI reject: [V:04-1] git-sync-upstream: execute the entire script as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (owner: 10JHathaway) [19:34:51] (03PS2) 10JHathaway: git-sync-upstream: execute the entire script as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) [19:35:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [19:35:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:20] (03CR) 10JHathaway: cloud-vps puppetservers: remove use of the 'gitpuppet' user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:41:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED [19:51:11] jouncebot: next [19:51:12] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T2000) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T2000) [20:00:05] xSavitar, Gerges, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:35] \o [20:00:40] o/ [20:01:09] hi ! i can deploy unless someone else in the queue would like to do it? [20:01:13] Here [20:01:52] cjming, you can deploy, I'll be here to test it out :) [20:02:01] alrighty then! [20:02:09] (03PS4) 10D3r1ck01: [wmf-config] Remove trailing slash in SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 [20:02:52] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: old netbox [20:03:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: old netbox [20:03:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 (owner: 10D3r1ck01) [20:03:57] (03Merged) 10jenkins-bot: [wmf-config] Remove trailing slash in SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 (owner: 10D3r1ck01) [20:04:10] (03PS1) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [20:04:14] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1056495|[wmf-config] Remove trailing slash in SSO domain]] [20:06:25] !log cjming@deploy1003 cjming, d3r1ck01: Backport for [[gerrit:1056495|[wmf-config] Remove trailing slash in SSO domain]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:06:28] xSavitar: can i sync? 
[20:06:57] Give me a min to test on beta quickly [20:07:08] sure thing - standing by [20:07:41] yes go ahead. [20:07:46] !log cjming@deploy1003 cjming, d3r1ck01: Continuing with sync [20:09:47] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:12:13] (03PS2) 10GergesShamon: [arwiki] Set noindex for namespace user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058584 (https://phabricator.wikimedia.org/T371470) [20:12:19] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1056495|[wmf-config] Remove trailing slash in SSO domain]] (duration: 08m 04s) [20:12:33] xSavitar: should be live :) [20:13:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058584 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [20:13:42] Gerges: i think i have to run the namespace dupes script after your patch [20:13:54] (03Merged) 10jenkins-bot: [arwiki] Set noindex for namespace user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058584 (https://phabricator.wikimedia.org/T371470) (owner: 10GergesShamon) [20:14:13] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1058584|[arwiki] Set noindex for namespace user (T371470)]] [20:14:15] cjming, confirming. [20:14:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:14:30] T371470: Set noindex for user pages on Arabic Wikipedia - https://phabricator.wikimedia.org/T371470 [20:15:35] cjming, all looks good. Thanks you very much! [20:15:46] *Thank [20:16:07] yay - yw! 
[20:16:18] !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1058584|[arwiki] Set noindex for namespace user (T371470)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [20:16:54] actually maybe i don't need to run namespaceDupes [20:17:09] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:17:25] Gerges: ok to sync? [20:19:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:20:36] cjming: re namespaceDupes, for 1058584? [20:20:51] RhinosF1: yes [20:21:17] cjming: definitely don't need namespaceDupes for that [20:21:29] ah - gtk - thanks for confirming [20:21:50] Gerges: awaiting your confirmation to go live [20:22:06] Only needs to be done if you're adding a namespace, alias or Interwiki where titles might become inaccessible [20:22:22] gotcha - thanks for clarification [20:23:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1250.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:42] Sorry cjming, I'm late, what did I miss? 
[20:23:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1251.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:53] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1252.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1253.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1254.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1255.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1259.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1257.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1258.mgmt.eqiad.wmnet with reboot policy FORCED [20:24:14] hi Gerges: just wanting confirmation that i should sync your patch i.e. 
go live [20:24:31] it's up on test servers if it's testable ^^ [20:26:58] cjming: all fine :) [20:27:09] cool [20:27:11] !log cjming@deploy1003 cjming, gergesshamon: Continuing with sync [20:28:34] 10ops-magru, 06SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554 (10ops-monitoring-bot) 03NEW [20:31:41] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1058584|[arwiki] Set noindex for namespace user (T371470)]] (duration: 17m 28s) [20:31:54] Gerges: should be live :) [20:32:00] T371470: Set noindex for user pages on Arabic Wikipedia - https://phabricator.wikimedia.org/T371470 [20:32:02] Thanks:) [20:32:12] yw! [20:32:37] ebernhardson: rebasing your patch on master [20:33:05] excellent [20:33:25] whoops - getting a merge conflict -- can you resolve? [20:33:30] sure, sec [20:34:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1257.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1257.mgmt.eqiad.wmnet with reboot policy FORCED [20:35:29] (03PS4) 10Ebernhardson: beta: Enable NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) [20:35:46] cjming: patch updated [20:35:54] great - thx [20:36:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:36:44] (03Merged) 10jenkins-bot: beta: Enable NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:37:05] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1055484|beta: Enable NetworkSession extension (T355267)]] [20:37:17] T355267: Add extension NetworkSession to all wmf wikis - 
https://phabricator.wikimedia.org/T355267 [20:39:22] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1055484|beta: Enable NetworkSession extension (T355267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:39:33] ebernhardson: good to sync? [20:39:49] cjming: checking, but it shouldn't have done anything to prod [20:40:15] presumably not but probably prudent to double-check [20:40:17] cjming: yea looks good [20:40:22] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [20:40:24] mostly just looking at Special:Version with debug enabled [20:44:52] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1055484|beta: Enable NetworkSession extension (T355267)]] (duration: 07m 47s) [20:45:04] ebernhardson: hopefully live! [20:45:06] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [20:45:11] cjming: thanks [20:45:17] np [20:45:58] !log end of UTC late backport window [20:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:57] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp7015.magru.wmnet [20:48:56] 10ops-magru, 06Traffic: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10033696 (10BCornwall) 05Open→03In progress p:05Triage→03High [20:49:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1258.mgmt.eqiad.wmnet with reboot policy FORCED [20:49:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1258.mgmt.eqiad.wmnet with reboot policy FORCED [20:50:27] 10ops-magru, 06Traffic: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10033702 (10BCornwall) ` Jul 31 20:19:29 cp7015 kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00) Jul 31 20:19:33 cp7015 kernel: mpt3sas_cm0: log_info(0x31110d00): originator(PL), c... 
[20:52:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1258.mgmt.eqiad.wmnet with reboot policy FORCED
[20:53:19] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1250.mgmt.eqiad.wmnet with reboot policy FORCED
[20:53:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1252.mgmt.eqiad.wmnet with reboot policy FORCED
[20:53:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1251.mgmt.eqiad.wmnet with reboot policy FORCED
[20:53:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1259.mgmt.eqiad.wmnet with reboot policy FORCED
[20:54:09] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1253.mgmt.eqiad.wmnet with reboot policy FORCED
[20:54:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1254.mgmt.eqiad.wmnet with reboot policy FORCED
[20:55:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1255.mgmt.eqiad.wmnet with reboot policy FORCED
[20:55:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED
[20:56:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1257.mgmt.eqiad.wmnet with reboot policy FORCED
[21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240731T2100)
[21:02:11] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:02:47] (CR) LMata: [C:+1] admin: promote tappof to root [puppet] - https://gerrit.wikimedia.org/r/1058565 (owner: Tiziano Fogli)
[21:04:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:06:54] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:09:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:10:00] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp7015.magru.wmnet with reason: T371554
[21:10:07] T371554: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554
[21:10:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp7015.magru.wmnet with reason: T371554
[21:16:01] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@82674dc]: deploy hot airflow analytics dag hot fix T368756
[21:16:06] T368756: Airflow job to orchestrate the emission mechanism - https://phabricator.wikimedia.org/T368756
[21:17:07] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@82674dc]: deploy hot airflow analytics dag hot fix T368756 (duration: 01m 05s)
[21:18:29] ops-magru, SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371559 (ops-monitoring-bot) NEW
[21:18:42] (CR) Dzahn: [C:+1] add byteplus to external_clouds_vendors_nets [puppet] - https://gerrit.wikimedia.org/r/1058558 (https://phabricator.wikimedia.org/T371418) (owner: Jelto)
[21:26:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED
[21:27:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED
[21:27:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED
[21:28:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1256.mgmt.eqiad.wmnet with reboot policy FORCED
[21:50:25] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[21:52:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:04:28] (CR) Cwhite: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1058106 (https://phabricator.wikimedia.org/T371102) (owner: Filippo Giunchedi)
[22:05:58] (CR) Cwhite: [C:+1] "🚀" [puppet] - https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[22:09:05] (CR) Cwhite: [C:-1] "Reapplying -1 - unresolved comments." [puppet] - https://gerrit.wikimedia.org/r/1037571 (owner: JHathaway)
[22:09:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED
[22:10:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED
[22:10:15] (CR) Cwhite: [C:+2] site: add insetup configs for logging-sd hosts [puppet] - https://gerrit.wikimedia.org/r/1056973 (https://phabricator.wikimedia.org/T370546) (owner: Cwhite)
[22:12:45] SRE, Infrastructure-Foundations: Netbox dns record generation not working - https://phabricator.wikimedia.org/T371565 (cmooney) NEW p:Triage→High
[22:15:31] !log pt1979@cumin1002 START - Cookbook sre.dns.netbox
[22:17:43] !log pt1979@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[22:19:49] !log pt1979@cumin1002 START - Cookbook sre.dns.netbox
[22:23:21] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[22:33:09] (PS1) Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566)
[22:33:17] (PS1) Andrew Bogott: rabbitmq: created cinder-specific rabbit user [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256)
[22:33:21] (PS1) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[22:37:32] (PS2) Andrew Bogott: rabbitmq: created cinder-specific rabbit user [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256)
[22:37:32] (PS2) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[22:37:34] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:37:46] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:44:01] (PS1) Andrew Bogott: Fake passwords for cinder rabbitmq user [labs/private] - https://gerrit.wikimedia.org/r/1058711
[22:44:30] (PS3) Andrew Bogott: rabbitmq: create cinder-specific rabbit user [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256)
[22:44:30] (PS3) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[22:45:50] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:45:59] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:48:45] (CR) Andrew Bogott: [V:+2 C:+2] Fake passwords for cinder rabbitmq user [labs/private] - https://gerrit.wikimedia.org/r/1058711 (owner: Andrew Bogott)
[22:49:32] (CR) Dzahn: "is there a ticket about replacing the bastions? User questions like https://phabricator.wikimedia.org/T371556#10033703 are coming in, woul" [puppet] - https://gerrit.wikimedia.org/r/1058654 (owner: Lucas Werkmeister)
[22:50:57] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:51:38] (PS4) Andrew Bogott: rabbitmq: create cinder-specific rabbit user [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256)
[22:51:39] (PS4) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[22:53:33] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[22:56:08] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10034047 (colewhite) a:colewhite→None Puppet is ready.
[22:57:56] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10034051 (Dzahn) @Arnoldokoth Can you add this host to puppet site.pp and partman so we unblock Jhancock.wm early?
[23:01:58] (PS5) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[23:02:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:13:00] (PS6) Andrew Bogott: Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256)
[23:13:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:21:30] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:26:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:27:07] (CR) Andrew Bogott: [C:+2] Switch cinder to the new cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058709 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:29:43] (CR) Lucas Werkmeister: "T314665 is probably the closest one (with its subtask T360488 also having seen some discussion), but it’s not exactly user-facing. But if " [puppet] - https://gerrit.wikimedia.org/r/1058654 (owner: Lucas Werkmeister)
[23:31:00] SRE, collaboration-services, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Seen): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10034118 (Dzahn) ACK! I see the key in `modules/profile/manifests/ci/agent.pp` and of course we could make a ne...
[23:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1058717
[23:38:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1058717 (owner: TrainBranchBot)
[23:57:13] (PS1) Andrew Bogott: Switch cinder to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256)
[23:57:35] (CR) CI reject: [V:-1] Switch cinder to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:57:53] (PS2) Andrew Bogott: Switch trove to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256)
[23:58:16] (CR) CI reject: [V:-1] Switch trove to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[23:58:41] (PS3) Andrew Bogott: Switch trove to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256)