[00:02:55] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1046118 (owner: TrainBranchBot)
[00:05:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:26] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9896544 (phaultfinder)
[00:32:48] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:59:02] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:59:04] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367678 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:59:08] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T367678 (ops-monitoring-bot) NEW
[01:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:14:22] RECOVERY - Host elastic2099 is UP: PING WARNING - Packet loss = 71%, RTA = 115.07 ms
[02:20:44] PROBLEM -
Host elastic2099 is DOWN: PING CRITICAL - Packet loss = 100%
[02:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:38:52] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:43:52] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 40.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:05:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:32:48] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:35:22] (PS1) Stevemunene: [WIP] wdqs: create wdqs split pybal pools [puppet] - https://gerrit.wikimedia.org/r/1046120 (https://phabricator.wikimedia.org/T364368)
[04:39:05] (PS1) Stevemunene: wdqs: microsites for wdqs graph split [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367)
[04:40:54] (PS2)
Stevemunene: wdqs: microsites for wdqs graph split [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367)
[04:42:29] (CR) Giuseppe Lavagetto: [C:+2] Call the test with the image name including tag [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043155 (owner: JMeybohm)
[04:44:30] (Merged) jenkins-bot: Call the test with the image name including tag [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043155 (owner: JMeybohm)
[04:46:22] (CR) Giuseppe Lavagetto: [C:+2] Release 4.0.1 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043601 (owner: JMeybohm)
[04:48:25] (Merged) jenkins-bot: Release 4.0.1 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043601 (owner: JMeybohm)
[05:01:37] (PS1) Giuseppe Lavagetto: Actually change version in setup.py [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1046122
[05:02:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:03:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:03:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[05:03:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T367261)', diff saved to https://phabricator.wikimedia.org/P65054 and previous config saved to /var/cache/conftool/dbconfig/20240617-050324-marostegui.json
[05:03:29] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:07:15] RESOLVED: [2x] MediaWikiHighErrorRate:
Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:07:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:07:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:07:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:07:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:07:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P65055 and previous config saved to /var/cache/conftool/dbconfig/20240617-050756-marostegui.json
[05:08:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:08:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance
[05:08:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance
[05:08:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65056 and previous config saved to /var/cache/conftool/dbconfig/20240617-050849-root.json
[05:08:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T367261)', diff saved to https://phabricator.wikimedia.org/P65057 and previous config saved to /var/cache/conftool/dbconfig/20240617-050852-marostegui.json
[05:08:57] T367261: Rebuild recentchanges table everywhere -
https://phabricator.wikimedia.org/T367261
[05:09:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65058 and previous config saved to /var/cache/conftool/dbconfig/20240617-050944-root.json
[05:12:02] (CR) Giuseppe Lavagetto: [C:+2] Actually change version in setup.py [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1046122 (owner: Giuseppe Lavagetto)
[05:18:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367261)', diff saved to https://phabricator.wikimedia.org/P65059 and previous config saved to /var/cache/conftool/dbconfig/20240617-051805-marostegui.json
[05:18:10] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:21:50] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65060 and previous config saved to /var/cache/conftool/dbconfig/20240617-052355-root.json
[05:24:06] (PS1) Stevemunene: wdqs: add the query main and scholarly roles [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364)
[05:24:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65061 and previous config saved to /var/cache/conftool/dbconfig/20240617-052450-root.json
[05:25:25] (PS1) Giuseppe Lavagetto: Update ci server targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046124
[05:25:25] (PS1) Giuseppe Lavagetto: Remove buster builds [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046125
[05:25:25] (PS1) Giuseppe
Lavagetto: Updating docker-pkg to 4.0.1 [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046126
[05:33:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65062 and previous config saved to /var/cache/conftool/dbconfig/20240617-053312-marostegui.json
[05:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65063 and previous config saved to /var/cache/conftool/dbconfig/20240617-053902-root.json
[05:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65064 and previous config saved to /var/cache/conftool/dbconfig/20240617-053955-root.json
[05:45:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:48:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65065 and previous config saved to /var/cache/conftool/dbconfig/20240617-054819-marostegui.json
[05:48:33] SRE, LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681 (AndyRussG) NEW
[05:50:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:54:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65066 and previous config saved to /var/cache/conftool/dbconfig/20240617-055407-root.json
[05:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:55:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65067 and previous config saved to /var/cache/conftool/dbconfig/20240617-055501-root.json
[05:59:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:03:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T367261)', diff saved to https://phabricator.wikimedia.org/P65068 and previous config saved to /var/cache/conftool/dbconfig/20240617-060326-marostegui.json
[06:03:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance
[06:03:31] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:03:42] !log
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance
[06:03:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:03:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T367261)', diff saved to https://phabricator.wikimedia.org/P65069 and previous config saved to /var/cache/conftool/dbconfig/20240617-060352-marostegui.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:47] (PS1) Ilias Sarantopoulos: ml-services: add dummy articlequality model [deployment-charts] - https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455)
[06:08:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367261)', diff saved to https://phabricator.wikimedia.org/P65070 and previous config saved to /var/cache/conftool/dbconfig/20240617-060812-marostegui.json
[06:09:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65071 and previous config saved to /var/cache/conftool/dbconfig/20240617-060913-root.json
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production -
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65072 and previous config saved to /var/cache/conftool/dbconfig/20240617-061006-root.json
[06:11:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P65073 and previous config saved to /var/cache/conftool/dbconfig/20240617-061105-ladsgroup.json
[06:11:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:23:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65074 and previous config saved to /var/cache/conftool/dbconfig/20240617-062319-marostegui.json
[06:24:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65075 and previous config saved to /var/cache/conftool/dbconfig/20240617-062418-root.json
[06:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65076 and previous config saved to /var/cache/conftool/dbconfig/20240617-062511-root.json
[06:26:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P65077 and previous config saved to /var/cache/conftool/dbconfig/20240617-062612-ladsgroup.json
[06:32:45] (CR) Muehlenhoff: [C:+2] Deprecate system::role for Kafka roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1043607 (owner: Muehlenhoff)
[06:34:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors -
kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:36:50] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:37:48] (CR) Muehlenhoff: [C:+2] Deprecate system::role for remaining mariadb roles [puppet] - https://gerrit.wikimedia.org/r/1043606 (owner: Muehlenhoff)
[06:38:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65078 and previous config saved to /var/cache/conftool/dbconfig/20240617-063826-marostegui.json
[06:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65079 and previous config saved to /var/cache/conftool/dbconfig/20240617-063923-root.json
[06:41:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P65080 and previous config saved to /var/cache/conftool/dbconfig/20240617-064118-ladsgroup.json
[06:44:18] (CR) Hashar: [C:+1] "+1 since I suggested that for doc.wikimedia.org ( T349166 )."
[puppet] - https://gerrit.wikimedia.org/r/1044731 (https://phabricator.wikimedia.org/T367627) (owner: EoghanGaffney)
[06:51:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:51:52] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367261)', diff saved to https://phabricator.wikimedia.org/P65081 and previous config saved to /var/cache/conftool/dbconfig/20240617-065335-marostegui.json
[06:53:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:53:40] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:53:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[06:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T367261)', diff saved to https://phabricator.wikimedia.org/P65082 and previous config saved to /var/cache/conftool/dbconfig/20240617-065357-marostegui.json
[06:56:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P65083 and previous config saved to /var/cache/conftool/dbconfig/20240617-065625-ladsgroup.json
[06:56:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[06:56:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:56:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[06:56:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P65084 and previous config saved to /var/cache/conftool/dbconfig/20240617-065647-ladsgroup.json
[06:59:32] (CR) Muehlenhoff: [C:+2] Remove Pontoon support for Puppet 5 puppet masters [puppet] - https://gerrit.wikimedia.org/r/1043757 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T0700). nyaa~
[07:00:04] Jhs: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367261)', diff saved to https://phabricator.wikimedia.org/P65085 and previous config saved to /var/cache/conftool/dbconfig/20240617-070009-marostegui.json
[07:00:21] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:01:07] (CR) Muehlenhoff: [C:+2] nginx: Drop workaround for history Puppet bug [puppet] - https://gerrit.wikimedia.org/r/1043735 (owner: Muehlenhoff)
[07:01:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:02:35] \o present
[07:03:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:06:14] (PS1) Muehlenhoff: mwmaint: Stop
including profile::openldap::management [puppet] - https://gerrit.wikimedia.org/r/1046318 (https://phabricator.wikimedia.org/T367490)
[07:11:50] urbanecm, Amir1, are you here?
[07:14:42] (CR) Muehlenhoff: [C:-1] codesearch: add support for docker-ce on bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: Dzahn)
[07:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65086 and previous config saved to /var/cache/conftool/dbconfig/20240617-071516-marostegui.json
[07:24:10] (CR) JMeybohm: [C:+1] Update ci server targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046124 (owner: Giuseppe Lavagetto)
[07:24:44] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1046318 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[07:26:34] (CR) JMeybohm: [C:+1] Remove buster builds [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046125 (owner: Giuseppe Lavagetto)
[07:28:16] (CR) JMeybohm: [C:+1] Updating docker-pkg to 4.0.1 [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1046126 (owner: Giuseppe Lavagetto)
[07:30:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65087 and previous config saved to /var/cache/conftool/dbconfig/20240617-073023-marostegui.json
[07:36:04] SRE, LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9896892 (WMDECyn) I approve the request on WMDE's behalf
[07:40:52] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:43:15] FIRING: [2x]
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:43:53] (PS36) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[07:45:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367261)', diff saved to https://phabricator.wikimedia.org/P65088 and previous config saved to /var/cache/conftool/dbconfig/20240617-074530-marostegui.json
[07:45:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance
[07:45:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance
[07:45:36] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:45:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T367261)', diff saved to https://phabricator.wikimedia.org/P65089 and previous config saved to /var/cache/conftool/dbconfig/20240617-074542-marostegui.json
[07:48:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:50:49] (CR) Muehlenhoff: [C:+2] mwmaint: Stop including profile::openldap::management [puppet] - https://gerrit.wikimedia.org/r/1046318 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[07:52:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367261)', diff saved to https://phabricator.wikimedia.org/P65090 and previous config saved to /var/cache/conftool/dbconfig/20240617-075234-marostegui.json
[07:52:39] T367261:
Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:55:18] (PS1) Peter Fischer: Search update pipeline: use dedicated user agents [deployment-charts] - https://gerrit.wikimedia.org/r/1046591 (https://phabricator.wikimedia.org/T362310)
[07:58:37] (PS1) Muehlenhoff: Disable openldap::management timers on mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/1046592 (https://phabricator.wikimedia.org/T367490)
[07:59:10] (CR) Jelto: [C:+2] aptrepo: bump gitlab-runner and gitlab-ce to 17.0 [puppet] - https://gerrit.wikimedia.org/r/1043764 (https://phabricator.wikimedia.org/T365675) (owner: Jelto)
[08:04:12] (CR) DCausse: [C:+1] wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[08:05:47] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P65091 and previous config saved to /var/cache/conftool/dbconfig/20240617-080741-marostegui.json
[08:08:52] (PS1) Brouberol: monitor admin_ng pending changes for dse-k8s-eqiad [alerts] - https://gerrit.wikimedia.org/r/1046593 (https://phabricator.wikimedia.org/T331894)
[08:10:50] (PS1) Muehlenhoff: profile::openldap::management: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1046594 (https://phabricator.wikimedia.org/T367490)
[08:13:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:13:49] (PS1)
Muehlenhoff: Drop ldap-admins access group from mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/1046596 (https://phabricator.wikimedia.org/T367490)
[08:14:33] (CR) Giuseppe Lavagetto: [C:+1] mw-on-k8s: Deploy statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[08:15:11] (CR) Giuseppe Lavagetto: [C:+1] mw-jobrunner: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043705 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[08:15:54] (CR) Giuseppe Lavagetto: [C:+1] "Maybe we will need more replicas here, but let's start with the default 3" [deployment-charts] - https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[08:16:40] (CR) Giuseppe Lavagetto: [C:+1] mw-api-int: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[08:16:50] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:18:05] (CR) Kevin Bazira: ml-services: add dummy articlequality model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[08:18:47] (CR) Slyngshede: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1046594 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:18:55] (CR) Slyngshede: [C:+1] profile::openldap::management: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1046594 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:18:57] ops-codfw, SRE, DC-Ops, serviceops: Moving 1G servers out of rack D4 in
prep of switch migration - https://phabricator.wikimedia.org/T361856#9897053 (Clement_Goubert) Sure, no problem. Rescheduled.
[08:19:03] (CR) Brouberol: dse-k8s-services: Add net-new chart for Airflow (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[08:19:26] (CR) Giuseppe Lavagetto: [C:+1] mw-web: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[08:22:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P65092 and previous config saved to /var/cache/conftool/dbconfig/20240617-082248-marostegui.json
[08:24:50] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:24:53] (PS2) Muehlenhoff: profile::openldap::management: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1046594 (https://phabricator.wikimedia.org/T367490)
[08:25:09] !log brouberol@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-test-eqiad
[08:25:58] (PS1) Urbanecm: Backport to master [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1046597
[08:26:05] (PS2) Clément Goubert: mw-on-k8s: Deploy statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265)
[08:26:05] (PS2) Clément Goubert: mw-jobrunner: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043705 (https://phabricator.wikimedia.org/T365265)
[08:26:05] (PS2) Clément Goubert: mw-api-ext: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043706
(https://phabricator.wikimedia.org/T365265) [08:26:05] (03PS2) 10Clément Goubert: mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) [08:26:06] (03PS2) 10Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) [08:26:29] (03CR) 10Volans: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:28:52] (03CR) 10Muehlenhoff: [C:03+2] profile::openldap::management: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1046594 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:29:36] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [08:32:24] (03CR) 10Muehlenhoff: [C:03+2] Disable openldap::management timers on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1046592 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:32:26] jouncebot: nowandnext [08:32:26] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [08:32:26] In 1 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1000) [08:32:48] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:33:02] on it ^ [08:33:24] (03PS1) 10Urbanecm: throttle: Fix exemption for ongoing course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046599 
[08:33:44] (03CR) 10Peter Fischer: [C:03+2] "Sure, here's a patch for dedicated UAs: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1046591" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [08:33:45] cjming: i presume it is not a very good idea to deploy a fix of a throttle rule? :)) [08:34:48] urbanecm: you can, it will just throw an error at the pull-k8s-image stage [08:34:55] ack, ty [08:35:01] it is down DOWN. [08:35:06] (03CR) 10Urbanecm: [C:03+2] throttle: Fix exemption for ongoing course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046599 (owner: 10Urbanecm) [08:35:07] I can't even reach the management interface [08:35:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046599 (owner: 10Urbanecm) [08:35:44] (03Merged) 10jenkins-bot: throttle: Fix exemption for ongoing course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046599 (owner: 10Urbanecm) [08:36:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1046599|throttle: Fix exemption for ongoing course]] [08:36:27] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897100 (10ayounsi) It's necessary to do the diff on all target devices anyway, so that behavior is fine. For example, if we run `homer "*ulsfo*" commit "foo"` to change a SSH k... 
[08:37:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T367261)', diff saved to https://phabricator.wikimedia.org/P65093 and previous config saved to /var/cache/conftool/dbconfig/20240617-083755-marostegui.json [08:38:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [08:38:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [08:38:04] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [08:40:00] !log powercycling rdb1014 [08:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:43:34] RECOVERY - SSH on rdb1014 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:43:36] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [08:44:12] (03CR) 10Awight: [C:03+1] "This is correct, thanks for the fix! Reference Previews settings should default to enabled and not be conditional on user creation date." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: 10Func) [08:44:34] PROBLEM - Check health of redis instance on 6379 on rdb1014 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 https://wikitech.wikimedia.org/wiki/Redis [08:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:45:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:49:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367261)', diff saved to https://phabricator.wikimedia.org/P65094 and previous config saved to /var/cache/conftool/dbconfig/20240617-084906-marostegui.json [08:49:08] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:12] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [08:49:13] erm [08:49:15] <_joe_> claime: have you set mw2231 as inactive? 
[08:49:17] <_joe_> uh [08:49:19] it just rebooted on me [08:49:21] again [08:49:23] <_joe_> uhm [08:49:27] <_joe_> sigh [08:49:39] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=mw2321.codfw.wmnet [08:49:48] now I have [08:49:52] <_joe_> ok :) [08:50:14] <_joe_> we need to build that dsh file from the k8s api eventually [08:50:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:50:25] 06SRE, 06Infrastructure-Foundations, 07LDAP, 13Patch-For-Review: Split out ldap management from mwmaint - https://phabricator.wikimedia.org/T367490#9897179 (10MoritzMuehlenhoff) 05Open→03Resolved The LDAP management parts have been split off to the new ldap-maint1001/ldap-maint2001 hosts. [08:50:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [08:52:11] (03CR) 10Michael Große: "Should this also have the communityconfiguration-deployment hashtag?" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (owner: 10Urbanecm) [08:52:43] (03CR) 10Urbanecm: "Yep, done!" 
[extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (owner: 10Urbanecm) [08:53:24] !log hardcycling rdb1014 [08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:29] (03PS2) 10Urbanecm: Backport all commits from master [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (https://phabricator.wikimedia.org/T364895) [08:54:53] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9897254 (10ayounsi) Can we move the cables instead of moving the servers ? For example Port 44 to 47 can be used right away at... [08:55:11] (03PS1) 10Muehlenhoff: Add component/jdk21 for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046601 (https://phabricator.wikimedia.org/T367487) [08:55:36] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [08:55:38] PROBLEM - Check health of redis instance on 6379 on rdb1014 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 https://wikitech.wikimedia.org/wiki/Redis [08:56:28] (03CR) 10Michael Große: "those commit IDs don't match what is actually in master. For example 1046018: build: Updating @wikimedia/codex to 1.7.0 | https://gerrit.w" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [08:56:34] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702 (10Clement_Goubert) 03NEW [08:57:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9897295 (10klausman) Tuesday sounds good. 
I'll drain and shutdown the machine on Tuesday 17:00 CEST/15:00 UTC/10:00CDT, does that w... [08:57:35] (03PS2) 10Lucas Werkmeister (WMDE): Check EntitySchemaIsRepo in more hook handlers [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046598 (https://phabricator.wikimedia.org/T363153) [08:57:47] (03CR) 10Muehlenhoff: [C:03+2] Add component/jdk21 for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046601 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [08:58:56] (03CR) 10Lucas Werkmeister (WMDE): "I think it should be possible to backport this without race condition errors: if any bare-metal servers see the new `extension.json` befor" [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046598 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [09:01:16] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897371 (10Volans) I like the last proposal but I was thinking that there is an additional case: 1. apply to this device and ask for the next one unless already cached and appro... [09:01:25] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1046599|throttle: Fix exemption for ongoing course]] (duration: 25m 05s) [09:01:53] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9897373 (10ayounsi) I don't understand why the need to be moved to get upgraded to 10G. If we take for example wikikube-ctrl2001... 
[09:03:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) (owner: 10Sergio Gimeno) [09:03:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043784 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [09:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P65095 and previous config saved to /var/cache/conftool/dbconfig/20240617-090405-marostegui.json [09:04:10] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:04:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P65096 and previous config saved to /var/cache/conftool/dbconfig/20240617-090413-marostegui.json [09:04:32] <_joe_> !log removed damaged AOF file for redis rdb1014-6379, resyncing with primary [09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:38] RECOVERY - Check health of redis instance on 6379 on rdb1014 is OK: OK: REDIS 6.0.16 on 127.0.0.1:6379 has 0 databases (), up 48 seconds https://wikitech.wikimedia.org/wiki/Redis [09:05:34] !log brouberol@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-test-eqiad [09:06:40] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897418 (10ayounsi) Yeah I think it's what I tried to mean with > We can also decide that batch means to silently skip any 
device that have a different diff, to not risk blockin... [09:09:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:11:19] (03CR) 10Klausman: [V:03+2 C:03+2] "Confirmed working now:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [09:12:01] (03CR) 10Klausman: [V:03+2 C:03+2] golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [09:14:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:17:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) (owner: 10Jon Harald Søby) [09:18:51] (03PS1) 10Majavah: P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1046606 [09:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P65097 and previous config saved to /var/cache/conftool/dbconfig/20240617-091912-marostegui.json [09:19:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to 
https://phabricator.wikimedia.org/P65098 and previous config saved to /var/cache/conftool/dbconfig/20240617-091920-marostegui.json [09:19:29] (03PS3) 10JMeybohm: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) [09:20:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2931/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [09:21:19] (03PS2) 10Majavah: P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1046606 [09:21:32] (03PS2) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) [09:22:00] (03CR) 10JMeybohm: Allow to only report images of supported Debian versions (033 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [09:22:54] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2932/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [09:23:17] (03CR) 10JMeybohm: Allow to only report images of supported Debian versions (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [09:23:39] jouncebot: nowandnext [09:23:39] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [09:23:39] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1000) [09:24:34] 10SRE-tools, 10homer, 06Infrastructure-Foundations: 
Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9897529 (10cmooney) For my part I like “3” as set out by Volans above. @ayounsi is your proposal that “batch” would be a valid answer (in addition to yes/no) when presented wit... [09:24:42] (03PS2) 10Majavah: Stop loading OSM i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041742 (https://phabricator.wikimedia.org/T161553) [09:25:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041742 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [09:26:37] (03Merged) 10jenkins-bot: Stop loading OSM i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041742 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [09:26:52] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1041742|Stop loading OSM i18n (T161553)]] [09:26:55] (03PS4) 10JMeybohm: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) [09:26:57] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [09:28:51] for a minute I thought taavi was undeploying Open Source Maps from our infra! :) [09:29:19] (03PS5) 10JMeybohm: Allow to only report images of supported Debian versions [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) [09:30:10] (03PS1) 10Clément Goubert: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 [09:31:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:31:24] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9897565 (10MoritzMuehlenhoff) >>! 
In T367487#9891993, @SLyngshede-WMF wrote: > I've run a test build, Java 21 is a hard requirement, it cannot be older or newer. > Otherwise the overlay upgrade conta... [09:32:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:32:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:32:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:33:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.359 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:33:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:33:36] (03PS1) 10Klausman: golang: Move test.sh back to example.sh and explain why [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 [09:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P65099 and previous config saved to /var/cache/conftool/dbconfig/20240617-093419-marostegui.json [09:34:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T367261)', diff saved to https://phabricator.wikimedia.org/P65100 and previous config saved to /var/cache/conftool/dbconfig/20240617-093427-marostegui.json [09:34:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [09:34:31] (03CR) 10JMeybohm: [C:03+1] "Sounds reasonable, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:34:33] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:34:33] (03CR) 10Volans: [C:04-1] "It needs some adjustments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [09:34:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [09:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:36:14] (03CR) 10JMeybohm: mw-on-k8s: Deploy statsd exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:36:55] (03PS2) 10Klausman: golang: Move test.sh back to example.sh and explain why [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 [09:38:07] (03CR) 10JMeybohm: [V:03+1] mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:38:07] (03CR) 10JMeybohm: [V:03+1] mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:38:07] (03CR) 10JMeybohm: [V:03+1] mw-jobrunner: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043705 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:38:07] (03CR) 10JMeybohm: [V:03+1] mw-api-ext: send statsd data to the exporter [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:40:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance [09:40:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance [09:40:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T367261)', diff saved to https://phabricator.wikimedia.org/P65101 and previous config saved to /var/cache/conftool/dbconfig/20240617-094034-marostegui.json [09:40:39] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:40:44] (03CR) 10Clément Goubert: mw-on-k8s: Deploy statsd exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [09:40:55] (03PS3) 10Klausman: golang: Move test.sh back to example.sh and explain why [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 [09:41:12] (03PS4) 10Klausman: golang: Move test.sh back to example.sh and explain why [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 [09:42:23] (03PS3) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) [09:43:16] (03CR) 10EoghanGaffney: [C:03+2] lists: Allow 'some files vanished' errors in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1044731 (https://phabricator.wikimedia.org/T367627) (owner: 10EoghanGaffney) [09:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367261)', diff saved to https://phabricator.wikimedia.org/P65102 and previous config saved to /var/cache/conftool/dbconfig/20240617-094417-marostegui.json [09:44:50] PROBLEM - MariaDB Replica Lag: s1 on 
clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:48:51] !log taavi@deploy1002 taavi: Backport for [[gerrit:1041742|Stop loading OSM i18n (T161553)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:48:57] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [09:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P65103 and previous config saved to /var/cache/conftool/dbconfig/20240617-094926-marostegui.json [09:49:28] !log taavi@deploy1002 taavi: Continuing with sync [09:49:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:49:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:49:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:49:44] (03CR) 10JMeybohm: [C:03+2] Updating docker-pkg to 4.0.1 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046126 (owner: 10Giuseppe Lavagetto) [09:49:47] (03CR) 10JMeybohm: [C:03+2] Remove buster builds [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046125 (owner: 10Giuseppe Lavagetto) [09:49:49] (03CR) 10JMeybohm: [C:03+2] Update ci server targets [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046124 (owner: 10Giuseppe Lavagetto) [09:50:11] (03CR) 10JMeybohm: [V:03+2 C:03+2] Updating docker-pkg to 4.0.1 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046126 (owner: 10Giuseppe Lavagetto) [09:50:14] (03CR) 10JMeybohm: [V:03+2 C:03+2] Remove buster builds [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046125 (owner: 10Giuseppe 
Lavagetto) [09:50:16] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update ci server targets [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1046124 (owner: 10Giuseppe Lavagetto) [09:51:48] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9897652 (10cmooney) @ayounsi perhaps I was a little quick to conclude all the blocks were assigned, you are correct. The advan... [09:51:57] !log jayme@deploy1002 Started deploy [docker-pkg/deploy@4dbea81]: Update docker-pkg to 4.0.1 [09:52:13] (03CR) 10Vgutierrez: [C:04-1] cloudelastic: enable IPIP for LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking) [09:52:35] !log jayme@deploy1002 Finished deploy [docker-pkg/deploy@4dbea81]: Update docker-pkg to 4.0.1 (duration: 00m 38s) [09:53:24] (03CR) 10Volans: [C:04-1] "If we want to go this way we need changes on the server side first." 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [09:53:42] !log jayme@deploy1002 Started deploy [docker-pkg/deploy@38eb04d]: Update docker-pkg to 4.0.1 [09:54:07] !log jayme@deploy1002 Finished deploy [docker-pkg/deploy@38eb04d]: Update docker-pkg to 4.0.1 (duration: 00m 24s) [09:55:56] (03CR) 10JMeybohm: [C:03+1] "The docker-pkg bug should be fixed now" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 (owner: 10Klausman) [09:58:08] (03CR) 10Muehlenhoff: cli: modify get_distro_name to return the version id (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [09:58:30] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039590 (owner: 10PipelineBot) [09:58:44] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041526 (owner: 10PipelineBot) [09:59:00] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041642 (owner: 10PipelineBot) [09:59:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65104 and previous config saved to /var/cache/conftool/dbconfig/20240617-095924-marostegui.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1000) [10:01:00] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1041742|Stop loading OSM i18n (T161553)]] (duration: 34m 07s) [10:01:04] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [10:01:44] !log brouberol@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-jumbo-eqiad [10:01:58] !log draining and 
cordoning mw2321 - T367702 [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:02] T367702: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702 [10:02:21] (03PS1) 10Muehlenhoff: Add a build hook for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1046612 (https://phabricator.wikimedia.org/T367487) [10:02:56] (03CR) 10CI reject: [V:04-1] Add a build hook for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1046612 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [10:03:36] (03CR) 10FNegri: P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [10:04:23] (03PS3) 10Majavah: P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1046606 [10:04:35] (03CR) 10Majavah: P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [10:07:10] (03PS1) 10Slyngshede: SSH Key mgmt: Ensure that keys are trimmed [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) [10:07:57] (03CR) 10FNegri: [C:03+1] P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [10:08:47] (03CR) 10Majavah: [C:03+2] P:openstack: designate: Only run wmcs-dnsleaks on a single cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [10:08:55] !log Depooling mw2323.codfw.wmnet,mw2324.codfw.wmnet,mw2326.codfw.wmnet,mw2327.codfw.wmnet,mw2328.codfw.wmnet,mw2329.codfw.wmnet for reimage [10:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:04] !log Depooling 
mw2323.codfw.wmnet,mw2324.codfw.wmnet,mw2326.codfw.wmnet,mw2327.codfw.wmnet,mw2328.codfw.wmnet,mw2329.codfw.wmnet for reimage - T351074 [10:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:09] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:10:12] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [10:10:42] (03PS2) 10Muehlenhoff: Add a build hook for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1046612 (https://phabricator.wikimedia.org/T367487) [10:11:15] (03CR) 10FNegri: [C:03+1] "This page should be updated (s/1005/1006/) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks#Debu" [puppet] - 10https://gerrit.wikimedia.org/r/1046606 (owner: 10Majavah) [10:11:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:57] (03PS1) 10Clément Goubert: kubernetes: reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) [10:12:08] (03CR) 10Klausman: [V:03+2 C:03+2] golang: Move test.sh back to example.sh and explain why [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046609 (owner: 10Klausman) [10:13:45] (03CR) 10Muehlenhoff: [C:03+2] Add a build hook for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1046612 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [10:14:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65106 and previous config saved to /var/cache/conftool/dbconfig/20240617-101431-marostegui.json [10:16:33] (03CR) 10Hnowlan: [C:04-1] kubernetes: reimage 6 appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:17:32] !log cgoubert@cumin1002 START - Cookbook 
sre.dns.netbox [10:18:40] (03CR) 10Kamila Součková: [C:03+1] Add records for shellbox-video service [dns] - 10https://gerrit.wikimedia.org/r/1043815 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:20:57] (03PS2) 10Clément Goubert: kubernetes: reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) [10:21:38] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for mw232[3-9] - cgoubert@cumin1002" [10:22:50] (03PS3) 10Clément Goubert: kubernetes: reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) [10:23:59] (03CR) 10Hnowlan: [C:03+1] kubernetes: reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:24:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for mw232[3-9] - cgoubert@cumin1002" [10:24:07] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:24:11] (03CR) 10Clément Goubert: kubernetes: reimage 6 appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:24:46] (03CR) 10Hnowlan: [C:03+2] Add records for shellbox-video service [dns] - 10https://gerrit.wikimedia.org/r/1043815 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:26:13] (03CR) 10Clément Goubert: [C:03+2] kubernetes: reimage 6 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046614 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:26:38] !log restarting db2183, db2184 [10:26:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:59] (03PS1) 10Muehlenhoff: Cleanup pbuilder hooks [puppet] - 10https://gerrit.wikimedia.org/r/1046618 [10:29:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T367261)', diff saved to https://phabricator.wikimedia.org/P65107 and previous config saved to /var/cache/conftool/dbconfig/20240617-102938-marostegui.json [10:29:44] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:31:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2323 to wikikube-worker2003 [10:31:25] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:32:18] (03CR) 10Muehlenhoff: [C:03+2] Cleanup pbuilder hooks [puppet] - 10https://gerrit.wikimedia.org/r/1046618 (owner: 10Muehlenhoff) [10:33:40] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2323 to wikikube-worker2003 - cgoubert@cumin1002" [10:34:23] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2003 [10:34:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2003 [10:34:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2323 to wikikube-worker2003 - cgoubert@cumin1002" [10:34:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:53] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2003 [10:35:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2003 [10:35:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2323 to wikikube-worker2003 
[10:35:34] (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046620 [10:35:41] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2324 to wikikube-worker2004 [10:35:47] (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046620 [10:35:58] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:37:15] !log restarting ms-backup200[12], backup2004-7,11 [10:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:14] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2324 to wikikube-worker2004 - cgoubert@cumin1002" [10:39:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2324 to wikikube-worker2004 - cgoubert@cumin1002" [10:39:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:39:29] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2004 [10:39:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2004 [10:39:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2324 to wikikube-worker2004 [10:40:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2326 to wikikube-worker2007 [10:40:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:43:34] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046621 [10:43:34] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2326 to 
wikikube-worker2007 - cgoubert@cumin1002" [10:44:26] jouncebot: now [10:44:26] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1000) [10:44:31] jouncebot: next [10:44:31] In 2 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1300) [10:45:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2326 to wikikube-worker2007 - cgoubert@cumin1002" [10:45:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:45:27] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2007 [10:45:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2007 [10:45:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2326 to wikikube-worker2007 [10:46:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2327 to wikikube-worker2008 [10:46:37] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:48:12] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [10:49:10] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2327 to wikikube-worker2008 - cgoubert@cumin1002" [10:50:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2321.codfw.wmnet with reason: hardware issue [10:50:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2321.codfw.wmnet with reason: hardware issue [10:50:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2327 to wikikube-worker2008 - cgoubert@cumin1002" [10:50:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:32] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2008 [10:50:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9897889 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5a0a3114-e7df-43df-8946-f917148b1d30) set by cgoubert@cumin1002 f... [10:50:46] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2008 [10:50:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2327 to wikikube-worker2008 [10:51:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2328 to wikikube-worker2009 [10:51:34] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:52:14] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [10:52:24] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9897892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2003.co... 
[10:54:07] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2328 to wikikube-worker2009 - cgoubert@cumin1002" [10:54:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2328 to wikikube-worker2009 - cgoubert@cumin1002" [10:55:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:29] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2009 [10:57:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2009 [10:57:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2328 to wikikube-worker2009 [10:57:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2329 to wikikube-worker2010 [10:58:00] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:59:25] FIRING: [2x] SystemdUnitFailed: ferm.service on kubernetes2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:53] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [11:01:09] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [11:01:21] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) 
wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9897897 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2003.co... [11:03:17] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates" "Wikimedia Foundation/Legal/2023 ToU updates" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:22] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [11:03:31] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2329 to wikikube-worker2010 - cgoubert@cumin1002" [11:06:12] anybody mind if I use this open window for deploy? [11:06:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2329 to wikikube-worker2010 - cgoubert@cumin1002" [11:06:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:34] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2010 [11:07:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2010 [11:07:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2329 to wikikube-worker2010 [11:07:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2003.codfw.wmnet with OS bullseye [11:08:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2004.codfw.wmnet with OS bullseye [11:08:31] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.reimage for host wikikube-worker2007.codfw.wmnet with OS bullseye [11:08:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2008.codfw.wmnet with OS bullseye [11:09:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2009.codfw.wmnet with OS bullseye [11:09:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2010.codfw.wmnet with OS bullseye [11:09:59] mvolz: effie is rebooting k8s nodes in eqiad [11:10:17] ah so definitely not a good time :) [11:10:20] or will be, but sync up with her [11:10:56] effie: would you ping me when you're done? [11:11:06] mvolz: sure, [11:11:08] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/About" "Wikimedia Foundation/Legal/2023 ToU updates/About" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:12] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [11:13:39] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1035.eqiad.wmnet with reason: reimage and move to OVS [11:13:53] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1035.eqiad.wmnet with reason: reimage and move to OVS [11:14:08] (03Abandoned) 10Dreamy Jazz: Blank translation of 'log-name-tag' in az [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040139 (https://phabricator.wikimedia.org/T361695) (owner: 10Dreamy Jazz) [11:14:12] PROBLEM - ircecho bot process on irc2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [11:14:25] RESOLVED: [2x] SystemdUnitFailed: ferm.service on 
kubernetes2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:15:26] (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046620 (owner: 10Kamila Součková) [11:16:03] AppserversUnreachable that's me, should resolve [11:16:03] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1035.eqiad.wmnet with OS bookworm [11:16:11] (03PS1) 10Majavah: hieradata: Move cloudvirt1035 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046630 (https://phabricator.wikimedia.org/T364457) [11:16:58] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage [11:16:59] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1035 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046630 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [11:17:33] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on archiva1002.wikimedia.org with reason: Upgrading to bullseye [11:17:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on archiva1002.wikimedia.org with reason: Upgrading to bullseye [11:17:53] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/LandingCNTranslate" "Wikimedia Foundation/Legal/2023 ToU 
updates/LandingCNTranslate" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:59] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [11:18:32] FIRING: UdpMxIrcEchoThroughput: irc2002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [11:21:03] PROBLEM - Host mw2324 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:45] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/Office hours" "Wikimedia Foundation/Legal/2023 ToU updates/Office hours" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage [11:23:32] RESOLVED: UdpMxIrcEchoThroughput: irc2002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [11:23:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage [11:24:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [11:24:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [11:24:57] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [11:25:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [11:25:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [11:26:05] RECOVERY - Host mw2324 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [11:26:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage [11:26:54] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/Office hours/Announcement" "Wikimedia Foundation/Legal/2023 ToU updates/Office hours/Announcement" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:58] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [11:29:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [11:30:57] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/Office hours/Reminder" "Wikimedia Foundation/Legal/2023 ToU updates/Office hours/Reminder" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [11:34:19] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudvirt1035.eqiad.wmnet with reason: host reimage [11:34:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [11:37:13] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1035.eqiad.wmnet with reason: host reimage [11:37:25] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Case Review Committee" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Case Review Committee" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:31] T367217: Request to move translatable page: Trust and Safety - https://phabricator.wikimedia.org/T367217 [11:39:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [11:43:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [11:44:25] FIRING: SystemdUnitFailed: ferm.service on mw2314:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:13] (03CR) 10Btullis: [C:03+1] "Great, thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/1046593 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:47:41] (03CR) 10Brouberol: [C:03+2] monitor admin_ng pending changes for dse-k8s-eqiad [alerts] - 10https://gerrit.wikimedia.org/r/1046593 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:47:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2003.codfw.wmnet with OS bullseye [11:47:59] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9898064 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2003.codfw.... [11:48:15] (03PS2) 10Ilias Sarantopoulos: ml-services: add dummy articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) [11:48:32] hmm, calico-node issues in codfw [11:49:14] oom [11:49:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2004.codfw.wmnet with OS bullseye [11:49:25] RESOLVED: SystemdUnitFailed: ferm.service on mw2314:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:49] (03CR) 10Ilias Sarantopoulos: ml-services: add dummy articlequality model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [11:50:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - 
https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:51:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2007.codfw.wmnet with OS bullseye [11:51:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2003.codfw.wmnet with OS bullseye [11:52:42] (03PS1) 10Clément Goubert: calico-node: Bump memory to 1Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046635 [11:53:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: archiva [11:54:12] (03PS1) 10Muehlenhoff: Switch archiva to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1046636 (https://phabricator.wikimedia.org/T349619) [11:54:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2009.codfw.wmnet with OS bullseye [11:54:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2003.codfw.wmnet [11:54:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2003.codfw.wmnet [11:55:43] (03CR) 10Muehlenhoff: [C:03+2] Switch archiva to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1046636 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:57:02] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9898086 (10SGupta-WMF) @hnowlan The service does have any api spec , which the... [11:58:46] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9898091 (10SGupta-WMF) @Scott_French The CI pipeline is ready , can you have a... 
[11:58:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2010.codfw.wmnet with OS bullseye [12:00:55] FIRING: [2x] SystemdUnitFailed: ferm.service on mw2314:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:15] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2003.codfw.wmnet [12:01:25] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2003.codfw.wmnet [12:02:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: archiva [12:02:39] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1035.eqiad.wmnet with OS bookworm [12:03:25] (03Abandoned) 10Clément Goubert: calico-node: Bump memory to 1Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046635 (owner: 10Clément Goubert) [12:03:46] !log homer 'cr*codfw*' commit 'T351074' [12:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:50] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:04:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2008.codfw.wmnet with OS bullseye [12:04:58] !log restart db1204, db1205 [12:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:55] FIRING: [3x] SystemdUnitFailed: ferm.service on mw2314:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:52] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9898123 (10MoritzMuehlenhoff) [12:07:01] (03CR) 10Kevin Bazira: "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [12:07:36] PROBLEM - SSH on wikikube-ctrl2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:07:41] (03CR) 10Kevin Bazira: [C:03+1] ml-services: add dummy articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [12:07:47] kamila_: ^ [12:07:57] Missing a downtime maybe? [12:07:59] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15830 [12:08:22] oh, yeah, I keep being too optimistic about how long it takes to reboot the pile of rust [12:08:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 453, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:56] PROBLEM - Host wikikube-ctrl2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:26] it'll be up in a sec, thanks claime [12:09:26] RECOVERY - SSH on wikikube-ctrl2003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:09:28] RECOVERY - Host wikikube-ctrl2003 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [12:09:37] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 15830 [12:10:55] RESOLVED: [2x] SystemdUnitFailed: ferm.service on mw2367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:41] !log pooling and uncordoning wikikube-worker2003.codfw.wmnet wikikube-worker2004.codfw.wmnet wikikube-worker2007.codfw.wmnet wikikube-worker2008.codfw.wmnet wikikube-worker2009.codfw.wmnet wikikube-worker2010.codfw.wmnet - T351074 [12:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:46] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:14:51] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2003.codfw.wmnet|wikikube-worker2004.codfw.wmnet|wikikube-worker2007.codfw.wmnet|wikikube-worker2008.codfw.wmnet|wikikube-worker2009.codfw.wmnet|wikikube-worker2010.codfw.wmnet),cluster=kubernetes,service=kubesvc [12:15:40] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T367736 (10Clement_Goubert) 03NEW [12:17:46] (03PS5) 10Btullis: Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) [12:20:13] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046649 [12:27:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P65108 and previous config saved to /var/cache/conftool/dbconfig/20240617-122700-ladsgroup.json [12:27:06] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:27:21] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898238 (10kamila) @Milimetric I cannot find your signatures on L3 nor NDA, could you please ensure that you 
have signed those? [12:27:48] (03PS9) 10Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) [12:28:20] !log restarting ms-backup100[12], backup1004-7,11 [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:45] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: add dummy articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [12:30:02] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:02] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:52] (03Merged) 10jenkins-bot: ml-services: add dummy articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046133 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [12:33:14] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898255 (10kamila) @Milimetric also, you provided an SSH key fingerprint, we need the public key. 
It should start with something like `ssh-rsa AAAA...` [12:34:00] (03CR) 10Arnaudb: mariadb: bugfixes mysql_legacy (0313 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [12:34:15] !log fetch HAProxy 2.8.10 into thirdparty/haproxy28 component for bullseye-wikimedia (apt.wm.o) [12:34:16] (03CR) 10Arnaudb: mariadb: bugfixes mysql_legacy (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:53] !log upgrading HAProxy to version 2.8.10 on cp4051 [12:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:04] (03PS1) 10Muehlenhoff: irc.w.o: Add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) [12:36:07] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:38:51] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898298 (10MoritzMuehlenhoff) >>! In T365074#9898255, @kamila wrote: > @Milimetric also, you provided an SSH key fingerprint, we need the public key. It should start with somethi... [12:38:53] (03CR) 10DCausse: [C:03+1] Search update pipeline: use dedicated user agents [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046591 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [12:39:56] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898339 (10kamila) >>! In T365074#9898298, @MoritzMuehlenhoff wrote: >>>! In T365074#9898255, @kamila wrote: >> @Milimetric also, you provided an SSH key fingerprint, we need the... 
[12:40:00] (03CR) 10JMeybohm: [C:03+1] tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043748 (owner: 10Hashar) [12:40:49] (03PS1) 10Kamila Součková: Revert "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046660 [12:40:56] (03CR) 10CI reject: [V:04-1] Revert "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046660 (owner: 10Kamila Součková) [12:42:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P65109 and previous config saved to /var/cache/conftool/dbconfig/20240617-124207-ladsgroup.json [12:42:50] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:02] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:02] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:44:50] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:45:02] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:45:02] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:47:36] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [12:53:37] !log upload fifo-log-demux 0.7.5 to apt.wm.o (bullseye-wikimedia) [12:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P65110 and previous config saved to /var/cache/conftool/dbconfig/20240617-125715-ladsgroup.json [12:58:47] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:17] (03PS3) 10Elukey: cli: modify get_distro_name to return the version id [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) [12:59:21] (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1046665 (https://phabricator.wikimedia.org/T364383) [12:59:25] !log joal@deploy1002 Started deploy [airflow-dags/analytics@a8843e6]: (no justification provided) [12:59:29] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@a8843e6]: (no justification provided) (duration: 00m 03s) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1300). [13:00:05] Lucas_WMDE, urbanecm, and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] (03PS1) 10Btullis: Add a Cephx user key for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1046666 (https://phabricator.wikimedia.org/T327259) [13:00:08] 🎁 (present) [13:00:13] i can deploy today [13:00:20] (03PS3) 10Bking: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) [13:00:28] (03PS2) 10Btullis: Add a dummy Cephx user key for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1046666 (https://phabricator.wikimedia.org/T327259) [13:00:39] Jhs: you should've said gift instead :)) [13:00:59] (03PS2) 10Jon Harald Søby: Enable subpages for the main namespace in sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) [13:01:01] (03CR) 10Bking: cloudelastic: enable IPIP for LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking) [13:01:03] (03CR) 10Urbanecm: [C:03+2] Enable subpages for the main namespace in sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) (owner: 10Jon Harald Søby) [13:01:07] Lucas_WMDE: around? 
[13:01:15] (03PS3) 10Sergio Gimeno: CommunityConfiguration: set feedback url instead of bug tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) [13:01:18] (03CR) 10Urbanecm: [C:03+2] CommunityConfiguration: set feedback url instead of bug tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) (owner: 10Sergio Gimeno) [13:01:38] (03PS3) 10Urbanecm: Backport all commits from master [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (https://phabricator.wikimedia.org/T364895) [13:01:39] urbanecm, Boromir_it_is_a_gift.gift [13:01:41] (03CR) 10Urbanecm: [C:03+2] Backport all commits from master [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [13:01:42] (03Merged) 10jenkins-bot: Enable subpages for the main namespace in sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) (owner: 10Jon Harald Søby) [13:01:45] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2933/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046665 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [13:01:50] (03CR) 10Elukey: [C:04-1] "Requires a change in the backend, holding for the moment." 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [13:01:59] (03Merged) 10jenkins-bot: CommunityConfiguration: set feedback url instead of bug tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036613 (https://phabricator.wikimedia.org/T363801) (owner: 10Sergio Gimeno) [13:02:43] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1046116|Enable subpages for the main namespace in sourceswiki (T367674)]], [[gerrit:1036613|CommunityConfiguration: set feedback url instead of bug tool (T363801)]] [13:02:45] * MichaelG_WMF is around and mainly observing [13:02:50] T367674: Let main namespace pages have subpages in the multilingual Wikisource - https://phabricator.wikimedia.org/T367674 [13:02:50] T363801: Bug report url is not rendered in the error message - https://phabricator.wikimedia.org/T363801 [13:03:11] hi MichaelG_WMF! :) [13:03:19] !log disable puppet on A:cp-ulsfo before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046665 - T364383 [13:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:23] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [13:03:39] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1036.eqiad.wmnet with reason: reimage and move to OVS [13:03:52] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1036.eqiad.wmnet with reason: reimage and move to OVS [13:05:01] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1036.eqiad.wmnet with OS bookworm [13:05:10] (03PS1) 10Majavah: hieradata: Move cloudvirt1036 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046667 (https://phabricator.wikimedia.org/T364457) [13:05:11] mvolz: you may deploy as well after folks are done deploying [13:05:21] tnx! 
[13:05:52] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1036 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046667 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [13:05:54] o/ [13:05:56] sorry for the delay [13:06:26] urbanecm: are you deploying already? [13:06:30] correct [13:06:32] ok :) [13:06:46] (03CR) 10Urbanecm: [C:03+2] Check EntitySchemaIsRepo in more hook handlers [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046598 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:06:47] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747 (10Kgraessle) 03NEW [13:06:56] note, I might have to retract the config change [13:07:02] (but the backport is okay to deploy anyway) [13:07:28] (03CR) 10Volans: "replies inline with my 2 cents" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [13:07:44] (03PS1) 10Majavah: sre.hosts.reimage: Only print 'starting reimage' when it starts [cookbooks] - 10https://gerrit.wikimedia.org/r/1046668 [13:07:50] !log urbanecm@deploy1002 urbanecm, jhsoby, sgimeno: Backport for [[gerrit:1046116|Enable subpages for the main namespace in sourceswiki (T367674)]], [[gerrit:1036613|CommunityConfiguration: set feedback url instead of bug tool (T363801)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:55] T367674: Let main namespace pages have subpages in the multilingual Wikisource - https://phabricator.wikimedia.org/T367674 [13:07:56] T363801: Bug report url is not rendered in the error message - https://phabricator.wikimedia.org/T363801 [13:08:03] Jhs: can you test your patch at mwdebug, please? 
[13:08:04] (03PS2) 10JMeybohm: helmfile_psp: Remove seccomp/apparmor mutations from PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [13:08:12] grmbl, why display no work [13:08:17] urbanecm, tested, everything looks as expected 👍 [13:08:21] might have to reboot -.- [13:08:23] that was quick :) [13:08:28] (03PS15) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) [13:08:51] urbanecm, yeah, i already tested all 3 things i needed to check when i saw your message 😅 [13:08:57] hehe [13:09:58] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Set fifo-log-demux prometheus port for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1046665 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [13:10:02] I think I have to reboot… hopefully I’ll be back before the gate-and-submit for my backport finishes [13:10:10] otherwise feel free to continue with other deployments anyway :) [13:10:29] Lucas_WMDE: good luck bringing your display back! 
[13:10:35] !log urbanecm@deploy1002 urbanecm, jhsoby, sgimeno: Continuing with sync [13:11:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking) [13:11:33] (03CR) 10Volans: sre.hosts.reimage: Only print 'starting reimage' when it starts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1046668 (owner: 10Majavah) [13:12:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P65111 and previous config saved to /var/cache/conftool/dbconfig/20240617-131222-ladsgroup.json [13:12:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [13:12:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [13:12:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:13:22] * Lucas_WMDE back [13:13:28] (successfully \o/) [13:13:32] !log rolling upgrade on A:cp-ulsfo to fifo-log-demux 0.7.5 - T364383 [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:36] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [13:13:44] welcome back Lucas_WMDE ! 
[13:14:21] !log brouberol@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-jumbo-eqiad [13:14:39] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2001.codfw.wmnet [13:14:44] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2002.codfw.wmnet [13:17:40] (03CR) 10Lucas Werkmeister (WMDE): [C:04-2] "Don’t deploy yet – we’re probably announcing this as a breaking change with two-week advance notice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [13:17:54] (03CR) 10Btullis: [V:03+2 C:03+2] Add a dummy Cephx user key for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1046666 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:18:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9898559 (10Jclark-ctr) 05Open→03Resolved [13:20:25] (03PS1) 10Muehlenhoff: Remove account end data/contact for tandic [puppet] - 10https://gerrit.wikimedia.org/r/1046672 [13:21:34] okay... "UPGRADE FAILED: http2: client connection lost" is probably not what i should expect to see [13:22:18] Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: Get "https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/mw-jobrunner/services/mediawiki-main-tls-service": http2: client connection lost [13:22:20] this is the full error [13:22:39] claime: effie: any idea what's happening? 
[13:23:20] urbanecm: I think this might be related with some kubemasters work happening on codfw [13:23:24] (03Merged) 10jenkins-bot: Backport all commits from master [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046597 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [13:23:56] urbanecm: what is the current status? [13:24:10] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [13:24:21] effie: scap printed the above-quoted error during syncing to k8s, and happily processes further [13:24:39] this is it https://usercontent.irccloud-cdn.com/file/sAVDMtAd/image.png [13:25:13] ok it is rolling back [13:25:34] oh, "rollback completed". should've been yellow, probably [13:25:44] so...i rerun, and we should be good? [13:25:51] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1046116|Enable subpages for the main namespace in sourceswiki (T367674)]], [[gerrit:1036613|CommunityConfiguration: set feedback url instead of bug tool (T363801)]] (duration: 23m 07s) [13:25:57] T367674: Let main namespace pages have subpages in the multilingual Wikisource - https://phabricator.wikimedia.org/T367674 [13:25:57] T363801: Bug report url is not rendered in the error message - https://phabricator.wikimedia.org/T363801 [13:26:17] urbanecm: let me check something [13:26:20] okay [13:26:56] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1036.eqiad.wmnet with reason: host reimage [13:27:02] claime: what are you thinking? [13:27:30] effie: just that it's not worth doing a whole scap rerun if it's only mw-jobrunner [13:28:05] claime: the patch is definitely rolled back for webserver. i do not see its effect at test.wikipedia.org, for example.
[13:28:23] ah [13:28:26] but, i'll be scap syncing more stuff, so we can also do everything at once instead of rerunning [13:28:29] then rerun, by all means [13:28:31] claime: it is parsoid on the screenshot [13:28:36] cool yes +1 [13:28:53] urbanecm: just check with kamila_ what is the state of kubemasters atm [13:28:53] err [13:28:59] there's something strange then [13:29:08] because the helmfile output is for mw-jobrunner [13:29:34] (03PS1) 10Fabfur: hiera: install haproxy 2.8.10 on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1046674 (https://phabricator.wikimedia.org/T367756) [13:29:35] ah it probably failed both [13:29:35] i was more wondering what's the codfw work effie mentioned, and whether it is likely i run into the same issue again. [13:29:41] I don't think anything I've done should be disruptive, if it is, something has gone wrong :D [13:29:58] claime: iirc they are running in parallel right ? [13:30:03] so both may [13:30:22] effie: yeah [13:30:22] there is another failed output for parsoid, indeed [13:30:46] kamila_: 2001/2003 are pooled inactive, 2002 is pooled, is that the expected state? 
[13:30:51] (03PS2) 10Fabfur: hiera: install haproxy 2.8.10 on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1046674 (https://phabricator.wikimedia.org/T367756) [13:31:09] (03CR) 10Elukey: "Left some comments only related to DRY some values that are repeated across multiple classes (basically keeping the netbox class as source" [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:31:22] sorry [13:31:25] I think 2003 should pooled and 1-2 inactive [13:31:27] 2001 and 2003 inactive [13:31:37] god dammit fingers [13:31:43] 2001 and 2002 inactive [13:31:44] (03PS1) 10Ssingh: conftool-data: add ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046675 (https://phabricator.wikimedia.org/T366360) [13:31:48] 2003 pooled [13:31:57] claime: 2001/2 I just pooled inactive in preparation of making them go offline [13:31:58] claime: what does get nodes tell us ? [13:32:10] (03PS1) 10Majavah: hieradata: Move cloudvirt-wdqs* to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1046676 (https://phabricator.wikimedia.org/T364457) [13:32:14] (I have not ssh-ed yet) [13:32:16] but I haven't decommed them yet, doing that now, just need to clean up DNS SRV [13:32:28] ahah [13:32:35] scrolling through the output a bit more, scap also failed to connect to mw2321. which...should be inactive according to https://sal.toolforge.org/log/nitjJZABhuQtenzvP2ig? 
[13:32:36] 2001 and 2002 aren't cordoned, and 2003 is [13:32:39] fixing that [13:32:45] (03Merged) 10jenkins-bot: Check EntitySchemaIsRepo in more hook handlers [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046598 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:33:04] oh, I guess uncordon wasn't in my checklist, thanks claime '^^ [13:33:08] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [13:33:20] (03CR) 10Elukey: cli: modify get_distro_name to return the version id (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [13:33:22] though I'm not sure if cordoning does anything for these [13:33:36] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [13:33:51] !log Uncordoned wikikube-ctrl2003.codfw.wmnet [13:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm [13:34:02] !log Drained and cordoned wikikube-ctrl2001.codfw.wmnet wikikube-ctrl2002.codfw.wmnet [13:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:14] thank you claime [13:34:37] urbanecm: yeah, except we can't remove it from dsh easily, you can ignore that error [13:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:34:51] (it's hardware borked) [13:35:01] claime: okay. i thought inactive is supposed to remove it from dsh? 
maybe i'm misremembering :) [13:35:11] urbanecm: not for the k8s-pull thing [13:35:14] ah [13:35:34] but any errors to k8s-pull can be ignored as long as it's not like half the fleet [13:35:37] (03CR) 10Elukey: [C:03+2] "Thanks :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:35:38] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046674 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [13:35:39] (03CR) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [13:35:39] noted :) [13:35:42] so...safe to continue for me? or still something to do/check? [13:35:48] (03PS4) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 [13:35:54] it just means that they'll pull the image when they startup the containers instead of in advance [13:36:02] urbanecm: yeah go ahead for me [13:36:38] * kamila_ will wait with the decoms just in case [13:37:19] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1046597|Backport all commits from master (T364895)]], [[gerrit:1046598|Check EntitySchemaIsRepo in more hook handlers (T363153)]] [13:37:24] i'll continue with the next patches, and let the k8s portion of the fleet re-sync from that [13:37:25] T364895: Enable CommunityConfiguration on pilot wikis: Arabic & Spanish Wikipedia: Jun 17, 2024 - https://phabricator.wikimedia.org/T364895 [13:37:25] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:37:45] kamila_: do you want me to ping you once done with my deployments? 
[13:37:57] urbanecm: that'd be great, thanks [13:37:59] will do [13:38:04] <3 [13:39:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:39:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:39:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P65112 and previous config saved to /var/cache/conftool/dbconfig/20240617-133951-ladsgroup.json [13:39:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:40:04] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046678 [13:42:13] (03PS1) 10Bking: elasticsearch: add Search Platform and DPE SRE as alert recipients [puppet] - 10https://gerrit.wikimedia.org/r/1046679 (https://phabricator.wikimedia.org/T367435) [13:42:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046679 (https://phabricator.wikimedia.org/T367435) (owner: 10Bking) [13:43:08] !log urbanecm@deploy1002 lucaswerkmeister-wmde, urbanecm: Backport for [[gerrit:1046597|Backport all commits from master (T364895)]], [[gerrit:1046598|Check EntitySchemaIsRepo in more hook handlers (T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:08] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [13:43:14] T364895: Enable CommunityConfiguration on pilot wikis: Arabic & Spanish Wikipedia: Jun 17, 2024 - https://phabricator.wikimedia.org/T364895 [13:43:14] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:43:22] * Lucas_WMDE looks [13:43:24] !log urbanecm@deploy1002 Sync 
cancelled. [13:43:29] oh? [13:43:45] i didn't cancel anything... [13:43:48] rerunning [13:43:48] (I didn’t fully follow the discussion) [13:43:49] hm [13:43:50] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [13:43:57] Lucas_WMDE: we should be good to continue [13:43:59] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [13:44:06] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1046597|Backport all commits from master (T364895)]], [[gerrit:1046598|Check EntitySchemaIsRepo in more hook handlers (T363153)]] [13:44:10] scap just cancelled the deployment for no apparent reason [13:44:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9898750 (10Jhancock.wm) [13:45:33] weird [13:45:36] !log brouberol@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch [13:45:40] it has happened to me before [13:46:08] maybe scrolling left some data in the input stream for scap, and it decided to interpret it as "abort" [13:48:04] (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for codfw [puppet] - 10https://gerrit.wikimedia.org/r/1046681 (https://phabricator.wikimedia.org/T364383) [13:48:16] Lucas_WMDE: what about your config change? should it be included at the end? 
:) [13:48:29] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [13:48:37] !log urbanecm@deploy1002 urbanecm, lucaswerkmeister-wmde: Backport for [[gerrit:1046597|Backport all commits from master (T364895)]], [[gerrit:1046598|Check EntitySchemaIsRepo in more hook handlers (T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:48:38] it should not, no [13:48:42] ack [13:48:43] T364895: Enable CommunityConfiguration on pilot wikis: Arabic & Spanish Wikipedia: Jun 17, 2024 - https://phabricator.wikimedia.org/T364895 [13:48:43] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:48:48] Lucas_WMDE: can you check your backport then? [13:48:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046681 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [13:48:55] we decided it’s a breaking change after all, so we need to announce it first and wait two weeks :/ [13:48:55] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage [13:48:57] urbanecm: can do [13:49:03] not that much to test but I can check that nothing obvious broke [13:49:23] ack [13:49:28] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:49:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9898763 (10elukey) @Jhancock.wm I'd ask another favor when you have a moment. Could you send me over email a picture of the label of one of the Supermicro nodes? We are... 
[13:49:48] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1046672 (owner: 10Muehlenhoff) [13:50:19] urbanecm: looks good as far as I can tell [13:50:23] good [13:50:34] !log urbanecm@deploy1002 urbanecm, lucaswerkmeister-wmde: Continuing with sync [13:50:38] let's go ahead then [13:50:43] 10SRE-tools, 10Dumps-Generation, 06Infrastructure-Foundations, 06serviceops, 07IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142#9898774 (10akosiaris) [13:51:10] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage [13:51:14] (03PS2) 10Urbanecm: Growth: Enable CommunityConfiguration on arwiki, eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043784 (https://phabricator.wikimedia.org/T364895) [13:51:39] and now...the project my team spent the last ~year on! :) [13:51:44] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898775 (10kamila) [13:51:57] (03CR) 10Urbanecm: [C:03+2] "Let's do it! 
😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043784 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [13:52:06] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898779 (10kamila) a:03kamila [13:52:06] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1036.eqiad.wmnet with OS bookworm [13:52:38] (03Merged) 10jenkins-bot: Growth: Enable CommunityConfiguration on arwiki, eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043784 (https://phabricator.wikimedia.org/T364895) (owner: 10Urbanecm) [13:53:01] (03CR) 10Muehlenhoff: [C:03+2] Remove account end data/contact for tandic [puppet] - 10https://gerrit.wikimedia.org/r/1046672 (owner: 10Muehlenhoff) [13:53:15] !incidents [13:53:15] No incidents occurred in the past 24 hours for team SRE [13:53:17] urbanecm: :o [13:53:26] 10SRE-tools, 10Dumps-Generation, 06Infrastructure-Foundations, 06serviceops, 07IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142#9898785 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've removed * dumpsdata[1001-1003].eqiad.wmn... [13:53:31] (03CR) 10DCausse: [C:03+1] elasticsearch: add Search Platform and DPE SRE as alert recipients [puppet] - 10https://gerrit.wikimedia.org/r/1046679 (https://phabricator.wikimedia.org/T367435) (owner: 10Bking) [13:53:32] ooh, is that the first non-beta wikis? 
[13:53:41] (with CommConf) [13:53:43] Lucas_WMDE: if you don't count testwiki :) [13:53:46] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:46] hehe [13:54:46] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [13:56:23] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9898793 (10kamila) 05In progress→03Stalled Stalled on @Milimetric signing L3 [13:57:09] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9898796 (10kamila) 05Open→03Stalled waiting for approval [14:00:53] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1046597|Backport all commits from master (T364895)]], [[gerrit:1046598|Check EntitySchemaIsRepo in more hook handlers (T363153)]] (duration: 16m 47s) [14:00:59] T364895: Enable CommunityConfiguration on pilot wikis: Arabic & Spanish Wikipedia: Jun 17, 2024 - https://phabricator.wikimedia.org/T364895 [14:01:00] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [14:01:04] and synced [14:01:18] !log brouberol@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch [14:01:18] urbanecm: if you can give me 2 minutes before the next patch, I have a few nodes to depool [14:01:25] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1043784|Growth: Enable CommunityConfiguration on arwiki, eswiki (T364895)]] [14:01:28] claime: i already started scap :( [14:01:29] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for 
host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [14:01:32] no worries :D [14:01:35] It can wait [14:01:50] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Case Review Committee/Call for applicants" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Case Review Committee/Call for applicants" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:54] T367217: Request to move translatable page: Trust and Safety - https://phabricator.wikimedia.org/T367217 [14:02:19] (03PS1) 10Clément Goubert: kubernetes: reimage 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046682 (https://phabricator.wikimedia.org/T351074) [14:02:36] !log disable puppet on A:cp-codfw before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046681 - T364383 [14:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:40] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [14:03:47] (03CR) 10Vgutierrez: [C:03+2] hiera: Set fifo-log-demux prometheus port for codfw [puppet] - 10https://gerrit.wikimedia.org/r/1046681 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:03:59] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Case Review Committee/Charter" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Case Review Committee/Charter" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:28] btullis: Ben Tullis: Add a dummy Cephx user key for the cephcsi plugin to use (8948c4d) ---> that's pending to be merged FYI [14:04:35] (03CR) 10Herron: [C:03+1] 
logstash: add auto_offset_reset to kafka input [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:05:54] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1043784|Growth: Enable CommunityConfiguration on arwiki, eswiki (T364895)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:05:58] (03CR) 10Herron: [C:03+1] "Looks good to me but AFAIK will be hold until week of the 24th" [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:06:09] !log rolling upgrade on A:cp-codfw to fifo-log-demux 0.7.5 - T364383 [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:49] !log urbanecm@deploy1002 urbanecm: Continuing with sync [14:06:52] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/Proposed update" "Wikimedia Foundation/Legal/2023 ToU updates/Proposed update" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:56] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:07:28] (03PS1) 10Ssingh: dnsbox: announce ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) [14:07:36] !log taavi@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudvirt-wdqs1001.eqiad.wmnet [14:07:47] MichaelG_WMF: it's working! 
:) https://usercontent.irccloud-cdn.com/file/976No7Q0/image.png [14:08:03] * MichaelG_WMF is already looking and testing^^ [14:08:23] so far things look good indeed [14:08:34] yup [14:08:37] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:08:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2934/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [14:08:55] though I guess we haven't run the script yet and not yet flipped the switch for GE [14:09:13] MichaelG_WMF: we did both [14:09:33] i ran the conversion script in debug server instead [14:09:34] right, I was confused why I saw data in the config [14:09:51] (i could've done a `scap pull` on mwmaint and run it there instead, too) [14:10:11] MichaelG_WMF: transcript is at https://phabricator.wikimedia.org/T364895#9898823, if interested [14:10:21] * MichaelG_WMF looks [14:10:59] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/2023 ToU updates/talkheader" "Wikimedia Foundation/Legal/2023 ToU updates/talkheader" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:11:01] (03PS2) 10Ssingh: dnsbox: announce ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) [14:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:07] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for 4 mw servers - cgoubert@cumin1002" [14:12:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix AAAA records for 4 mw servers - cgoubert@cumin1002"
[14:12:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:05] (03CR) 10Scott French: "Thanks, Janis!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:13:06] mh, looking at the logs - neither wiki had personalized praise or leveling up config touched? That's certainly interesting. But I guess not something that affects deployment [14:13:10] (03CR) 10Scott French: [C:03+2] mediawiki-dev: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:13:35] 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757 (10MunizaA) 03NEW [14:13:46] (03Merged) 10jenkins-bot: mediawiki-dev: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:14:30] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9898890 (10Papaul) @ayounsi wikikube-ctrl2001 is racked on u13 if we move it and plug it in port 44-47 it will mess up the hard w... 
[14:14:53] jouncebot: nowandnext [14:14:53] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [14:14:53] In 1 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1530) [14:16:03] !log killing updateMenteeData.php --wiki=enwiki --statsd --dbshard s1 [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:29] (it wasn't picking up the new lb config, please fix it) [14:16:51] (03CR) 10Hnowlan: [C:03+1] kubernetes: reimage 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046682 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:16:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1043784|Growth: Enable CommunityConfiguration on arwiki, eswiki (T364895)]] (duration: 15m 34s) [14:17:04] T364895: Enable CommunityConfiguration on pilot wikis: Arabic & Spanish Wikipedia: Jun 17, 2024 - https://phabricator.wikimedia.org/T364895 [14:17:35] !log Depooling mw2323.codfw.wmnet,mw2324.codfw.wmnet,mw2326.codfw.wmnet,mw2327.codfw.wmnet,mw2328.codfw.wmnet,mw2329.codfw.wmnet for reimage - T351074 [14:17:38] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm [14:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:39] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [14:18:02] ugh wrong paste [14:18:16] !log Depooling mw1359.eqiad.wmnet,mw1364.eqiad.wmnet,mw1365.eqiad.wmnet,mw1412.eqiad.wmnet for reimage - T351074 [14:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] anyway, we should be live! 
[14:18:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:18:40] Amir1: that script is not supposed to run for very long though [14:18:44] (updateMenteeData) [14:18:59] mind logging a task for me to look? [14:19:07] kamila_: done with my deployments :) [14:19:22] thanks urbanecm :-) [14:19:44] (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [14:20:13] (03PS1) 10Ssingh: durum: switch NTP peers to ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046689 (https://phabricator.wikimedia.org/T366360) [14:20:17] (03PS1) 10Herron: Revert "istio_slos: add secondary recording rules" [puppet] - 10https://gerrit.wikimedia.org/r/1046690 (https://phabricator.wikimedia.org/T359879) [14:20:31] (03CR) 10Clément Goubert: [C:03+2] kubernetes: reimage 4 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1046682 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:20:39] (03CR) 10Bking: [C:03+2] elasticsearch: add Search Platform and DPE SRE as alert recipients [puppet] - 10https://gerrit.wikimedia.org/r/1046679 (https://phabricator.wikimedia.org/T367435) (owner: 10Bking) [14:21:19] inflatador: ok to merge your change? 
[14:21:22] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2935/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046689 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [14:21:29] claime Y, was just about to ask [14:21:37] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/Announcement/2023 OC and CRC appointments process" "Wikimedia Foundation/Legal/Announcement/2023 OC and CRC appointments process" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:42] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:21:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499#9898915 (10Jclark-ctr) 05Open→03Resolved @Marostegui Updated idrac and bios firmware took server down to min config... 
[14:21:46] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl2001.codfw.wmnet [14:22:09] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2001.eqiad.wmnet [14:22:30] inflatador: done [14:23:10] !log taavi@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:23:56] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1359.eqiad.wmnet [14:24:04] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1359.eqiad.wmnet [14:24:41] argh firmware too old :( [14:25:15] ;_; [14:27:12] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/Brand Stewardship Report" "Wikimedia Foundation/Legal/Brand Stewardship Report" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:17] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:28:15] (03PS3) 10Scott French: mediawiki: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) [14:28:15] (03PS1) 10Scott French: mediawiki: enable securityContext in all canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) [14:28:16] (03PS1) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [14:28:44] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Case Review Committee/Legal agreement" "Wikimedia 
Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Case Review Committee/Legal agreement" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] T367217: Request to move translatable page: Trust and Safety - https://phabricator.wikimedia.org/T367217 [14:28:55] 07Puppet, 06Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9898956 (10CDanis) Alternatives to consider: * Make this a required field instead of adding a default [harder up-front but potentially safer] * Make omitting this field wmf pup... [14:29:14] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1364.eqiad.wmnet [14:29:21] (03PS2) 10Kamila Součková: Revert "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046660 [14:29:55] hnowlan: I must have stepped on a black cat under a ladder. Second one's idrac isn't responding [14:30:06] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1364.eqiad.wmnet [14:30:12] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:31:03] (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1046660 (owner: 10Kamila Součková) [14:31:27] jouncebot: now [14:31:27] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [14:31:32] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9898979 (10CDanis) Suggestions from discussion at I/F meeting: * It's probably not necess... 
[14:31:43] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Resources" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Resources" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:57] Any objections to me using the open window rn? [14:32:59] mvolz: kamila_ might be doing something of impact for deployments – not 100% sure. [14:33:35] mvolz: go ahead and see what happens? :D [14:33:42] ok :) [14:34:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-staging2003 to codfw - jhancock@cumin2002" [14:34:37] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Resources/What is a conduct warning" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Resources/What is a conduct warning" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:41] T367217: Request to move translatable page: Trust and Safety - https://phabricator.wikimedia.org/T367217 [14:35:35] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1045990 (owner: 10PipelineBot) [14:36:23] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:36:25] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1045990 (owner: 10PipelineBot) [14:36:40] mvolz: it should be in a stable-ish state rn, but I am in the middle of removing k8s masters, so if you see something weird, lmk and I'll get it into a more consistent state [14:37:12] !log zabe@mwmaint1002:~$ mwscript 
extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety/Tools and processes" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety/Tools and processes" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [14:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] (03PS1) 10Eevans: restbase1028: Upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1046695 (https://phabricator.wikimedia.org/T350567) [14:38:41] !log joal@deploy1002 Started deploy [airflow-dags/analytics@b682892]: (no justification provided) [14:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:15] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@b682892]: (no justification provided) (duration: 00m 33s) [14:39:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-staging2003 to codfw - jhancock@cumin2002" [14:39:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:07] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/Committee appointments" "Wikimedia Foundation/Legal/Committee appointments" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:12] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:43:03] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [14:43:04] !log kamila@cumin1002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:43:37] mvolz: also, can you please ping me once you're done? [14:43:50] sure [14:43:54] thanks <3 [14:44:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:44:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl2001.codfw.wmnet [14:44:07] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9899054 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl2001.codfw.... 
[14:44:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:44:36] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1364.eqiad.wmnet [14:44:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:44:36] (03CR) 10Ssingh: "PCC output for cumin:O:dnsbox:" [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [14:44:41] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1364.eqiad.wmnet [14:45:13] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1365.eqiad.wmnet [14:45:20] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1365.eqiad.wmnet [14:45:28] Great, none of them then. [14:46:15] urbanecm: sure thing! 
[14:47:36] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/Committee appointments/Announcement" "Wikimedia Foundation/Legal/Committee appointments/Announcement" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:41] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:48:02] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [14:48:46] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1412.eqiad.wmnet [14:48:54] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1412.eqiad.wmnet [14:50:11] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/Committee appointments/Announcement/Short" "Wikimedia Foundation/Legal/Committee appointments/Announcement/Short" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9899106 (10Jhancock.wm) sent [14:50:48] (03PS5) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [14:50:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1046695 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:51:32] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - 
https://phabricator.wikimedia.org/T367766 (10Clement_Goubert) 03NEW [14:52:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt-wdqs1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:52:08] (03PS1) 10Clément Goubert: Revert "kubernetes: reimage 4 appservers to kubernetes" [puppet] - 10https://gerrit.wikimedia.org/r/1046697 [14:52:16] I'm gonna revert my patch and choose 4 other servers... [14:52:23] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet [14:53:34] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [14:54:25] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1444.eqiad.wmnet [14:55:52] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department/FAQ On Countering Terrorist and Violent Extremist Content on Wikimedia Projects" "Wikimedia Foundation/Legal/FAQ On Countering Terrorist and Violent Extremist Content on Wikimedia Projects" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [14:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:57] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [14:55:57] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1444.eqiad.wmnet [14:56:02] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [14:56:51] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [14:57:16] (03CR) 10Clément Goubert: [C:03+2] Revert "kubernetes: reimage 4 appservers to kubernetes" [puppet] - 10https://gerrit.wikimedia.org/r/1046697 (owner: 10Clément Goubert) [14:57:57] 07Puppet, 06Infrastructure-Foundations:
Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9899155 (10MoritzMuehlenhoff) One other option: Add a separate wrapper define systemd::timer::job_capped which has the timeout as a mandatory argument (but without a default).... [14:58:02] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [14:58:08] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046698 (https://phabricator.wikimedia.org/T128546) [14:58:30] (03PS1) 10Brouberol: dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) [14:58:37] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [14:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:21] !log taavi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [14:59:26] (03CR) 10Brouberol: dse-k8s: setup a discovery record for all deployed applications (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [14:59:38] (03CR) 10Eevans: [C:03+2] restbase1028: Upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1046695 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [15:01:14] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:33] !log cgoubert@cumin1002 conftool action : set/weight=30:pooled=yes; selector: 
name=(mw1359.eqiad.wmnet|mw1364.eqiad.wmnet|mw1365.eqiad.wmnet|mw1412.eqiad.wmnet) [15:03:46] !log Repooling mw1359.eqiad.wmnet,mw1364.eqiad.wmnet,mw1365.eqiad.wmnet,mw1412.eqiad.wmnet pending fw upgrade - T351074 [15:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:51] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:05:35] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9899198 (10Gehel) [15:05:56] 07Puppet, 10Cloud-VPS: systemd-timer-mail-wrapper should not send mail as root@wikimedia.org from Cloud VPS - https://phabricator.wikimedia.org/T367028#9899208 (10joanna_borun) [15:06:16] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9899221 (10elukey) p:05Triage→03Medium [15:08:46] (03CR) 10Vgutierrez: [C:03+1] hiera: install haproxy 2.8.10 on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1046674 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:08:48] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9899269 (10joanna_borun) p:05Triage→03Medium a:03CDanis [15:09:56] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9899294 (10CDanis) p:05Triage→03Medium [15:10:02] 07Puppet, 06Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119#9899295 (10CDanis) p:05Triage→03Low [15:10:17] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix inbound rollout sequence, mx-in - 
https://phabricator.wikimedia.org/T367517#9899300 (10jhathaway) p:05Triage→03Medium [15:10:27] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9899302 (10joanna_borun) p:05Triage→03Low [15:10:28] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [15:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:33] (03PS2) 10Brouberol: dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) [15:11:12] kamila_: i guess i'm done though contemplating reverting stuff since this latest patch did not fix the issue :P. It's messing up logs, though - so I guess we can live with messed up logs since that should affect users. :( [15:11:23] err should not [15:11:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#9899310 (10joanna_borun) p:05Triage→03High [15:12:55] (03CR) 10Brouberol: "The target DNS record exists." [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [15:12:57] mvolz: feel free to revert, I can keep myself occupied with something else :-D [15:13:31] (03PS1) 10Clément Goubert: kubernetes: Reimage 3 appservers as kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1046705 (https://phabricator.wikimedia.org/T351074) [15:14:07] kamila_: I think I'm going to leave it because it's been junk since thursday anyway, go ahead. just need to fix it... sigh. 
[15:14:39] (03CR) 10Fabfur: [C:03+2] hiera: install haproxy 2.8.10 on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1046674 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:15:46] Ok, thanks and gl with fixing it mvolz! [15:15:57] (03CR) 10Clément Goubert: mw-on-k8s: Deploy statsd exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [15:16:56] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl2002.codfw.wmnet [15:16:56] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts wikikube-ctrl2002.codfw.wmnet [15:17:17] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10Mail: Update fundraising mail / firewall settings to use new production mx-in hosts - https://phabricator.wikimedia.org/T367573#9899341 (10cmooney) p:05Triage→03Medium [15:17:17] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl2002.codfw.wmnet [15:17:45] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [15:18:13] (03CR) 10Hnowlan: [C:03+1] kubernetes: Reimage 3 appservers as kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1046705 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:19:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:21] 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9899352 (10ayounsi) We had a quick look at the network side and couldn't find any smoking gun. In the future if you could run a packet capture on both sides `tcpdump -i in... 
[15:19:33] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: use dedicated user agents [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046591 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [15:20:02] (03CR) 10Elukey: [C:03+1] Revert "istio_slos: add secondary recording rules" [puppet] - 10https://gerrit.wikimedia.org/r/1046690 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [15:20:10] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Reimage 3 appservers as kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1046705 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:20:25] (03Merged) 10jenkins-bot: Search update pipeline: use dedicated user agents [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046591 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [15:20:40] !log draining transport circuits in/out of eqdfw in advance of router power-supply work/upgrade T366864 [15:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:44] T366864: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864 [15:21:28] !log Depooling mw1444.eqiad.wmnet,mw1447.eqiad.wmnet,mw1489.eqiad.wmnet for reimage - T351074 [15:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:32] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:23:27] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:24:12] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1444.eqiad.wmnet [15:24:37] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1444.eqiad.wmnet [15:24:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1444 to wikikube-worker1019 [15:25:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP 
CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:03] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [15:26:38] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4037.*} and A:cp [15:27:18] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 465, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:44] (03PS1) 10Andrew Bogott: Prepare for decom of cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1046707 (https://phabricator.wikimedia.org/T367773) [15:28:11] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4037.*} and A:cp [15:28:24] (03PS6) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [15:28:36] !log upgrading haproxy to 2.8.10 on cp4037 (T367756) [15:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] T367756: Upgrade ulsfo hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [15:28:57] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:28:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [15:28:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts 
wikikube-ctrl2002.codfw.wmnet [15:29:05] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9899430 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl2002.codfw.... [15:29:16] (03PS7) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [15:30:04] jan_drewniak: Your horoscope predicts another Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1530). [15:31:05] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:31:31] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1444 to wikikube-worker1019 - cgoubert@cumin1002" [15:32:10] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1001.eqiad.wmnet [15:32:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1444 to wikikube-worker1019 - cgoubert@cumin1002" [15:32:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:32] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1019 [15:33:07] (03PS1) 10Clare Ming: extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) [15:33:10] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9899455 (10kamila) wikikube-ctrl1001 looks happy, thanks for 
the help! I have decommed wikikube-ctrl1001 and 1002, they're good... [15:33:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1019 [15:33:44] (03CR) 10CDobbins: varnish: show better error for 429s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:33:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1444 to wikikube-worker1019 [15:33:50] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046698 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:33:53] (03CR) 10CDobbins: varnish: show better error for 429s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:34:10] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1447 to wikikube-worker1020 [15:34:27] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:34:27] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046698 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:48] andrewbogott: I've caught your decom changes for cloudvirt-wdqs1001.eqiad.wmnet. in my netbox cookbook run, ok to proceed? 
[15:37:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:37:11] claime: yes [15:37:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:37:25] andrewbogott: ack, in progress [15:37:27] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9899463 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8c9e583f-3e7e-41d4-ab2f-0f862f085c35) set by cmooney@cumin1002 for 1:30:00 on 2 host(s) and their services with rea... [15:37:41] claime, I thought I approved it though? Is there a step I need to be doing outside of the cookbook prompts? [15:37:48] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1447 to wikikube-worker1020 - cgoubert@cumin1002" [15:37:55] andrewbogott: huh, weird [15:38:01] it's supposed to lock as well [15:38:12] :( [15:39:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1447 to wikikube-worker1020 - cgoubert@cumin1002" [15:39:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:02] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1020 [15:39:05] (03PS1) 10Arnaudb: mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) [15:39:23] andrewbogott: anyways, should be done now [15:39:27] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:39:31] !log deactivate Transit and peering sessions on cr2-eqdfw in advance of power-supply change T366864 [15:39:35]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:35] T366864: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864 [15:39:39] great. I'm going to decom a couple more but not for a few minutes [15:40:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1020 [15:40:08] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 451, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:09] ok I have another rename to go through, I'll tell you when I'm done? should only take a couple minutes [15:40:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1447 to wikikube-worker1020 [15:40:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1489 to wikikube-worker1021 [15:41:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:05] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt-wdqs1001.eqiad.wmnet [15:41:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:41:14] claime: sounds good [15:42:06] * inflatador misses having an API for DNS changes [15:42:40] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [15:42:41] (03CR) 10Andrew Bogott: [C:03+2] Prepare for decom of cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/1046707 (https://phabricator.wikimedia.org/T367773) (owner: 10Andrew Bogott) [15:43:02] inflatador: netbox has an api :) [15:43:04] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 11.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:43:19] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Renaming mw1489 to wikikube-worker1021 - cgoubert@cumin1002" [15:43:28] it's the generation of the updated zone files that's limited in concurrency [15:44:27] yeah, the gitops workflow for DNS changes is the bottleneck. AFAIK we have a pretty low volume of DNS changes [15:44:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1489 to wikikube-worker1021 - cgoubert@cumin1002" [15:44:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:32] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1021 [15:44:42] Except when everyone's decommissioning/renaming nodes :D [15:44:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1021 [15:44:51] ok yeah [15:44:52] (03PS1) 10Andrew Bogott: Remove mentions of cloudvirt-wdqs100[1,2,3] [puppet] - 10https://gerrit.wikimedia.org/r/1046716 (https://phabricator.wikimedia.org/T367773) [15:44:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1489 to wikikube-worker1021 [15:45:00] andrewbogott: done [15:45:12] well...compared to running a DNSaaS service that is ;) [15:45:20] cool, thanks [15:46:02] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1019.eqiad.wmnet wikikube-worker1020.eqiad.wmnet wikikube-worker1021.eqiad.wmnet on all recursors [15:46:02] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1001.eqiad.wmnet [15:46:04] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1002.eqiad.wmnet [15:46:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1019.eqiad.wmnet wikikube-worker1020.eqiad.wmnet wikikube-worker1021.eqiad.wmnet on all 
recursors [15:46:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1019.eqiad.wmnet with OS bullseye [15:46:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1020.eqiad.wmnet with OS bullseye [15:46:46] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,cr2-drmrs,cr3-knams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:46:47] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on cr[1-2]-codfw,cr2-drmrs,cr3-knams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:46:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1021.eqiad.wmnet with OS bullseye [15:48:25] FIRING: SystemdUnitFailed: ferm.service on mw2297:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:41] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,cr2-drmrs,cr2-esams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:48:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr[1-2]-codfw,cr2-drmrs,cr2-esams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [15:49:53] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:49:59] !log rebooting cr2-eqdfw to upgrade JunOS T364092 [15:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:04] T364092: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092 [15:52:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:52:46] (03CR) 10Xcollazo: [C:03+1] "I personally do not use the `aqs` role for anything." [puppet] - 10https://gerrit.wikimedia.org/r/1043894 (https://phabricator.wikimedia.org/T313877) (owner: 10Eevans) [15:52:52] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 14m 41s) [15:52:56] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:53:25] FIRING: [9x] SystemdUnitFailed: ferm.service on kubernetes2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:47] AppserversUnreachable is me, will recover [15:55:29] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:55:46] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:56:05] (03CR) 10Ssingh: varnish: show better error for 429s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [15:56:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:56:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:42] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission 
(exit_code=1) for hosts cloudvirt-wdqs1002.eqiad.wmnet [15:56:46] claime: hi, I'm just doing a wikimedia portals deploy right now, is what you're doing going to affect that? [15:57:27] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:27] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt-wdqs1001.eqiad.wmnet [15:57:28] it shouldn't [15:58:25] RESOLVED: [9x] SystemdUnitFailed: ferm.service on kubernetes2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:29] Ok, I did notice one k8s host timeout `ssh: connect to host mw2321.codfw.wmnet port 22: Connection timed out` [15:58:32] (03CR) 10Dzahn: "ok, for some reason I expected that we want to keep it docker-ce here as well and copying what we had just done seemed to make sense. I wi" [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [15:58:46] jan_drewniak: yeah that one's got hardware issues [15:59:10] jan_drewniak: it should just affect the pull-k8s stage, which is not a problem unless you get a lot of failures [15:59:12] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1001.eqiad.wmnet [15:59:16] mw2321 is known broken (https://phabricator.wikimedia.org/T367702) [15:59:33] ok thanks, that was the only error [15:59:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:59:45] shhh parsoid you're fine [16:00:33] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1028.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002 [16:03:35] !log andrew@cumin1002 START -
Cookbook sre.dns.netbox [16:05:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:14] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt-wdqs1001.eqiad.wmnet [16:05:33] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:07:21] (03PS2) 10Elukey: redfish: add property for storage manager URI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372) [16:08:40] FIRING: [21x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:52] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1002.eqiad.wmnet [16:09:15] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1019.eqiad.wmnet with OS bullseye [16:09:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1028.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002 [16:09:28] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 14m 13s) [16:09:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:09:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:09:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1019.eqiad.wmnet with OS bullseye [16:11:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9899692 (10elukey) @Jhancock.wm super useful, thanks a lot! Do you know if there was anything else that we could use, maybe in the label attached to the host where its s... [16:11:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:12:35] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:14:24] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:16:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:05] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt-wdqs1002.eqiad.wmnet [16:16:48] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1003.eqiad.wmnet [16:18:43] (03PS1) 10Dzahn: codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) [16:19:04] (03CR) 10CI reject: [V:04-1] codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [16:19:14] jouncebot: now [16:19:14] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [16:19:24] anybody mind if I use this window for a fix? 
[16:19:31] (03PS2) 10Dzahn: codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) [16:19:53] (03CR) 10CI reject: [V:04-1] codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [16:20:49] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042333 (owner: 10PipelineBot) [16:20:50] not sure where best to report this since it is no longer an ongoing issue; db1170 reached 5-6 hours of replag around 14:30 UTC; I'm guessing it should have been depooled but wasn't; cc marostegui [16:20:52] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043172 (owner: 10PipelineBot) [16:21:44] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [16:22:08] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046725 [16:22:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [16:23:03] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046725 (owner: 10PipelineBot) [16:23:24] (03PS3) 10Dzahn: codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) [16:23:59] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046725 (owner: 10PipelineBot) [16:24:02] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:24:03] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9899763 (10cmooney) [16:25:01] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9899765 (10cmooney) >>! In T364092#9766653, @ayounsi wrote: > Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru. Doh, I went with 22.2R... [16:25:13] about to deploy fix :) [16:25:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [16:25:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1003.eqiad.wmnet [16:25:28] (i hope) [16:25:41] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [16:26:04] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [16:26:26] (03PS1) 10Clément Goubert: Fix missing wikikube-worker [puppet] - 10https://gerrit.wikimedia.org/r/1046726 [16:27:10] (03CR) 10Clément Goubert: [C:03+2] Fix missing wikikube-worker [puppet] - 10https://gerrit.wikimedia.org/r/1046726 (owner: 10Clément Goubert) [16:27:28] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [16:27:57] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [16:28:49] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [16:29:19] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [16:29:34] 
!log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on cr[1-2]-codfw,cr2-drmrs,cr2-esams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [16:29:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on cr[1-2]-codfw,cr2-drmrs,cr2-esams,cr2-magru with reason: JunOS upgrade and PSU swap on cr2-eqdfw [16:29:47] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9899798 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2d8f617e-3efd-4a08-88cb-75ab38c0cc68) set by cmooney@cumin1002 for 0:40:00 on 5 host(s) and their services with rea... [16:29:52] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: JunOS upgrade and PSU swap on cr2-eqdfw [16:30:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: JunOS upgrade and PSU swap on cr2-eqdfw [16:30:18] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9899803 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f6da6e6c-9be7-4e8c-9beb-a7f1139359f6) set by cmooney@cumin1002 for 0:40:00 on 2 host(s) and their services with rea... 
[16:30:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage [16:31:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage [16:32:35] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage [16:32:46] ^^ ping alert expected [16:33:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:15] i am done fyi. fix seems to have worked. [16:33:25] (03CR) 10Andrew Bogott: [C:03+2] Remove mentions of cloudvirt-wdqs100[1,2,3] [puppet] - 10https://gerrit.wikimedia.org/r/1046716 (https://phabricator.wikimedia.org/T367773) (owner: 10Andrew Bogott) [16:33:40] FIRING: [12x] SystemdUnitFailed: ferm.service on kubernetes2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage [16:33:50] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046724" [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [16:33:55] FIRING: [13x] SystemdUnitFailed: ferm.service on kubernetes2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:07] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9899821 (10Eevans) `/dev/sde` has failed again :( {F55409881} {F55409882} {F55409883} [16:34:49] (03CR) 10Scott French: [C:03+2] Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [16:36:01] 10ops-eqiad, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9899829 (10Andrew) a:05Andrew→03None [16:36:40] (03Merged) 10jenkins-bot: Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [16:36:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "thanks, ACK! 
https://puppet-compiler.wmflabs.org/output/1043247/2937/" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [16:37:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage [16:37:04] (03PS3) 10Dzahn: idp: remove gitlab from the CAS protocol section [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) [16:38:40] RESOLVED: [13x] SystemdUnitFailed: ferm.service on kubernetes2054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:43] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:03] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:12] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9899850 (10bking) Thanks for checking that out, and sorry I did not check the SEL before sending over. I wonder if we could alert off this situation? Will ask around in #wikimedia-sre [16:39:49] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1043247/2938/idp1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [16:40:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage [16:41:10] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9899854 (10Eevans) >>! 
In T362033#9899821, @Eevans wrote: > `/dev/sde` has failed again :( > > {F55409881} > > {F55409882} > > {F55409883} So —and @VRiley-WMF correct me if I'm wrong— we didn... [16:42:16] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [16:42:16] !log mnz@deploy1002 Started deploy [airflow-dags/research@5e1cd80]: (no justification provided) [16:42:41] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [16:42:49] !log mnz@deploy1002 Finished deploy [airflow-dags/research@5e1cd80]: (no justification provided) (duration: 00m 32s) [16:43:40] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [16:43:43] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: sync [16:43:53] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: sync [16:44:34] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:44:53] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:45:24] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:45:46] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:46:14] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [16:46:25] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [16:46:43] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: sync [16:46:50] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: sync [16:47:17] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [16:47:30] !log swfrench@deploy1002 helmfile [staging] DONE 
helmfile.d/services/image-suggestion: apply [16:47:36] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: sync [16:47:43] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: sync [16:48:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - https://phabricator.wikimedia.org/T367766#9899861 (10Clement_Goubert) [16:48:14] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [16:48:41] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [16:49:15] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [16:49:35] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [16:50:16] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:50:19] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [16:50:21] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [16:51:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1020.eqiad.wmnet with OS bullseye [16:52:26] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4048.ulsfo.wmnet [16:53:05] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:55:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1019.eqiad.wmnet with OS bullseye [16:56:14] !log homer 'lsw1-e2-eqiad*' commit 'T351074' [16:56:39] 06SRE, 06Infrastructure-Foundations, 10netops: 
Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#9899880 (10cmooney) [16:57:12] (03Merged) 10jenkins-bot: redfish: add property for storage manager URI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:58:34] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [16:58:47] !log homer 'cr*eqiad*' commit 'T351074' [16:58:48] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [16:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:51] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:58:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1021.eqiad.wmnet with OS bullseye [16:59:20] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: sync [16:59:32] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: sync [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T1700). 
[17:02:24] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [17:05:32] !log Pooling and uncordoning wikikube-worker1019.eqiad.wmnet,wikikube-worker1020.eqiad.wmnet,wikikube-worker1021.eqiad.wmnet - T351074 [17:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:36] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [17:06:01] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T367789 (10Clement_Goubert) 03NEW [17:06:43] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [17:07:22] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [17:07:32] (03PS9) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:11:26] (03PS10) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:12:19] (03CR) 10Jforrester: [C:04-1] "PS9 looks mis-crushed? That's not our standard SVGO config." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:12:21] (03CR) 10Dzahn: [V:03+1 C:03+2] "deployed on idp - i can still login" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [17:12:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "Notice: /Stage[main]/Apereo_cas/File[/etc/cas/services/gitlab-23.json]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [17:12:46] (03PS11) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:13:14] (03CR) 10Herron: [C:03+2] Revert "istio_slos: add secondary recording rules" [puppet] - 10https://gerrit.wikimedia.org/r/1046690 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [17:14:02] (03CR) 10Ladsgroup: "oh. My apologies. let me fix that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:16:12] (03PS1) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [17:16:31] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [17:16:51] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:17:11] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [17:18:14] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4042.ulsfo.wmnet [17:21:12] (03CR) 10Jforrester: "Apart from the CI failure, this needs to depend on I2d3784b3783188649fa955f505e943d1c7273bea (which needs to wait two weeks) and have a ch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:21:39] (03CR) 10Jforrester: [C:04-2] "Needs to wait until 1.43.0-wmf.11 (or later)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:22:25] (03PS2) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [17:22:28] (03CR) 10Jforrester: "Do you need any help/co-ordination getting this deployed?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010953 (https://phabricator.wikimedia.org/T360067) (owner: 10Kosta Harlan) [17:22:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 105 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:23:05] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:27:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "@slyngshede Looks like puppet already does the thing, right? :)" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [17:27:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 68 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:29:24] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [17:30:00] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [17:30:33] RECOVERY - Host elastic2099 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [17:31:50] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [17:31:54] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10Mail: Update fundraising mail / firewall settings to use new production mx-in hosts - https://phabricator.wikimedia.org/T367573#9900161 (10Dwisehaupt) PFW update tracked in T367796. 
[17:32:36] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [17:32:57] (03PS12) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:33:20] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [17:33:46] (03CR) 10Ladsgroup: "I used scour which is in manage.py in logos/ and then fed it to svgo minifier based on core's config (copy-pasted). So this should be the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:34:05] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [17:34:46] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [17:35:18] (03PS1) 10Volans: redfish: simplify interface of Redfish classes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) [17:35:27] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [17:35:56] (03CR) 10Volans: "This is a simplification proposal, LMK what do you think." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans) [17:36:15] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [17:36:57] PROBLEM - Host elastic2099 is DOWN: PING CRITICAL - Packet loss = 100% [17:37:15] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [17:38:45] (03PS1) 10Andrew Bogott: Make cloudvirt1037 an ovs host [puppet] - 10https://gerrit.wikimedia.org/r/1046735 (https://phabricator.wikimedia.org/T364457) [17:38:47] (03PS1) 10Andrew Bogott: Make cloudvirt1038 an ovs host [puppet] - 10https://gerrit.wikimedia.org/r/1046736 (https://phabricator.wikimedia.org/T364457) [17:38:47] (03PS1) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) [17:38:48] (03CR) 10Jforrester: "It's not terrible, though we shouldn't be stripping whitespace. Ah well. Let's ship it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:39:33] (03PS13) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:39:49] (03CR) 10Andrew Bogott: [C:03+2] Make cloudvirt1037 an ovs host [puppet] - 10https://gerrit.wikimedia.org/r/1046735 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [17:40:45] (03CR) 10Ladsgroup: "I found the issue and fixed it!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [17:41:07] jouncebot: nowandnext [17:41:13] :( [17:41:15] PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:42:56] (03PS2) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) [17:43:41] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bookworm [17:47:10] (03PS3) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [17:47:25] (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:47:50] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [17:50:47] Amir1: not sure where best to report this since it is no longer an ongoing issue (or if it needs a report); db1170 reached 4-5 hours of replag around 14:30 UTC; I'm guessing it should have been depooled but wasn't [17:51:30] thanks which section JJMC89 [17:51:54] nvm, found it [17:51:55] s7 [17:52:15] PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:52:42] (03CR) 10Volans: mariadb: bugfixes mysql_legacy (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 
(https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [17:53:13] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [17:53:55] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [17:54:04] I think I know what happened, it's a one off because when the script started the host was depooled, then it got repooled by another process but the script assumed it was depooled and treated it as such. [17:54:12] I think there is even a ticket for it [17:54:37] I've seen that happen before for s1 [17:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:28] on an unrelated note, do you still work on mailman stuff, Amir1? have an issue that the list owners can't fix [17:57:11] JJMC89: I think I fixed your case last week [17:57:23] the bad email address? [17:57:36] two subscriptions in one of mailing lists [17:58:52] yea, one of them was bouncing cause of some encoding of a + in the address but the owner couldn't remove it [17:59:22] I think that's done now, please double check ^_^ [18:00:15] I can't see anything about the bad address (never could). The working one is subscribed, which is good. [18:00:37] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [18:01:05] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [18:01:23] (03CR) 10Jforrester: [C:03+1] "Let's ship it!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [18:02:34] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [18:02:50] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [18:03:01] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [18:04:13] * James_F is a bad influence on Amir1, clearly. [18:04:56] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [18:05:33] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [18:05:52] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [18:06:31] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [18:07:06] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [18:07:49] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [18:08:19] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [18:09:17] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [18:09:42] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 50%, RTA = 30.47 ms [18:09:43] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [18:10:47] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [18:11:28] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [18:12:00] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800 (10RobH) 03NEW [18:12:01] !log swfrench@deploy1002 helmfile 
[eqiad] START helmfile.d/services/data-gateway: sync [18:12:03] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: sync [18:12:30] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#9900359 (10RobH) [18:13:44] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9900366 (10Papaul) PEM0 was connected to ps1 with blue tag and PEM1 was connected to PS2 with red tag. moved PEM0 to ps2 error clear moved PEM1 to PS1 error moved from PEM0 to PEM1 power dow... [18:14:50] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:09] jouncebot: nowandnext [18:18:10] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [18:18:10] In 1 hour(s) and 41 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T2000) [18:18:17] awesome [18:18:22] (03PS14) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [18:18:25] (03CR) 10Ladsgroup: [C:03+2] Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [18:19:05] (03Merged) 10jenkins-bot: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [18:19:27] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1027150|Change static footer icons to the new one (T256190)]] [18:19:32] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [18:20:44] (03PS1) 10Eevans: cassandra: not (yet) ready to upgrade restbase to 'dev' (4.1.5) 
[puppet] - 10https://gerrit.wikimedia.org/r/1046748 (https://phabricator.wikimedia.org/T354970) [18:21:35] (03CR) 10Eevans: [C:03+2] cassandra: not (yet) ready to upgrade restbase to 'dev' (4.1.5) [puppet] - 10https://gerrit.wikimedia.org/r/1046748 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [18:22:51] (03PS1) 10Ladsgroup: Remove footer override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 [18:23:37] (03CR) 10Jforrester: "Oh oops, yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:23:54] (03CR) 10Ladsgroup: [C:03+2] Remove footer override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:24:00] (03CR) 10Jforrester: [C:03+1] "(Temporary until we can serve the MW one from core again?)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:24:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:24:35] (03Merged) 10jenkins-bot: Remove footer override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:24:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1027150|Change static footer icons to the new one (T256190)]], [[gerrit:1046750|Remove footer override]] [18:24:59] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [18:25:08] (03CR) 10Ladsgroup: [C:03+2] "Yeah, while the 2x and 1.5x won't be needed at all" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046750 (owner: 10Ladsgroup) [18:29:49] !log ladsgroup@deploy1002 ladsgroup, jforrester: Backport for [[gerrit:1027150|Change static footer icons to the new one (T256190)]], [[gerrit:1046750|Remove footer override]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:30:11] Amir1: LGTM. 
[18:30:32] gorgeous [18:30:34] !log ladsgroup@deploy1002 ladsgroup, jforrester: Continuing with sync [18:30:37] (03PS1) 10Scott French: data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) [18:30:40] (03PS1) 10Scott French: aqs-http-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) [18:30:48] Only sad thing is that MobileFrontend still hides them. Boo hiss, etc. [18:32:00] (Pre-existing issue.) [18:32:33] FIRING: KubernetesCalicoDown: wikikube-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:33:02] and dark mode :D [18:33:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bookworm [18:33:22] Eh. 
[18:33:30] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801 (10RobH) 03NEW [18:33:49] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#9900440 (10RobH) [18:34:28] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:34] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:36:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bookworm [18:36:36] (03CR) 10Andrew Bogott: [C:03+2] Make cloudvirt1038 an ovs host [puppet] - 10https://gerrit.wikimedia.org/r/1046736 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [18:37:44] (03PS1) 10BCornwall: Set cp4042 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046755 (https://phabricator.wikimedia.org/T364891) [18:39:17] Amir1: I'm amused how our thoughts crossed paths on T256190, you with your comment and me updating the description. :-) [18:39:17] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [18:39:26] Clearly we're working together too much. [18:39:37] :D [18:40:15] I'm actually a clone of your in disguise. It is a top secret DARPA project to ensure sustainability of Wikipedia. [18:40:24] *you [18:40:48] If only I was as awesome that you would be my clone. [18:41:04] But anyway, I think we can land https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1027178 now and worry about dark mode later. 
[18:42:08] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1027150|Change static footer icons to the new one (T256190)]], [[gerrit:1046750|Remove footer override]] (duration: 17m 12s) [18:44:00] {{done}} [18:47:58] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804 (10RobH) 03NEW [18:48:27] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9900501 (10RobH) [18:48:37] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9900509 (10RobH) [18:53:31] (03PS1) 10Ssingh: P:bird::anycast_monitoring: add monitoring for 10.3.0.[5-7]/32 [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) [18:54:07] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [18:55:25] (03CR) 10BCornwall: [C:03+2] Set cp4042 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046755 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [18:56:04] (03PS2) 10Ssingh: P:bird::anycast_monitoring: add monitoring for 10.3.0.[5-7]/32 [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) [18:56:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye [18:57:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9900583 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS b... 
[18:57:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [18:58:01] (03CR) 10Ladsgroup: "You need to set the host in footer:" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [18:58:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2940/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [19:01:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041659 (https://phabricator.wikimedia.org/T365627) (owner: 10Jforrester) [19:02:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) (owner: 10Jforrester) [19:10:34] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:10:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:15:40] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4042.ulsfo.wmnet with OS bullseye [19:15:55] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9900601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host 
cp4042.ulsfo.wmnet with OS bulls... [19:15:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS bullseye [19:16:08] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9900602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS b... [19:16:16] (03CR) 10Ladsgroup: [C:03+1] codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [19:22:33] FIRING: [2x] KubernetesCalicoDown: wikikube-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:22:36] RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:22:48] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bookworm [19:23:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [19:32:00] (03PS3) 10Ssingh: P:bird::anycast_monitoring: add monitoring for 10.3.0.[5-7]/32 [puppet] - 10https://gerrit.wikimedia.org/r/1046757 (https://phabricator.wikimedia.org/T366360) [19:32:20] (03PS1) 10Andrew Bogott: Move cloudvirt1039 and cloudvirt1040 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1046761 (https://phabricator.wikimedia.org/T364457) [19:34:10] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2941/co" [puppet] - 10https://gerrit.wikimedia.org/r/1046757 
(https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [19:36:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [19:38:21] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [19:40:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [19:43:52] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bookworm [19:50:56] (03PS8) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [19:54:43] (03CR) 10Andrew Bogott: [C:03+2] Move cloudvirt1039 and cloudvirt1040 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1046761 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [19:55:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [19:55:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [19:55:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P65116 and previous config saved to /var/cache/conftool/dbconfig/20240617-195520-ladsgroup.json [19:55:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:56:48] (03PS1) 10Andrew Bogott: Move cloudvirt1041 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1046764 (https://phabricator.wikimedia.org/T364457) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: 
#bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T2000). [20:00:05] James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] Happy to self-deploy. [20:00:42] (03PS2) 10Jforrester: [wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041659 (https://phabricator.wikimedia.org/T365627) [20:00:46] (03PS2) 10Jforrester: Add a note that you cannot change wgCategoryCollation easily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) [20:00:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041659 (https://phabricator.wikimedia.org/T365627) (owner: 10Jforrester) [20:00:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) (owner: 10Jforrester) [20:01:38] (03Merged) 10jenkins-bot: [wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041659 (https://phabricator.wikimedia.org/T365627) (owner: 10Jforrester) [20:01:39] (03Merged) 10jenkins-bot: Add a note that you cannot change wgCategoryCollation easily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) (owner: 10Jforrester) [20:01:57] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1041659|[wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff (T365627)]], [[gerrit:1039767|Add a note 
that you cannot change wgCategoryCollation easily (T362494 T366809)]] [20:02:06] T365627: Remove rights to promote and demote bureaucrats and admins from Wikifunctions staff - https://phabricator.wikimedia.org/T365627 [20:02:06] T362494: Enable numerical category sorting on Commons - https://phabricator.wikimedia.org/T362494 [20:02:06] T366809: Category pagination broken on Commons - https://phabricator.wikimedia.org/T366809 [20:02:51] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [20:03:18] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814 (10RobH) 03NEW [20:03:28] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#9900809 (10RobH) [20:05:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS bullseye [20:05:08] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9900818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS bullseye completed: - cp404... 
[20:06:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [20:06:30] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1041659|[wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff (T365627)]], [[gerrit:1039767|Add a note that you cannot change wgCategoryCollation easily (T362494 T366809)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:05] !log jforrester@deploy1002 jforrester: Continuing with sync [20:07:06] T365627: Remove rights to promote and demote bureaucrats and admins from Wikifunctions staff - https://phabricator.wikimedia.org/T365627 [20:07:06] T362494: Enable numerical category sorting on Commons - https://phabricator.wikimedia.org/T362494 [20:07:07] T366809: Category pagination broken on Commons - https://phabricator.wikimedia.org/T366809 [20:08:42] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4042.ulsfo.wmnet [20:08:58] (03PS9) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [20:15:40] hi, can i add a patch to the window still? i'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1043310 (just merged a minute ago) [20:15:42] jouncebot: now [20:15:43] For the next 0 hour(s) and 44 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T2000) [20:15:53] MatmaRex: Sure. [20:16:02] (03PS1) 10Jforrester: Fix styles for new heading HTML [skins/MinervaNeue] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046767 (https://phabricator.wikimedia.org/T367468) [20:16:04] Just wmf.9? 
[20:16:56] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1041659|[wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff (T365627)]], [[gerrit:1039767|Add a note that you cannot change wgCategoryCollation easily (T362494 T366809)]] (duration: 14m 59s) [20:16:58] Well, I guess wmf.8 is no longer active anyway. [20:17:03] there's no wmf.10 yet, is there? [20:17:21] (03PS2) 10Bartosz Dziewoński: Fix styles for new heading HTML [skins/MinervaNeue] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046767 (https://phabricator.wikimedia.org/T367468) (owner: 10Jforrester) [20:17:21] Not for a few hours. [20:17:37] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [skins/MinervaNeue] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046767 (https://phabricator.wikimedia.org/T367468) (owner: 10Jforrester) [20:17:57] (03PS1) 10Cory Massaro: Add addNestedMetadata to production orchestrator config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 [20:18:03] (03CR) 10Bartosz Dziewoński: [C:03+1] "(Oh oops, I was making a cherry-pick at the same time)" [skins/MinervaNeue] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046767 (https://phabricator.wikimedia.org/T367468) (owner: 10Jforrester) [20:18:37] (03PS2) 10Jforrester: Add addNestedMetadata to production orchestrator config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) (owner: 10Cory Massaro) [20:19:52] (03PS3) 10Cory Massaro: Add addNestedMetadata to production orchestrator config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) [20:24:22] * James_F glares at CI. 
[20:24:46] (03CR) 10Andrew Bogott: [C:03+2] Move cloudvirt1041 to ovs [puppet] - 10https://gerrit.wikimedia.org/r/1046764 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [20:25:12] (03PS10) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [20:27:33] RESOLVED: [2x] KubernetesCalicoDown: wikikube-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:33:08] (03PS11) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [20:33:09] MatmaRex: Sorry the merge is taking so long. [20:33:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bookworm [20:33:51] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816 (10RobH) 03NEW [20:34:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bookworm [20:34:15] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9901095 (10RobH) [20:34:31] James_F: i'm used to it, it's always like this these days [20:34:42] :-( [20:34:43] James_F: actually, can you make sure i didn't break it by submitting another patchset? [20:35:00] since it says "Starting gate-and-submit-wmf jobs" on patchset 1 [20:35:00] Yeah, it's still running. [20:35:06] but there is patchset 2 [20:35:29] (03CR) 10Jforrester: [C:03+2] "Re-confirming this C+2 applies to PS2 too." 
[skins/MinervaNeue] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046767 (https://phabricator.wikimedia.org/T367468) (owner: 10Jforrester) [20:35:49] "Copied votes on follow-up patch sets have been updated" shows that it copied the C+2 across. [20:35:51] But doesn't hurt. [20:40:50] (03CR) 10Dzahn: [C:03+2] codesearch: install docker.io if on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1046724 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [20:41:11] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 52.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:42:42] (03CR) 10Hashar: [C:04-1] "`its-phabricator` fails to pass tests :/" [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [20:43:04] (03CR) 10Scott French: "I believe these should no longer be needed due to Cassandra client startup alone. However, I'm not entirely sure about the services that a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:45:44] MatmaRex: Meh, you're right, the secondary PS has caused Zuul to re-run. [20:45:54] I'm going to force-submit. [20:46:01] CI passed fully. 
[20:46:15] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1046767|Fix styles for new heading HTML (T367468)]] [20:46:20] T367468: Heading CSS changes in Minerva cause issues with styled headings - https://phabricator.wikimedia.org/T367468 [20:49:44] (03CR) 10CDobbins: varnish: show better error for 429s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [20:50:07] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [20:50:41] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1046767|Fix styles for new heading HTML (T367468)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:50:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9901172 (10Jdforrester-WMF) Could mw2321 be de-pooled? It's still in the scap target list: ` 20:49:07 /usr/bin/sudo /usr/local/sbin/mediawik... [20:52:04] MatmaRex: Look OK to you? [20:52:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [20:53:39] looking [20:54:06] (03PS4) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) [20:54:36] James_F: looks good [20:54:37] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901190 (10BCornwall) [20:55:18] !log jforrester@deploy1002 jforrester: Continuing with sync [20:55:21] Cool. 
[20:58:56] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819 (10RobH) 03NEW [20:58:58] (03PS1) 10Dzahn: codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) [20:59:14] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9901240 (10RobH) [20:59:21] (03CR) 10CI reject: [V:04-1] codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [20:59:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [20:59:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [20:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65117 and previous config saved to /var/cache/conftool/dbconfig/20240617-205955-marostegui.json [21:00:00] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240617T2100). [21:00:16] (Scap still running.) 
[21:00:47] (03PS2) 10Dzahn: codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) [21:02:11] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:05:12] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1046767|Fix styles for new heading HTML (T367468)]] (duration: 18m 57s) [21:05:17] Finally. [21:05:17] T367468: Heading CSS changes in Minerva cause issues with styled headings - https://phabricator.wikimedia.org/T367468 [21:05:36] MatmaRex: Sorry it took so long. Happy bug-hunting. [21:06:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9901260 (10Dzahn) mw2321 is already depooled=inactive in confctl. I think the issue that this is isn't sufficient to make it disappear for de... 
[21:06:37] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367820 (10RobH) 03NEW [21:06:41] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367820#9901282 (10RobH) [21:06:52] (03CR) 10Dzahn: [C:03+2] codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [21:06:59] (03PS3) 10Dzahn: codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) [21:07:33] thanks James_F [21:08:31] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#9901287 (10RobH) [21:09:28] !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=4043.ulsfo.wmnet [21:09:56] !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet [21:13:20] (03PS1) 10CDobbins: Set cp4043 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046778 (https://phabricator.wikimedia.org/T364891) [21:16:37] (03CR) 10Muehlenhoff: [C:03+1] codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [21:19:11] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:19:33] (03CR) 10Dzahn: codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [21:20:11] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.03 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:20:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bookworm [21:29:15] (03CR) 10BCornwall: [C:03+1] Set cp4043 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046778 (https://phabricator.wikimedia.org/T364891) (owner: 10CDobbins) [21:35:59] (03CR) 10CDobbins: [C:03+2] Set cp4043 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046778 (https://phabricator.wikimedia.org/T364891) (owner: 10CDobbins) [21:41:17] !log cdobbins@cumin1002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye [21:41:28] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye [21:44:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P65118 and previous config saved to /var/cache/conftool/dbconfig/20240617-214449-ladsgroup.json [21:44:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:45:54] (03PS1) 10Ryan Kemper: sre.hadoop.reboot-workers: use ceil not floor [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) [21:47:27] (03PS1) 10Scott French: kubernetes: split kubernetes-prod.yaml by team [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) [21:50:00] (03CR) 10Ryan Kemper: "Here's a script to be run on cumin that demonstrates the error. 
Change floor to ceil to see the correct result:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper) [21:50:31] (03CR) 10Bking: [C:03+1] "Verified via python REPL" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper) [21:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:55:48] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [21:58:58] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2001.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [21:59:03] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [21:59:05] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9901389 (10RKemper) >>! In T367442#9890151, @VRiley-WMF wrote: > @RKemper When is there a preference on when we could schedule this? Whenever's convenient.... [21:59:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P65119 and previous config saved to /var/cache/conftool/dbconfig/20240617-215956-ladsgroup.json [22:00:33] (03CR) 10Scott French: "This is a follow-up to Ie95a774570128484e9bc681bb039b8f34e76cf0e." 
[alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [22:04:41] (03CR) 10Volans: "FYI like in T345880 also in this case if migrated to the rolling classes available in the cookbooks this bug would have not been present a" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper) [22:05:27] (03PS1) 10SBassett: Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 [22:05:39] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093 - https://phabricator.wikimedia.org/T367825 (10RKemper) 03NEW [22:05:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2001.codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [22:05:57] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [22:06:00] (03CR) 10SBassett: [C:04-2] "Please don't +2 until we're ready to deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (owner: 10SBassett) [22:08:21] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093 - https://phabricator.wikimedia.org/T367825#9901441 (10RKemper) [22:08:24] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093 - https://phabricator.wikimedia.org/T367825#9901439 (10RKemper) [22:08:55] (03PS2) 10SBassett: Update security.wikimedia.org helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046784 (https://phabricator.wikimedia.org/T365644) [22:11:31] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev200[2-3].codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [22:11:36] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 
[22:12:25] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [22:14:39] (03CR) 10Dzahn: [V:03+2 C:03+2] codesearch: fix dependencies on changing docker package name [puppet] - 10https://gerrit.wikimedia.org/r/1046776 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:15:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P65120 and previous config saved to /var/cache/conftool/dbconfig/20240617-221503-ladsgroup.json [22:15:36] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [22:21:02] (03PS5) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) [22:23:04] (03CR) 10Ryan Kemper: "Specifically, it returns this as the first batch, which is 4 hosts rather than the expected 3:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper) [22:25:00] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev200[2-3].codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [22:25:05] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [22:26:28] !log cdobbins@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4043.ulsfo.wmnet with OS bullseye [22:26:34] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye ex... 
[22:28:45] !log cdobbins@cumin1002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS bullseye [22:28:59] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye [22:30:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P65121 and previous config saved to /var/cache/conftool/dbconfig/20240617-223010-ladsgroup.json [22:30:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:30:20] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9901519 (10eoghan) [22:33:56] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) [22:42:31] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bookworm [22:48:12] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9901569 (10Scott_French) @SGupta-WMF - Thanks for letting me know. Given your... 
[22:49:22] !log cdobbins@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [22:52:23] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [22:55:24] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833 (10eoghan) 03NEW [22:57:04] (03PS2) 10Dzahn: lists: Block incoming email on lists hosts during mailman migration [puppet] - 10https://gerrit.wikimedia.org/r/1043799 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [22:57:04] (03CR) 10Dzahn: [C:03+1] "I haven't used it before but seems right to me, it should block port 25 using ferm as it looks." [puppet] - 10https://gerrit.wikimedia.org/r/1043799 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [22:57:20] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9901610 (10eoghan) a:05eoghan→03Ladsgroup [22:57:26] (03PS1) 10Dzahn: lists: Allow mail to be received on lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1046786 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [22:59:33] (03PS1) 10Dzahn: lists: Switch DB firewall rules to use primary host variable [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [22:59:33] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [22:59:34] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9901618 (10eoghan) https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/modules/profile/templates/mariadb/grants/production-m5.sql.erb#26 It's possible that the grants are alread... 
[23:00:40] (03PS4) 10Dzahn: lists: Migrate mailman primary host from lists1001 -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [23:00:40] (03CR) 10Dzahn: [C:03+1] "lgtm, as you already said I think you will have to manually remove the VIP from the old host. So disable puppet, remove IP, merge this, ru" [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [23:04:53] (03CR) 10Dzahn: [C:03+1] "once this is merged I think you can't run puppet on the old machine anymore since it will not find the service IP in Hiera. That's ok for " [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [23:06:11] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:08:35] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) (owner: 10Jforrester) [23:08:46] (03CR) 10Cory Massaro: [C:03+1] wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) (owner: 10Jforrester) [23:08:51] (03PS6) 10Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [23:08:52] (03PS6) 10Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [23:14:51] !log cdobbins@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS bullseye [23:15:01] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye completed: - cp4043 (**P... [23:16:29] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9901640 (10Papaul) Dear PAPAUL TSHIBAMBA, Thank you for contacting Juniper Networks Global Support. Case 2024-0617-183459 with Priority of P2 - High has been CREATED by you or a Juniper Age... [23:16:48] (03PS1) 10Zabe: Update composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046789 [23:20:31] (03PS1) 10Jdlrobson: Improve responsive images and avoid for inline [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046790 (https://phabricator.wikimedia.org/T367463) [23:23:11] !log cdobbins@cumin1002 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet [23:24:48] (03PS1) 10Zabe: Initial configuration for arbcom_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046791 (https://phabricator.wikimedia.org/T363825) [23:25:10] jouncebot: nowandnext [23:25:10] No deployments scheduled for the next 2 hour(s) and 34 minute(s) [23:25:11] In 2 hour(s) and 34 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0200) [23:25:47] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901678 (10CDobbins) [23:26:15] (03CR) 10Zabe: [C:03+2] Update composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046789 (owner: 10Zabe) [23:26:38] (03CR) 10Zabe: [C:03+2] Initial 
configuration for arbcom_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046791 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [23:26:55] (03Merged) 10jenkins-bot: Update composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046789 (owner: 10Zabe) [23:27:20] (03Merged) 10jenkins-bot: Initial configuration for arbcom_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046791 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [23:28:54] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9901692 (10Scott_French) From T366851: We now understand the slow-client-startup issue to be the result of connection timeouts when new(er) versions of... [23:29:03] !log create private wiki for itwiki arbcom # T363825 [23:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:07] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [23:29:32] !log zabe@deploy1002 Started scap: T363825 [23:31:26] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9901698 (10Scott_French) [23:34:14] !log zabe@deploy1002 zabe: T363825 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:34:18] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [23:34:48] !log zabe@deploy1002 zabe: Continuing with sync [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1046793 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1046793 (owner: 10TrainBranchBot) [23:39:16] (03PS1) 10Zabe: Initial configuration for u4cwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) [23:39:59] (03CR) 10CI reject: [V:04-1] Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [23:42:30] (03PS2) 10Zabe: Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) [23:43:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P65122 and previous config saved to /var/cache/conftool/dbconfig/20240617-234302-ladsgroup.json [23:43:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:43:08] (03CR) 10CI reject: [V:04-1] Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [23:44:28] (03PS3) 10Zabe: Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) [23:44:33] !log zabe@deploy1002 Finished scap: T363825 (duration: 15m 00s) [23:44:37] T363825: Create private wikipedia_it_arbcom wiki - https://phabricator.wikimedia.org/T363825 [23:45:16] (03CR) 10Zabe: [C:03+2] Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [23:45:54] (03Merged) 10jenkins-bot: Initial configuration for u4cwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046794 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [23:46:58] !log Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki # T366649 [23:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:03] T366649: Create an 'Universal Code of Conduct 
Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649 [23:47:18] !log zabe@deploy1002 Started scap: T366649 [23:48:52] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=arbcom_itwiki --cluster=all 2>&1 | tee /tmp/arbcom_it.UpdateSearchIndexConfig.log # T363825 [23:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:12] (03PS1) 10BCornwall: Set cp4044 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046797 (https://phabricator.wikimedia.org/T364891) [23:51:57] !log zabe@deploy1002 zabe: T366649 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:52:06] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4044.ulsfo.wmnet [23:52:18] (03CR) 10BCornwall: [C:03+2] Set cp4044 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1046797 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [23:52:46] !log zabe@deploy1002 zabe: Continuing with sync [23:58:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65123 and previous config saved to /var/cache/conftool/dbconfig/20240617-235809-ladsgroup.json