[00:04:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P77605 and previous config saved to /var/cache/conftool/dbconfig/20250611-000441-marostegui.json
[00:08:25] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1155346
[00:08:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1155346 (owner: TrainBranchBot)
[00:10:14] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:46] PROBLEM - Disk space on an-worker1107 is CRITICAL: DISK CRITICAL - free space: / 2056 MB (3% inode=95%): /tmp 2056 MB (3% inode=95%): /var/tmp 2056 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1107&var-datasource=eqiad+prometheus/ops
[00:19:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T396130)', diff saved to https://phabricator.wikimedia.org/P77606 and previous config saved to /var/cache/conftool/dbconfig/20250611-001949-marostegui.json
[00:19:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[00:20:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:29:16] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1155346 (owner: TrainBranchBot)
[00:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines.
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[01:30:28] ops-codfw, DC-Ops: Unresponsive management for nokiatest2002.mgmt:22 - https://phabricator.wikimedia.org/T396546 (phaultfinder) NEW
[01:31:25] ops-codfw, DC-Ops: Unresponsive management for nokiatest2001.mgmt:22 - https://phabricator.wikimedia.org/T396547 (phaultfinder) NEW
[01:31:46] PROBLEM - Disk space on an-worker1107 is CRITICAL: DISK CRITICAL - free space: / 2099 MB (3% inode=95%): /tmp 2099 MB (3% inode=95%): /var/tmp 2099 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1107&var-datasource=eqiad+prometheus/ops
[01:46:02] (PS1) DDesouza: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1155352 (https://phabricator.wikimedia.org/T219903)
[02:03:23] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1155352 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[02:05:14] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:05:32] (Merged) jenkins-bot: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1155352 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[02:06:12] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[02:06:28] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[02:06:29] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[02:06:45] !log dani@deploy1003 helmfile
[eqiad] DONE helmfile.d/services/miscweb: apply
[02:06:46] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[02:07:07] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[03:28:34] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[03:54:26] (PS1) KartikMistry: Update recommendation-api to 2025-06-10-203235-production [deployment-charts] - https://gerrit.wikimedia.org/r/1155359 (https://phabricator.wikimedia.org/T374695)
[04:00:54] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 197920720 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:01:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 30368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:21:35] <_joe_> truly nothing relevant
[04:23:56] (CR) Giuseppe Lavagetto: [C:+2] robots.txt: add crawl-delay directive for semrushbot [mediawiki-config] - https://gerrit.wikimedia.org/r/1148791 (owner: Giuseppe Lavagetto)
[04:24:42]
<_joe_> jouncebot: now
[04:24:42] No deployments scheduled for the next 1 hour(s) and 35 minute(s)
[04:24:42] (Merged) jenkins-bot: robots.txt: add crawl-delay directive for semrushbot [mediawiki-config] - https://gerrit.wikimedia.org/r/1148791 (owner: Giuseppe Lavagetto)
[04:24:54] <_joe_> jouncebot: next
[04:24:54] In 1 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T0600)
[04:25:06] <_joe_> yeah I'm going to do this a bit earlier than usua;
[04:25:08] <_joe_> *usual
[04:25:44] !log oblivian@deploy1003 Started scap sync-world: Backport for [[gerrit:1148791|robots.txt: add crawl-delay directive for semrushbot]]
[04:28:19] !log oblivian@deploy1003 oblivian: Backport for [[gerrit:1148791|robots.txt: add crawl-delay directive for semrushbot]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[04:30:34] !log oblivian@deploy1003 oblivian: Continuing with sync
[04:37:27] !log oblivian@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148791|robots.txt: add crawl-delay directive for semrushbot]] (duration: 11m 43s)
[04:43:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.514s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:48:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.491s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[05:05:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[05:06:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[05:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T396130)', diff saved to https://phabricator.wikimedia.org/P77607 and previous config saved to /var/cache/conftool/dbconfig/20250611-050911-marostegui.json
[05:09:15] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:10:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T396509', diff saved to https://phabricator.wikimedia.org/P77608 and previous config saved to /var/cache/conftool/dbconfig/20250611-051056-root.json
[05:11:00] T396509: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T396509
[05:11:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T396509
[05:11:21] (CR) Marostegui: [C:+2] mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1155289 (https://phabricator.wikimedia.org/T396509) (owner: Gerrit maintenance bot)
[05:15:02] !log Starting s6 codfw failover from db2229 to db2214 - T396509
[05:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T396509', diff saved to https://phabricator.wikimedia.org/P77609 and previous config saved to /var/cache/conftool/dbconfig/20250611-051525-marostegui.json
[05:16:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2229 T396509', diff saved to https://phabricator.wikimedia.org/P77610 and previous config saved to /var/cache/conftool/dbconfig/20250611-051612-marostegui.json
[05:16:16] T396509: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T396509
[05:16:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2229.codfw.wmnet with reason: Maintenance
[05:18:35] (PS1) Marostegui: db2229: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155366 (https://phabricator.wikimedia.org/T395989)
[05:19:25] (CR) Marostegui: [C:+2] db2229: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155366 (https://phabricator.wikimedia.org/T395989)
(owner: Marostegui)
[05:26:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T396130)', diff saved to https://phabricator.wikimedia.org/P77611 and previous config saved to /var/cache/conftool/dbconfig/20250611-052657-marostegui.json
[05:27:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:27:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77612 and previous config saved to /var/cache/conftool/dbconfig/20250611-052719-root.json
[05:28:54] (PS1) Marostegui: db2238: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155369 (https://phabricator.wikimedia.org/T396549)
[05:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2238 T396549', diff saved to https://phabricator.wikimedia.org/P77613 and previous config saved to /var/cache/conftool/dbconfig/20250611-052907-marostegui.json
[05:29:11] T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[05:29:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2238.codfw.wmnet with reason: Maintenance
[05:29:37] (PS3) Samwilson: InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975)
[05:30:08] (CR) Marostegui: [C:+2] db2238: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155369 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:35:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77614 and previous config saved to /var/cache/conftool/dbconfig/20250611-053527-root.json
[05:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1040',
diff saved to https://phabricator.wikimedia.org/P77615 and previous config saved to /var/cache/conftool/dbconfig/20250611-053903-marostegui.json
[05:39:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1040.eqiad.wmnet with reason: Maintenance
[05:40:55] (PS1) Gerrit maintenance bot: mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1155372 (https://phabricator.wikimedia.org/T396550)
[05:40:59] (PS1) Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1155373 (https://phabricator.wikimedia.org/T396550)
[05:42:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P77616 and previous config saved to /var/cache/conftool/dbconfig/20250611-054204-marostegui.json
[05:42:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77617 and previous config saved to /var/cache/conftool/dbconfig/20250611-054224-root.json
[05:43:18] (PS1) Marostegui: db-production.php: Disable writes in es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1155374 (https://phabricator.wikimedia.org/T396550)
[05:48:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77618 and previous config saved to /var/cache/conftool/dbconfig/20250611-054835-root.json
[05:50:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77619 and previous config saved to /var/cache/conftool/dbconfig/20250611-055033-root.json
[05:52:09] (PS1) Marostegui: db1233: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155375 (https://phabricator.wikimedia.org/T396549)
[05:52:23] !log marostegui@cumin1002 dbctl commit
(dc=all): 'Depool db1233 T396549', diff saved to https://phabricator.wikimedia.org/P77620 and previous config saved to /var/cache/conftool/dbconfig/20250611-055222-marostegui.json
[05:52:26] T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[05:52:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[05:54:31] (CR) Marostegui: [C:+2] db1233: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1155375 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:57:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77621 and previous config saved to /var/cache/conftool/dbconfig/20250611-055705-root.json
[05:57:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P77622 and previous config saved to /var/cache/conftool/dbconfig/20250611-055711-marostegui.json
[05:57:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77623 and previous config saved to /var/cache/conftool/dbconfig/20250611-055730-root.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T0600)
[06:00:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77624 and previous config saved to /var/cache/conftool/dbconfig/20250611-060048-root.json
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es1040', diff saved to https://phabricator.wikimedia.org/P77625 and previous config saved to /var/cache/conftool/dbconfig/20250611-060227-marostegui.json
[06:03:03] (CR) Marostegui: [C:+2] db-production.php: Disable writes in es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1155374 (https://phabricator.wikimedia.org/T396550) (owner: Marostegui)
[06:03:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 T396550
[06:03:33] T396550: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T396550
[06:03:49] (Merged) jenkins-bot: db-production.php: Disable writes in es7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1155374 (https://phabricator.wikimedia.org/T396550) (owner: Marostegui)
[06:04:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es1040', diff saved to https://phabricator.wikimedia.org/P77626 and previous config saved to /var/cache/conftool/dbconfig/20250611-060413-marostegui.json
[06:04:27] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1155374|db-production.php: Disable writes in es7 (T396550)]]
[06:05:04] (PS1) Marostegui: Revert "db-production.php: Disable writes in es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155377
[06:05:09] (CR) Marostegui: [C:-2] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155377 (owner: Marostegui)
[06:05:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77627 and previous config saved to /var/cache/conftool/dbconfig/20250611-060538-root.json
[06:05:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es1040', diff saved to https://phabricator.wikimedia.org/P77628 and previous config
saved to /var/cache/conftool/dbconfig/20250611-060552-marostegui.json
[06:06:32] (CR) Marostegui: [C:+2] mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1155372 (https://phabricator.wikimedia.org/T396550) (owner: Gerrit maintenance bot)
[06:06:40] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1155374|db-production.php: Disable writes in es7 (T396550)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:07:30] !log marostegui@deploy1003 marostegui: Continuing with sync
[06:12:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T396130)', diff saved to https://phabricator.wikimedia.org/P77629 and previous config saved to /var/cache/conftool/dbconfig/20250611-061219-marostegui.json
[06:12:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:12:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[06:12:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77630 and previous config saved to /var/cache/conftool/dbconfig/20250611-061236-root.json
[06:12:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77631 and previous config saved to /var/cache/conftool/dbconfig/20250611-061242-marostegui.json
[06:14:30] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155374|db-production.php: Disable writes in es7 (T396550)]] (duration: 10m 03s)
[06:14:33] T396550: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T396550
[06:15:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1035 with weight 0 T396550', diff saved to
https://phabricator.wikimedia.org/P77632 and previous config saved to /var/cache/conftool/dbconfig/20250611-061501-root.json
[06:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77633 and previous config saved to /var/cache/conftool/dbconfig/20250611-061553-root.json
[06:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1035 to es7 primary T396550', diff saved to https://phabricator.wikimedia.org/P77634 and previous config saved to /var/cache/conftool/dbconfig/20250611-061644-root.json
[06:17:27] (CR) Marostegui: [C:+2] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1155373 (https://phabricator.wikimedia.org/T396550) (owner: Gerrit maintenance bot)
[06:17:34] !log marostegui@dns1006 START - running authdns-update
[06:18:21] !log marostegui@dns1006 END - running authdns-update
[06:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool es1039', diff saved to https://phabricator.wikimedia.org/P77635 and previous config saved to /var/cache/conftool/dbconfig/20250611-061901-marostegui.json
[06:19:04] !log Starting es7 eqiad failover from es1039 to es1035 - T396550
[06:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:12] (CR) Marostegui: Revert "db-production.php: Disable writes in es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155377 (owner: Marostegui)
[06:19:19] (CR) Marostegui: [C:+2] Revert "db-production.php: Disable writes in es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155377 (owner: Marostegui)
[06:20:07] (Merged) jenkins-bot: Revert "db-production.php: Disable writes in es7" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155377 (owner: Marostegui)
[06:20:42] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1155377|Revert "db-production.php: Disable writes in es7"]]
[06:20:44] !log
marostegui@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77636 and previous config saved to /var/cache/conftool/dbconfig/20250611-062044-root.json
[06:22:50] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1155377|Revert "db-production.php: Disable writes in es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:23:45] !log marostegui@deploy1003 marostegui: Continuing with sync
[06:25:33] (PS1) Muehlenhoff: Apply installserver role on install7002 [puppet] - https://gerrit.wikimedia.org/r/1155503 (https://phabricator.wikimedia.org/T394263)
[06:25:42] !log installing libxml2 security updates
[06:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77637 and previous config saved to /var/cache/conftool/dbconfig/20250611-062741-root.json
[06:30:13] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[06:30:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77638 and previous config saved to /var/cache/conftool/dbconfig/20250611-063027-marostegui.json
[06:30:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:30:48] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155377|Revert "db-production.php: Disable writes in es7"]] (duration: 10m 06s)
[06:31:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77639 and previous config saved to /var/cache/conftool/dbconfig/20250611-063059-root.json
[06:32:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[06:32:51] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[06:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2238 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77640 and previous config saved to /var/cache/conftool/dbconfig/20250611-063549-root.json
[06:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.272s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:36:36] (CR) Jelto: [C:+2] gitlab: bump gitlab-settings to v1.8.0 [puppet] - https://gerrit.wikimedia.org/r/1155152 (https://phabricator.wikimedia.org/T395014) (owner: Jelto)
[06:38:21] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[06:38:34] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.272s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:42:08] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[06:42:45] (CR) Muehlenhoff: [C:+2] Apply installserver role on install7002 [puppet] - https://gerrit.wikimedia.org/r/1155503 (https://phabricator.wikimedia.org/T394263) (owner: Muehlenhoff)
[06:42:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77641 and previous config saved to /var/cache/conftool/dbconfig/20250611-064246-root.json
[06:43:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2028 T395241', diff saved to https://phabricator.wikimedia.org/P77642 and previous config saved to /var/cache/conftool/dbconfig/20250611-064314-marostegui.json
[06:43:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2028.codfw.wmnet with reason: Maintenance
[06:45:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P77643 and previous config saved to /var/cache/conftool/dbconfig/20250611-064535-marostegui.json
[06:46:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2027 T395241', diff saved to https://phabricator.wikimedia.org/P77644 and previous config saved to /var/cache/conftool/dbconfig/20250611-064606-marostegui.json
[06:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77645 and previous config saved to /var/cache/conftool/dbconfig/20250611-064611-root.json
[06:46:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2027.codfw.wmnet with reason: Maintenance
[06:48:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[06:48:21] !log
jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[06:49:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=1) rolling restart_daemons on A:wdqs-all
[06:49:11] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[06:50:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77646 and previous config saved to /var/cache/conftool/dbconfig/20250611-065013-root.json
[06:50:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[06:52:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77647 and previous config saved to /var/cache/conftool/dbconfig/20250611-065217-root.json
[06:56:29] (PS1) Jelto: gitlab: bump gitlab-settings to v1.9.0 [puppet] - https://gerrit.wikimedia.org/r/1155542 (https://phabricator.wikimedia.org/T395014)
[06:57:18] (CR) Arnaudb: [C:+1] "thanks for the hotfix!" [puppet] - https://gerrit.wikimedia.org/r/1155542 (https://phabricator.wikimedia.org/T395014) (owner: Jelto)
[06:57:40] (CR) Jelto: [C:+2] gitlab: bump gitlab-settings to v1.9.0 [puppet] - https://gerrit.wikimedia.org/r/1155542 (https://phabricator.wikimedia.org/T395014) (owner: Jelto)
[06:58:32] (CR) Brouberol: "We need to bump the chart version once more, as another change was merged in the meantime" [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis)
[07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:17] (PS4) Brouberol: Configure dse-k8s-worker100[2-3] with the dse_k8s::worker role [puppet] - https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557)
[07:00:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P77648 and previous config saved to /var/cache/conftool/dbconfig/20250611-070042-marostegui.json
[07:00:54] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[07:01:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77649 and previous config saved to /var/cache/conftool/dbconfig/20250611-070117-root.json
[07:03:40] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[07:04:08] (PS1) Alexandros Kosiaris: aux-k8s: Switch MTU to 1460 [puppet] - https://gerrit.wikimedia.org/r/1155543 (https://phabricator.wikimedia.org/T352956)
[07:05:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77650 and previous config saved to /var/cache/conftool/dbconfig/20250611-070519-root.json
[07:06:34] (PS1) Jelto: gitlab: bump gitlab-settings to v1.10.0 [puppet] - https://gerrit.wikimedia.org/r/1155545 (https://phabricator.wikimedia.org/T395014)
[07:07:09] (CR) Arnaudb: [C:+1] gitlab: bump gitlab-settings to v1.10.0 [puppet] - https://gerrit.wikimedia.org/r/1155545 (https://phabricator.wikimedia.org/T395014) (owner: Jelto)
[07:07:16] (CR) Alexandros Kosiaris: [C:+2] aux-k8s: Switch MTU to 1460 [puppet] - https://gerrit.wikimedia.org/r/1155543 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[07:07:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027
(re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77651 and previous config saved to /var/cache/conftool/dbconfig/20250611-070722-root.json [07:07:25] (03CR) 10Jelto: [C:03+2] gitlab: bump gitlab-settings to v1.10.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155545 (https://phabricator.wikimedia.org/T395014) (owner: 10Jelto) [07:09:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [07:10:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [07:14:47] RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [07:15:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T396130)', diff saved to https://phabricator.wikimedia.org/P77652 and previous config saved to /var/cache/conftool/dbconfig/20250611-071549-marostegui.json [07:15:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:16:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [07:16:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77653 and previous config saved to /var/cache/conftool/dbconfig/20250611-071612-marostegui.json [07:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77654 and previous config saved to /var/cache/conftool/dbconfig/20250611-072024-root.json [07:20:36] (03CR) 
10Elukey: [C:03+1] pyrra: update o11y slos to 4w window [puppet] - 10https://gerrit.wikimedia.org/r/1155246 (https://phabricator.wikimedia.org/T395916) (owner: 10Herron) [07:21:06] (03PS1) 10Slyngshede: IDP: Update stylesheets [dns] - 10https://gerrit.wikimedia.org/r/1155546 [07:22:10] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:22:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77655 and previous config saved to /var/cache/conftool/dbconfig/20250611-072227-root.json [07:24:00] (03CR) 10Slyngshede: [C:03+2] IDP: Update stylesheets [dns] - 10https://gerrit.wikimedia.org/r/1155546 (owner: 10Slyngshede) [07:24:08] !log slyngshede@dns1004 START - running authdns-update [07:24:58] !log slyngshede@dns1004 END - running authdns-update [07:25:01] (03PS1) 10Jelto: gitlab: bump gitlab-settings to v1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155568 (https://phabricator.wikimedia.org/T395014) [07:25:19] PROBLEM - TFTP service on install7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [07:25:40] (03CR) 10Arnaudb: [C:03+1] gitlab: bump gitlab-settings to v1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155568 (https://phabricator.wikimedia.org/T395014) (owner: 10Jelto) [07:26:01] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10902857 (10Jgiannelos) Hey @Eevans * Regarding mobile-sections this has been completely decommisioned for long time now. I don't think we need storage for this anymore. * Mobile-html and media-list has also... 
[07:26:03] (03CR) 10Fabfur: [C:03+2] hiera: x-provenance header on all DCs [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [07:26:05] (03CR) 10Jelto: [C:03+2] gitlab: bump gitlab-settings to v1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155568 (https://phabricator.wikimedia.org/T395014) (owner: 10Jelto) [07:27:23] (03PS1) 10Muehlenhoff: atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) [07:27:36] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [07:28:37] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install7002), Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:29:22] (03CR) 10CI reject: [V:04-1] atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:31:54] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1031.eqiad.wmnet [07:31:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [07:33:13] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1031.eqiad.wmnet [07:33:16] (03PS1) 10Vgutierrez: hiera: Switch magru to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1155593 (https://phabricator.wikimedia.org/T395131) [07:33:32] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [07:34:09] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155593 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [07:34:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77656 and previous config saved to /var/cache/conftool/dbconfig/20250611-073457-marostegui.json [07:35:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:35:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77657 and previous config saved to /var/cache/conftool/dbconfig/20250611-073530-root.json [07:35:58] (03PS2) 10Muehlenhoff: atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) [07:37:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77658 and previous config saved to /var/cache/conftool/dbconfig/20250611-073733-root.json [07:38:07] (03CR) 10Elukey: phabricator: expand support for Phabricator tasks (033 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans) [07:38:35] jmm@cumin1003 drain-node (PID 1099825) is awaiting input [07:39:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:03] (03CR) 10Elukey: [C:03+1] tox: add style checker and formatter environments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154766 (owner: 10Volans) [07:41:09] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:41:14] (03CR) 10Elukey: [C:03+1] git: add .git-blame-ignore-revs [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154776 (owner: 10Volans) [07:45:11] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [07:45:34] (03CR) 10Volans: [C:03+2] tox: add style checker and formatter environments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154766 (owner: 10Volans) [07:45:40] (03CR) 10Volans: [C:03+2] git: add .git-blame-ignore-revs [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154776 (owner: 10Volans) [07:47:09] (03CR) 10Ayounsi: [C:03+2] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [07:47:57] (03PS1) 10Muehlenhoff: Add stub keytab for install7002 [labs/private] - 10https://gerrit.wikimedia.org/r/1155594 [07:49:52] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add stub keytab for install7002 [labs/private] - 10https://gerrit.wikimedia.org/r/1155594 (owner: 10Muehlenhoff) [07:50:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P77659 and previous config saved to /var/cache/conftool/dbconfig/20250611-075004-marostegui.json [07:51:02] (03Merged) 10jenkins-bot: tox: add style checker and formatter environments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154766 (owner: 10Volans) [07:51:22] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5908/console" [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [07:51:45] (03PS3) 10Muehlenhoff: atftpd: Add support for Bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) [07:52:01] (03Merged) 10jenkins-bot: git: add .git-blame-ignore-revs [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154776 (owner: 10Volans) [07:52:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77660 and previous config saved to /var/cache/conftool/dbconfig/20250611-075240-root.json [07:52:57] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [07:53:18] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [07:53:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [07:56:22] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [07:57:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:57:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[07:59:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [07:59:29] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [07:59:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [08:00:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:01:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T395241)', diff saved to https://phabricator.wikimedia.org/P77661 and previous config saved to /var/cache/conftool/dbconfig/20250611-080101-fceratto.json [08:03:01] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [08:03:52] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [08:04:13] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:04:15] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [08:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P77662 and previous config saved to /var/cache/conftool/dbconfig/20250611-080511-marostegui.json [08:05:18] (03CR) 10Muehlenhoff: "The PCC failure for P5 is expected, we use Puppet 7 syntax." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [08:05:34] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host netflow1003.eqiad.wmnet [08:05:35] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:07:17] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [08:07:20] jmm@cumin1003 drain-node (PID 1102750) is awaiting input [08:07:29] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [08:07:45] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [08:09:17] (03PS1) 10Ayounsi: Add netflow1003 to profile::kafka::broker::custom_ferm_srange_component [puppet] - 10https://gerrit.wikimedia.org/r/1155599 [08:09:50] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow1003.eqiad.wmnet - ayounsi@cumin1003" [08:09:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [08:09:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow1003.eqiad.wmnet - ayounsi@cumin1003" [08:09:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:09:55] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache netflow1003.eqiad.wmnet on all recursors [08:09:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow1003.eqiad.wmnet on all recursors [08:10:29] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
netflow1003.eqiad.wmnet - ayounsi@cumin1003" [08:10:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T395241)', diff saved to https://phabricator.wikimedia.org/P77664 and previous config saved to /var/cache/conftool/dbconfig/20250611-081031-fceratto.json [08:10:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow1003.eqiad.wmnet - ayounsi@cumin1003" [08:10:54] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [08:11:02] (03PS1) 10Elukey: Rename docker_registry_ha's occurrences to docker_registry [labs/private] - 10https://gerrit.wikimedia.org/r/1155601 (https://phabricator.wikimedia.org/T390251) [08:11:23] (03PS3) 10Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) [08:11:23] (03PS4) 10Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) [08:11:23] (03PS3) 10Majavah: P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) [08:11:24] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [08:11:24] (03PS1) 10Majavah: P:openstack: pdns: Add type definition for host config [puppet] - 10https://gerrit.wikimedia.org/r/1155602 (https://phabricator.wikimedia.org/T396448) [08:11:25] (03PS1) 10Majavah: P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - 10https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448) [08:11:28] (03PS4) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 
10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [08:11:30] (03PS2) 10Alexandros Kosiaris: docker_registry: Move rsyslog rules from init to web.pp [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) [08:11:34] (03PS2) 10Alexandros Kosiaris: docker_registry: Refactor to allow >1 instance [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) [08:11:46] (03CR) 10Elukey: "Filed https://gerrit.wikimedia.org/r/c/labs/private/+/1155601 to support the PCC runs. Everything LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [08:12:08] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [08:12:32] (03CR) 10Elukey: [C:03+1] "Makes sense yes!" [puppet] - 10https://gerrit.wikimedia.org/r/1155257 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [08:12:47] (03PS14) 10Filippo Giunchedi: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [08:12:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5910/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [08:13:15] (03CR) 10CI reject: [V:04-1] P:openstack: pdns: Add type definition for host config [puppet] - 10https://gerrit.wikimedia.org/r/1155602 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [08:13:20] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [08:13:33] (03CR) 10Filippo Giunchedi: "I reworked the code a little in the next PS since that was easier said than explained through review cycles, let me know what you think !" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [08:13:35] ayounsi@cumin1003 makevm (PID 1102824) is awaiting input [08:13:50] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host netflow1003.eqiad.wmnet with OS bookworm [08:14:53] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [08:14:54] (03PS1) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [08:15:02] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562 (10Joe) 03NEW [08:15:05] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [08:15:40] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [08:17:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [08:18:32] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [08:18:35] (03CR) 10Elukey: "Left a nit but the idea looks really nice! I am going to wait for the +1 to when we'll have PPC ready, and/or a successful Pontoon test." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [08:18:42] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10903050 (10Joe) p:05Triage→03High [08:19:51] (03CR) 10Brouberol: [C:03+2] Convert an-db100[1-2] to dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1155119 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [08:19:54] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [08:20:02] (03PS15) 10Filippo Giunchedi: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [08:20:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77665 and previous config saved to /var/cache/conftool/dbconfig/20250611-082018-marostegui.json [08:20:22] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:20:28] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [08:20:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [08:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T396130)', diff saved to https://phabricator.wikimedia.org/P77666 and previous config saved to /var/cache/conftool/dbconfig/20250611-082039-marostegui.json [08:20:55] (03PS4) 10Muehlenhoff: atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) [08:21:05] (03CR) 10Muehlenhoff: atftpd: Add support for Bookworm (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [08:21:21] (03PS5) 10Muehlenhoff: atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) [08:21:53] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1155602 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [08:22:15] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [08:23:10] (03CR) 10Alexandros Kosiaris: "Yup, working on that." [puppet] - 10https://gerrit.wikimedia.org/r/1155258 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [08:25:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P77667 and previous config saved to /var/cache/conftool/dbconfig/20250611-082538-fceratto.json [08:25:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [08:26:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [08:26:33] !log brouberol@cumin2002 START - Cookbook sre.hosts.rename from an-db1001 to dse-k8s-worker1012 [08:26:57] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [08:27:45] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [08:27:48] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow1003.eqiad.wmnet with reason: host reimage [08:30:13] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:30:17] !log brouberol@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-db1001 to dse-k8s-worker1012 - brouberol@cumin2002" [08:30:46] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-db1001 to dse-k8s-worker1012 - brouberol@cumin2002" [08:30:46] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:30:47] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1012 on all recursors [08:30:50] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1012 on all recursors [08:30:51] !log brouberol@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1012 [08:30:53] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [08:32:09] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [08:32:10] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [08:32:11] !log brouberol@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1012 [08:32:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1003.eqiad.wmnet with reason: host reimage [08:32:51] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-db1001 to dse-k8s-worker1012 [08:33:29] RESOLVED: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:33:42] !log T395240 May 2025 Bookworm reboots: alert2002.wikimedia.org [08:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:47] (03CR) 10JMeybohm: [C:03+2] CI: Remove invasive log message on helmfile compilation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [08:35:02] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [08:35:06] !log brouberol@cumin2002 START - Cookbook sre.hosts.move-vlan for host dse-k8s-worker1012 [08:35:06] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dse-k8s-worker1012 [08:36:59] (03PS1) 10Muehlenhoff: profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 [08:37:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [08:37:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [08:37:23] (03CR) 10CI reject: [V:04-1] profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [08:37:43] (03PS1) 10Vgutierrez: hiera: Switch lvs7002 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1155610 (https://phabricator.wikimedia.org/T396561) [08:37:44] (03PS2) 10JMeybohm: CI: Remove invasive log message on helmfile compilation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) [08:37:44] (03PS3) 10JMeybohm: Add a script to visualize the dependencies 
of admin_ng environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) [08:38:03] (03CR) 10JMeybohm: Add a script to visualize the dependencies of admin_ng environments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [08:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T396130)', diff saved to https://phabricator.wikimedia.org/P77668 and previous config saved to /var/cache/conftool/dbconfig/20250611-083935-marostegui.json [08:39:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:39:53] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [08:40:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:40:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P77669 and previous config saved to /var/cache/conftool/dbconfig/20250611-084045-fceratto.json [08:40:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10903118 (10elukey) [08:44:09] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [08:45:10] (03PS1) 10Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) [08:45:11] (03PS1) 10Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) [08:45:24] (03PS2) 10Muehlenhoff: profile::memcached::instance: Add support for passing 
firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 [08:46:07] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1155593 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [08:46:34] (03CR) 10CI reject: [V:04-1] profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [08:46:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5911/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [08:47:59] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5912/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [08:51:25] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage [08:51:45] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [08:51:52] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [08:52:29] (03PS3) 10Muehlenhoff: profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 [08:52:42] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:53:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow1003.eqiad.wmnet with OS bookworm 
[08:53:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow1003.eqiad.wmnet [08:54:42] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage [08:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P77670 and previous config saved to /var/cache/conftool/dbconfig/20250611-085442-marostegui.json [08:55:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T395241)', diff saved to https://phabricator.wikimedia.org/P77671 and previous config saved to /var/cache/conftool/dbconfig/20250611-085552-fceratto.json [08:56:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [08:56:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T395241)', diff saved to https://phabricator.wikimedia.org/P77672 and previous config saved to /var/cache/conftool/dbconfig/20250611-085615-fceratto.json [08:57:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:58:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [08:58:41] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573 (10Vgutierrez) 03NEW [08:58:57] (03CR) 10Muehlenhoff: [C:03+2] atftpd: Add support for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1155585 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [08:59:13] 10SRE-swift-storage, 
06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10903266 (10Vgutierrez) p:05Triage→03High setting as high priority given the server is getting intermittently pooled and depooled potentially impacting user traffic [08:59:22] FIRING: GnmiTargetDown: fasw2-c1b-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:59:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [09:00:01] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-staging-worker [09:00:22] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:01:26] GnmiTargetDown is expected, should recover soon [09:03:07] FIRING: [11x] GnmiTargetDown: cloudsw1-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:03:29] FIRING: [11x] GnmiTargetDown: cloudsw1-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:04:22] RESOLVED: GnmiTargetDown: fasw2-c1b-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:05:01] (03PS2) 10Filippo Giunchedi: thanos: enable tracing for store [puppet] - 
10https://gerrit.wikimedia.org/r/1155153 (https://phabricator.wikimedia.org/T394318) [09:05:02] (03PS2) 10Filippo Giunchedi: thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) [09:05:30] (03PS4) 10Muehlenhoff: profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 [09:05:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T395241)', diff saved to https://phabricator.wikimedia.org/P77673 and previous config saved to /var/cache/conftool/dbconfig/20250611-090552-fceratto.json [09:06:01] jmm@cumin1003 drain-node (PID 1110568) is awaiting input [09:08:07] RESOLVED: [11x] GnmiTargetDown: cloudsw1-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:09:39] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [09:09:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P77674 and previous config saved to /var/cache/conftool/dbconfig/20250611-090949-marostegui.json [09:10:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:11:11] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [09:11:43] (03PS1) 10Muehlenhoff: Revert "Revert back to install7001" [puppet] - 10https://gerrit.wikimedia.org/r/1155616 (https://phabricator.wikimedia.org/T394263) [09:11:43] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [09:12:06] !log installing libfile-find-rule-perl security updates [09:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:36] !log brouberol@cumin2002 START - Cookbook sre.hosts.rename from an-db1002 to dse-k8s-worker1013 [09:14:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [09:15:06] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [09:17:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [09:17:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [09:18:46] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad [09:19:52] !log repool eqiad for inference.discovery.wmnet - was left depooled after a long maintenance for k8s infra changes a week ago [09:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:03] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-db1002 to dse-k8s-worker1013 - brouberol@cumin2002" [09:20:06] (03PS1) 10Ayounsi: Promote the TransitPeeringIn/OutSaturation alerts to p.aging [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) [09:20:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to 
https://phabricator.wikimedia.org/P77678 and previous config saved to /var/cache/conftool/dbconfig/20250611-092059-fceratto.json [09:21:10] !log brouberol@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-db1002 to dse-k8s-worker1013 - brouberol@cumin2002" [09:21:11] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:21:11] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1013 on all recursors [09:21:14] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1013 on all recursors [09:21:15] !log brouberol@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1013 [09:21:48] (03CR) 10CI reject: [V:04-1] Promote the TransitPeeringIn/OutSaturation alerts to p.aging [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:22:26] !log brouberol@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1013 [09:23:07] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-db1002 to dse-k8s-worker1013 [09:23:29] (03PS2) 10Ayounsi: Promote the TransitPeeringIn/OutSaturation alerts to p.aging [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) [09:24:06] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [09:24:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T396130)', diff saved to https://phabricator.wikimedia.org/P77679 and previous config saved to /var/cache/conftool/dbconfig/20250611-092457-marostegui.json [09:25:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 
[09:25:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [09:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T396130)', diff saved to https://phabricator.wikimedia.org/P77680 and previous config saved to /var/cache/conftool/dbconfig/20250611-092518-marostegui.json [09:26:18] (03PS2) 10Hnowlan: trafficserver: restbaseless reading lists API for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1149625 (https://phabricator.wikimedia.org/T384891) [09:29:13] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [09:29:46] (03PS1) 10Joal: Fix analytics webrequest data purge [puppet] - 10https://gerrit.wikimedia.org/r/1155621 (https://phabricator.wikimedia.org/T395934) [09:30:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155610 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:30:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch magru to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1155593 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [09:30:59] (03CR) 10Btullis: [C:03+2] Fix analytics webrequest data purge [puppet] - 10https://gerrit.wikimedia.org/r/1155621 (https://phabricator.wikimedia.org/T395934) (owner: 10Joal) [09:31:24] (03CR) 10JMeybohm: [C:03+2] Add a script to visualize the dependencies of admin_ng environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [09:31:27] (03CR) 10JMeybohm: [V:03+2 C:03+2] CI: Remove invasive log message on helmfile compilation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [09:34:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 
'sync' command on namespace 'zarcillo' for release 'main' . [09:36:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P77681 and previous config saved to /var/cache/conftool/dbconfig/20250611-093606-fceratto.json [09:37:17] !log use Google Trust Services (GTS) unified TLS certificate on magru - T395131 [09:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:20] T395131: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131 [09:37:34] (03CR) 10Majavah: "I'm not a huge fan of `src_sets` overriding `srange`. What do you think about adding both rules if both are set, or at least throwing a vi" [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [09:38:13] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [09:40:37] (03CR) 10Ayounsi: [C:03+1] Revert "Revert back to install7001" [puppet] - 10https://gerrit.wikimedia.org/r/1155616 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:40:45] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [09:42:10] (03PS1) 10Ayounsi: DHCP: install7001->7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1155622 (https://phabricator.wikimedia.org/T394263) [09:43:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1155622 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:43:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T396130)', diff saved to https://phabricator.wikimedia.org/P77682 and previous config saved to /var/cache/conftool/dbconfig/20250611-094319-marostegui.json [09:43:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - 
https://phabricator.wikimedia.org/T396130 [09:43:42] (03CR) 10Ayounsi: [C:03+2] DHCP: install7001->7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1155622 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:43:45] (03CR) 10Muehlenhoff: [C:03+2] Revert "Revert back to install7001" [puppet] - 10https://gerrit.wikimedia.org/r/1155616 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:44:12] (03Merged) 10jenkins-bot: DHCP: install7001->7002 [homer/public] - 10https://gerrit.wikimedia.org/r/1155622 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:44:13] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [09:47:59] jmm@cumin1003 drain-node (PID 1114280) is awaiting input [09:48:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [09:51:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T395241)', diff saved to https://phabricator.wikimedia.org/P77683 and previous config saved to /var/cache/conftool/dbconfig/20250611-095113-fceratto.json [09:51:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [09:51:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T395241)', diff saved to https://phabricator.wikimedia.org/P77684 and previous config saved to /var/cache/conftool/dbconfig/20250611-095139-fceratto.json [09:51:51] (03Merged) 10jenkins-bot: CI: Remove invasive log message on helmfile compilation error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155204 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [09:51:52] (03Merged) 10jenkins-bot: Add a script to visualize the dependencies of admin_ng environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155212 
(https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [09:53:04] !log restarting varnish on cp5018 to clear VarnishChildRestarted alert [09:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [09:56:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [09:56:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:58:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P77686 and previous config saved to /var/cache/conftool/dbconfig/20250611-095825-marostegui.json [09:58:27] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10903511 (10Ladsgroup) I looked at the host a bit, it looks healthy (no swapping, no cpu saturation, etc.), nothing in kernel logs, the proxy-logs don't show anything out o... [09:59:05] (03PS1) 10Muehlenhoff: Failover webproxy to install7002 [dns] - 10https://gerrit.wikimedia.org/r/1155624 (https://phabricator.wikimedia.org/T394263) [09:59:34] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10903517 (10Ladsgroup) What Matthew said about the front-end proxies was that when in doubt, just reboot them, it has uptime of 64 days and should be rebooted anyway, should... 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1000) [10:00:07] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [10:01:48] (03PS5) 10Brouberol: Configure dse-k8s-worker100[2-3] with the dse_k8s::worker role [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) [10:02:00] (03CR) 10JMeybohm: [V:03+2 C:03+2] Make simple-cfssl usable for local WMF PKI deployments [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1154266 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:02:00] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10903523 (10Vgutierrez) please go ahead @Ladsgroup [10:02:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T395241)', diff saved to https://phabricator.wikimedia.org/P77687 and previous config saved to /var/cache/conftool/dbconfig/20250611-100220-fceratto.json [10:02:33] (03PS6) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [10:02:45] (03PS6) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [10:03:27] (03CR) 10CI reject: [V:04-1] cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:03:40] (03CR) 10CI reject: [V:04-1] coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 
(https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:04:11] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:05:41] (03CR) 10Elukey: [C:03+1] Add netflow1003 to profile::kafka::broker::custom_ferm_srange_component [puppet] - 10https://gerrit.wikimedia.org/r/1155599 (owner: 10Ayounsi) [10:06:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [10:06:58] (03PS1) 10JMeybohm: Revert "CI: Remove invasive log message on helmfile compilation error" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155626 [10:07:36] kubestagemaster1003 will go down for a Ganeti reboot [10:07:41] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [10:09:36] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:09:40] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:11:18] (03PS2) 10JMeybohm: Revert "CI: Remove invasive log message on helmfile compilation error" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155626 [10:12:25] (03CR) 10JMeybohm: [V:03+2 C:03+2] Revert "CI: Remove invasive log message on helmfile compilation error" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155626 (owner: 10JMeybohm) [10:12:43] (03PS7) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [10:12:53] (03PS7) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [10:13:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P77688 and 
previous config saved to /var/cache/conftool/dbconfig/20250611-101332-marostegui.json [10:13:57] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:13:58] (03CR) 10Muehlenhoff: [C:03+2] Assign ncredir role to ncredir7003 [puppet] - 10https://gerrit.wikimedia.org/r/1153947 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:14:03] 06SRE, 07SRE-Unowned, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584 (10elukey) 03NEW [10:14:29] 06SRE, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10903575 (10elukey) [10:15:18] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [10:15:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [10:16:00] RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [10:16:54] PROBLEM - Host ms-fe2015 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P77689 and previous config saved to /var/cache/conftool/dbconfig/20250611-101727-fceratto.json [10:17:56] RECOVERY - Host ms-fe2015 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [10:18:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:20:01] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [10:22:37] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10903594 (10Ladsgroup) rebooted and I'm seeing the requests are flowing again with 200s. Let's see if that fixes the issue. [10:23:45] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [10:27:07] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155195 (owner: 10PipelineBot) [10:28:01] (03PS3) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) [10:28:14] (03PS2) 10Muehlenhoff: Add ncredir7003 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1153948 (https://phabricator.wikimedia.org/T394263) [10:28:38] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155195 (owner: 10PipelineBot) [10:28:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T396130)', diff saved to https://phabricator.wikimedia.org/P77690 and previous config saved to /var/cache/conftool/dbconfig/20250611-102839-marostegui.json [10:28:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:28:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [10:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T396130)', diff saved to https://phabricator.wikimedia.org/P77691 and previous 
config saved to /var/cache/conftool/dbconfig/20250611-102902-marostegui.json [10:29:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [10:29:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [10:30:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [10:30:46] (03CR) 10Ilias Sarantopoulos: [C:03+1] ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [10:31:48] (03PS1) 10Muehlenhoff: Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1155629 [10:32:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:32:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [10:32:32] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [10:32:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P77692 and previous config saved to /var/cache/conftool/dbconfig/20250611-103234-fceratto.json [10:35:56] (03CR) 10Alexandros Kosiaris: [C:03+1] coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:37:24] (03CR) 10Alexandros Kosiaris: [C:03+1] cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [10:39:12] jmm@cumin1003 drain-node (PID 1121814) is awaiting input [10:40:31] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [10:41:34] (03CR) 10Muehlenhoff: "Me neither, but OTOH this is really just transitionary: Once all call sites have moved off profile::memcached::srange, I'll change the log" [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [10:45:51] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [10:45:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [10:46:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [10:46:55] (03CR) 10Majavah: [C:03+1] "sgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [10:47:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T395241)', diff saved to https://phabricator.wikimedia.org/P77693 and previous config saved to /var/cache/conftool/dbconfig/20250611-104741-fceratto.json [10:47:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T396130)', diff saved to https://phabricator.wikimedia.org/P77694 and previous config saved to /var/cache/conftool/dbconfig/20250611-104750-marostegui.json [10:47:54] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:48:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [10:48:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:48:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T395241)', diff saved to https://phabricator.wikimedia.org/P77695 and previous config saved to /var/cache/conftool/dbconfig/20250611-104825-fceratto.json [10:48:34] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [10:50:00] kubestagemaster1004 and dse-k8s-etcd1002 will go down for a Ganeti reboot [10:50:05] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [10:52:04] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:32] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:53:28] (03CR) 10Muehlenhoff: [C:03+2] Temporarily disable access for Jon [puppet] - 10https://gerrit.wikimedia.org/r/1152307 (owner: 10Jdlrobson) [10:54:14] (03CR) 10Muehlenhoff: [C:03+2] "For when you're back; ping me and we will reinstate your access" [puppet] - 10https://gerrit.wikimedia.org/r/1152307 (owner: 10Jdlrobson) [10:55:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [10:55:36] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [10:55:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [10:56:02] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [10:56:57] FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:57:01] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [10:57:41] FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:59:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T395241)', diff saved to https://phabricator.wikimedia.org/P77696 and previous config saved to /var/cache/conftool/dbconfig/20250611-105900-fceratto.json [11:00:06] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1100). [11:00:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:01:03] ml-etcd1001 will go down for a Ganeti reboot [11:01:08] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [11:01:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:02:18] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:02:28] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:02:41] RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:47] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P77697 and previous config saved to /var/cache/conftool/dbconfig/20250611-110257-marostegui.json [11:03:09] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:03:12] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:50] (03PS1) 10Ladsgroup: mariadb: Comment out m4 [puppet] - 10https://gerrit.wikimedia.org/r/1155637 (https://phabricator.wikimedia.org/T395999) [11:05:31] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:05:42] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [11:05:57] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:06:21] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:06:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [11:06:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [11:06:51] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:07:52] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10903827 (10elukey) @SKivlehan-WMF Hi! 
I think you need to request access to the `wmf` LDAP group, please check https://wikitech.wikimedia.org/wiki/SRE... [11:08:28] c/12 [11:08:30] ufff [11:09:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:09:31] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154437 (owner: 10PipelineBot) [11:10:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti20[45-50] into production - https://phabricator.wikimedia.org/T396590 (10MoritzMuehlenhoff) 03NEW [11:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10903839 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:13:17] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [11:14:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P77698 and previous config saved to /var/cache/conftool/dbconfig/20250611-111407-fceratto.json [11:15:19] (03PS2) 10Volans: phabricator: expand support for Phabricator tasks [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 [11:15:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:16:36] (03CR) 10Volans: "addressed comments" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans) [11:16:44] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [11:16:58] (03CR) 10Ladsgroup: [C:03+2] mariadb: Comment out m4 [puppet] - 10https://gerrit.wikimedia.org/r/1155637 (https://phabricator.wikimedia.org/T395999) (owner: 10Ladsgroup) [11:17:26] !log installing librabbitmq security updates [11:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P77699 and previous config saved to /var/cache/conftool/dbconfig/20250611-111805-marostegui.json [11:19:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:20:14] (03CR) 10Muehlenhoff: [C:03+2] Add ncredir7003 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1153948 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:20:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:51] (03PS1) 10Andrew-WMDE: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155643 (https://phabricator.wikimedia.org/T396002) [11:22:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [11:22:11] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [11:24:31] !log jmm@cumin1003 
START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [11:27:09] (03PS1) 10Gmodena: dse-k8s-eqiad: remove deprecated dumps 2 config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155644 (https://phabricator.wikimedia.org/T396593) [11:28:35] !log Ran fixStuckGlobalRename.php for T396545 [11:28:38] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [11:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] T396545: Unblock stuck global rename of Tok'ra Operative - https://phabricator.wikimedia.org/T396545 [11:28:42] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [11:28:44] (03CR) 10Ayounsi: [C:03+2] Add netflow1003 to profile::kafka::broker::custom_ferm_srange_component [puppet] - 10https://gerrit.wikimedia.org/r/1155599 (owner: 10Ayounsi) [11:29:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P77700 and previous config saved to /var/cache/conftool/dbconfig/20250611-112914-fceratto.json [11:29:50] (03PS4) 10Ladsgroup: Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 [11:32:04] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir7003.magru.wmnet [11:32:15] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7003.magru.wmnet [11:32:19] (03CR) 10Ayounsi: [C:03+1] Failover webproxy to install7002 [dns] - 10https://gerrit.wikimedia.org/r/1155624 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:33:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T396130)', diff saved to https://phabricator.wikimedia.org/P77701 and previous config saved to /var/cache/conftool/dbconfig/20250611-113312-marostegui.json [11:33:17] T396130: Add 
afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:33:19] (03PS1) 10Brouberol: airflow: upgrade base image to pull a new cncf-kubernetes provider version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155645 (https://phabricator.wikimedia.org/T396476) [11:33:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2227.codfw.wmnet with reason: Maintenance [11:33:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T396130)', diff saved to https://phabricator.wikimedia.org/P77702 and previous config saved to /var/cache/conftool/dbconfig/20250611-113336-marostegui.json [11:33:59] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [11:33:59] !log disable lvs7003 secondary link switch port - T367731 [11:34:01] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [11:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:05] !log klausman@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-serve1001.eqiad.wmnet [11:34:05] T367731: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 [11:34:19] (03CR) 10Muehlenhoff: [C:03+2] Failover webproxy to install7002 [dns] - 10https://gerrit.wikimedia.org/r/1155624 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:34:23] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [11:34:24] !log jmm@dns1004 START - running authdns-update [11:34:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [11:35:14] !log jmm@dns1004 END - running authdns-update [11:35:19] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti1045.eqiad.wmnet [11:35:47] (03CR) 10Brouberol: [C:03+2] airflow: upgrade base image to pull a new cncf-kubernetes provider version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155645 (https://phabricator.wikimedia.org/T396476) (owner: 10Brouberol) [11:37:32] (03PS6) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [11:37:41] (03CR) 10CI reject: [V:04-1] ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [11:38:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [11:38:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:39:02] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [11:39:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:41:09] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [11:42:02] !log disable lvs3010 secondary link switch port - T367731 [11:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:05] T367731: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 [11:43:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet [11:43:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [11:44:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T395241)', diff saved to https://phabricator.wikimedia.org/P77703 and previous config saved 
to /var/cache/conftool/dbconfig/20250611-114422-fceratto.json [11:44:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:44:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77704 and previous config saved to /var/cache/conftool/dbconfig/20250611-114447-fceratto.json [11:46:12] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [11:47:17] (03PS7) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [11:51:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77706 and previous config saved to /var/cache/conftool/dbconfig/20250611-115140-fceratto.json [11:51:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [11:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T396130)', diff saved to https://phabricator.wikimedia.org/P77707 and previous config saved to /var/cache/conftool/dbconfig/20250611-115231-marostegui.json [11:52:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:56:17] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet [11:57:49] (03PS1) 10Muehlenhoff: Remove ncredir7001 from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1155649 (https://phabricator.wikimedia.org/T394263) [12:01:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet [12:01:38] !log jmm@cumin1003 END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet [12:06:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P77708 and previous config saved to /var/cache/conftool/dbconfig/20250611-120648-fceratto.json [12:07:01] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [12:07:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P77709 and previous config saved to /var/cache/conftool/dbconfig/20250611-120740-marostegui.json [12:12:18] (03PS1) 10Brouberol: Revert "airflow: upgrade base image to pull a new cncf-kubernetes provider version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155651 [12:12:52] (03PS1) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) [12:13:38] (03CR) 10CI reject: [V:04-1] ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [12:15:10] (03PS2) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) [12:16:13] 10ops-esams, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601 (10ayounsi) 03NEW p:05Triage→03Low [12:16:15] 10ops-magru: magru: remove old lvs secondary links - https://phabricator.wikimedia.org/T396602 (10ayounsi) 03NEW p:05Triage→03Low [12:17:25] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603 (10ayounsi) 03NEW p:05Triage→03Low 
[12:17:30] (03CR) 10Brouberol: [C:03+2] Revert "airflow: upgrade base image to pull a new cncf-kubernetes provider version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155651 (owner: 10Brouberol) [12:17:34] 10ops-esams, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#10904186 (10ayounsi) [12:21:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:21:15] (03CR) 10Ayounsi: [C:03+1] "Not my area of expertise but logic lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/1155649 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:21:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:21:54] RECOVERY - Disk space on an-worker1154 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops [12:21:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P77710 and previous config saved to /var/cache/conftool/dbconfig/20250611-122155-fceratto.json [12:22:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P77711 and previous config saved to /var/cache/conftool/dbconfig/20250611-122246-marostegui.json [12:22:56] (03CR) 10Ayounsi: [C:03+2] Netops: remove check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:23:12] RECOVERY - Disk space on an-worker1110 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [12:23:12] RECOVERY - Disk space on an-worker1131 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1131&var-datasource=eqiad+prometheus/ops [12:25:33] (03PS1) 10Ilias Sarantopoulos: ml-services: increase workers in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155655 (https://phabricator.wikimedia.org/T387019) [12:26:00] RECOVERY - Disk space on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [12:26:56] RECOVERY - Disk space on an-worker1124 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1124&var-datasource=eqiad+prometheus/ops [12:27:30] (03CR) 10Dat Nguyen: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155643 (https://phabricator.wikimedia.org/T396002) (owner: 10Andrew-WMDE) [12:28:06] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5913/console" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 (owner: 10FNegri) [12:30:09] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase workers in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155655 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:30:58] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase workers in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155655 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:31:48] RECOVERY - Disk space on an-worker1107 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1107&var-datasource=eqiad+prometheus/ops [12:31:53] (03PS2) 10FNegri: Revert "maintain-dbusers: Revert overly strict type" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 (https://phabricator.wikimedia.org/T395999) [12:32:26] (03Merged) 10jenkins-bot: ml-services: increase workers in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155655 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:32:26] (03PS1) 10Aklapper: Don't call time() more than needed [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155657 [12:32:30] RECOVERY - Disk space on an-worker1109 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [12:32:48] (03PS3) 10FNegri: Revert "maintain-dbusers: Revert overly strict type" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 (https://phabricator.wikimedia.org/T395999) [12:33:08] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 (https://phabricator.wikimedia.org/T395999) (owner: 10FNegri) [12:33:14] (03CR) 10Aklapper: [V:03+2 C:03+2] Don't call time() more than needed [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155657 (owner: 10Aklapper) [12:33:56] RECOVERY - Disk space on an-worker1117 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [12:35:26] RECOVERY - Disk space on an-worker1105 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [12:35:43] (03PS1) 10Aklapper: move a comment [phabricator/antivandalism] 
(wmf/stable) - 10https://gerrit.wikimedia.org/r/1155658 [12:35:51] (03CR) 10Tarrow: [C:03+1] "manually checked the image tag is also present. I apparently don't have +2 here (anymore?)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155643 (https://phabricator.wikimedia.org/T396002) (owner: 10Andrew-WMDE) [12:36:31] (03PS2) 10Aklapper: move a comment [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155658 [12:36:50] (03CR) 10Aklapper: [V:03+2 C:03+2] move a comment [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155658 (owner: 10Aklapper) [12:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T395241)', diff saved to https://phabricator.wikimedia.org/P77712 and previous config saved to /var/cache/conftool/dbconfig/20250611-123702-fceratto.json [12:37:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:37:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T395241)', diff saved to https://phabricator.wikimedia.org/P77713 and previous config saved to /var/cache/conftool/dbconfig/20250611-123727-fceratto.json [12:37:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T396130)', diff saved to https://phabricator.wikimedia.org/P77714 and previous config saved to /var/cache/conftool/dbconfig/20250611-123753-marostegui.json [12:37:57] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:41:14] (03CR) 10Majavah: [C:03+1] Revert "maintain-dbusers: Revert overly strict type" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 (https://phabricator.wikimedia.org/T395999) (owner: 10FNegri) [12:41:32] !log disable lvs7002 secondary link switch port - T367731 [12:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:41:36] T367731: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 [12:43:42] !log disable lvs7001 secondary link switch port - T367731 [12:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:49] (03CR) 10Filippo Giunchedi: pdb_resource_exporter: add puppetdb resource exporter to puppedb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [12:44:11] 10ops-magru: magru: remove old lvs secondary links - https://phabricator.wikimedia.org/T396602#10904321 (10ayounsi) [12:44:17] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: enable memcache on all titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/1155231 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [12:45:35] jouncebot: nowandnext [12:45:35] No deployments scheduled for the next 0 hour(s) and 14 minute(s) [12:45:35] In 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1300) [12:47:37] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet [12:47:39] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [12:48:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:48:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T395241)', diff saved to https://phabricator.wikimedia.org/P77715 and previous config saved to /var/cache/conftool/dbconfig/20250611-124834-fceratto.json [12:49:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:50:43] (03CR) 10FNegri: [C:03+2] Revert "maintain-dbusers: Revert overly strict type" [puppet] - 10https://gerrit.wikimedia.org/r/1155229 
(https://phabricator.wikimedia.org/T395999) (owner: 10FNegri) [12:51:03] (03PS1) 10Brouberol: airflow: upgrade base image to pull a new cncf-kubernetes provider version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155659 (https://phabricator.wikimedia.org/T396476) [12:51:17] !log disable lvs3009 secondary link switch port - T367731 [12:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:21] T367731: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 [12:53:16] (03CR) 10JMeybohm: [C:03+2] cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [12:53:59] 10ops-esams, 06SRE, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#10904354 (10ayounsi) [12:54:00] 06SRE, 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10904355 (10Scott_French) @elukey - Ah, I wonder if Google might have changed something. The 16384 number was based entirely on bisection with a small number of test events. It see... 
[12:54:28] (03Merged) 10jenkins-bot: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [12:54:39] 10ops-esams, 06SRE, 06DC-Ops: esams: remove old lvs secondary links - https://phabricator.wikimedia.org/T396601#10904373 (10ayounsi) [12:54:49] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet [12:54:52] !log disable lvs3008 secondary link switch port - T367731 [12:54:54] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [12:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:57] incoming gerrit spam -- apologies in advance [12:56:04] (03CR) 10JMeybohm: calico: Add support to manage CNI installation by daemonset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [12:56:22] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384308 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155218 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:28] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384321 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155222 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:34] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384425 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155226 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:38] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384427 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155230 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:42] (03PS1) 10Filippo Giunchedi: 
monitoring services: add migration task T384924 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155248 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:46] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T384922 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155245 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:50] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384933 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155250 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:53] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384938 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155251 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:56:56] (03PS1) 10Aklapper: Penalize on setting Due Date to default value [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155660 (https://phabricator.wikimedia.org/T396607) [12:56:57] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384939 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155254 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:01] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T328502 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155138 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:05] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T384998 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155136 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:08] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T370157 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155134 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:12] (03PS1) 10Filippo Giunchedi: monitoring 
services: add migration task T228830 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155144 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:15] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T309012 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155143 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:18] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T374842 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155142 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:22] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T367149 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155141 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:25] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T315866 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155139 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:29] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T367065 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155137 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:46] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T371083 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155131 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:50] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384309 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155133 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:57:58] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T370526 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155132 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:05] !log jmm@cumin1003 END (FAIL) - Cookbook 
sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet [12:58:06] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T374823 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155130 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:16] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T374839 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155129 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:20] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T375166 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155128 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:24] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384303 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155127 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:28] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T384305 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155124 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:32] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T385583 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155598 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:36] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T385590 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155600 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:40] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T321808 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155607 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:45] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T358029 to instances [puppet] - 
10https://gerrit.wikimedia.org/r/1155611 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:49] (03PS4) 10Filippo Giunchedi: monitoring services: add migration task T350694 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155140 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:53] (03PS1) 10Filippo Giunchedi: monitoring services: add migration task T332764 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155627 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:58:57] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T385587 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155135 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:01] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T384214 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155619 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:05] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T362397 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155612 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:13] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T370530 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155625 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:17] (03PS4) 10Filippo Giunchedi: monitoring services: add migration task T357099 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155145 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:21] (03PS2) 10Filippo Giunchedi: monitoring services: add migration task T384830 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155240 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:59:25] (03CR) 10Brouberol: [C:03+2] airflow: upgrade base image to pull a new cncf-kubernetes provider 
version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155659 (https://phabricator.wikimedia.org/T396476) (owner: 10Brouberol) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1300) [13:00:05] xSavitar, Lucas_WMDE, edsanders, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:00:12] I can deploy! [13:00:13] I'm here [13:00:16] o/ [13:00:18] hi [13:00:19] !log disable lvs6002 secondary link switch port - T367731 [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:23] T367731: drmrs/esams/magru LVS : remove cross-rack links - https://phabricator.wikimedia.org/T367731 [13:00:31] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155244 (https://phabricator.wikimedia.org/T396219) (owner: 10Lucas Werkmeister (WMDE)) [13:00:48] xSavitar: do you want to self-service your config change? 
[13:01:04] You can go ahead, I'm here to help with testing :) [13:01:09] ok :) [13:01:16] (03PS3) 10Jforrester: wikifunctions: Configure memcachedUri for the function-orchestrator and enable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) [13:01:16] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-06-03-205630 to 2025-06-09-163022 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155661 (https://phabricator.wikimedia.org/T390753) [13:01:26] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) [13:01:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [13:02:35] (03Merged) 10jenkins-bot: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [13:03:04] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1152064|SUL3: Enable client hints data on the auth shared domain (T395185)]] [13:03:04] (03PS2) 10Aklapper: Penalize on setting Due Date to default value [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155660 (https://phabricator.wikimedia.org/T396607) [13:03:08] T395185: Consider enabling client hints on auth.wikimedia.org - https://phabricator.wikimedia.org/T395185 [13:03:27] !log T393557 block requests to /api/rest_v1/page/data-parsoid [13:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:30] T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - 
https://phabricator.wikimedia.org/T393557 [13:03:37] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [13:03:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P77716 and previous config saved to /var/cache/conftool/dbconfig/20250611-130341-fceratto.json [13:03:46] (03CR) 10Aklapper: [V:03+2 C:03+2] Penalize on setting Due Date to default value [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155660 (https://phabricator.wikimedia.org/T396607) (owner: 10Aklapper) [13:04:22] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#10904450 (10ayounsi) [13:05:12] !log lucaswerkmeister-wmde@deploy1003 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1152064|SUL3: Enable client hints data on the auth shared domain (T395185)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:05:21] xSavitar: please test :) [13:05:24] okay [13:06:16] Lucas_WMDE, works as expected. [13:06:27] let's go live :) [13:06:41] (03CR) 10Samtar: [C:03+1] InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [13:06:45] nice! [13:07:15] !log lucaswerkmeister-wmde@deploy1003 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync [13:07:39] jmm@cumin1003 drain-node (PID 1138830) is awaiting input [13:08:28] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10904472 (10bking) a:03bking [13:08:41] Lucas_WMDE, thank you so much for deploying. 
🙏🏽 [13:11:40] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1047.eqiad.wmnet [13:13:06] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10904495 (10Eevans) >>! In T395845#10902857, @Jgiannelos wrote: > Hey @Eevans > > * Regarding mobile-sections this has been completely decommisioned for long time now. I don't think we need storage for this a... [13:13:21] (03PS1) 10Samtar: IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155665 (https://phabricator.wikimedia.org/T377975) [13:14:14] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152064|SUL3: Enable client hints data on the auth shared domain (T395185)]] (duration: 11m 09s) [13:14:17] T395185: Consider enabling client hints on auth.wikimedia.org - https://phabricator.wikimedia.org/T395185 [13:14:19] (03CR) 10Elukey: "This will impact some dashboards, most notably the Citoid one (to backlog and we already started the quarter). 
It should be fine but befor" [puppet] - 10https://gerrit.wikimedia.org/r/1155316 (https://phabricator.wikimedia.org/T395916) (owner: 10Herron) [13:14:40] my backport is almost done in CI, so I’ll wait for that to finish [13:14:50] I'll have a go at self-deploying [13:15:02] (after) [13:15:05] ok [13:15:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155244 (https://phabricator.wikimedia.org/T396219) (owner: 10Lucas Werkmeister (WMDE)) [13:15:17] (thanks for the clarification, I was about to complain :D) [13:16:04] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10904507 (10bking) 05Open→03Resolved `relforge100[89]` are now part of the cluster: ` bking@relforge1008:~$ curl -s http://0:9200/_cat/nodes 10.64.164.14... [13:18:44] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10904523 (10Eevans) >>! In T395845#10904495, @Eevans wrote: >>>! In T395845#10902857, @Jgiannelos wrote: >> >> [ ... ] > > Ok, and AFAIK a truncate would —in the worst case scenario— just result in a cold cac... 
[13:18:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P77717 and previous config saved to /var/cache/conftool/dbconfig/20250611-131848-fceratto.json [13:19:42] (03CR) 10Vgutierrez: [C:03+1] "ncredir7003 is already pooled, all good :)" [puppet] - 10https://gerrit.wikimedia.org/r/1155649 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.119s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:23:36] This ^ is probably because of T393557. Calls to parsoid have dropped in codfw from 0.15rps to close to 0. [13:23:36] T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557 [13:23:38] 10SRE-swift-storage, 06Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10904567 (10Vgutierrez) 05Open→03Resolved a:03Ladsgroup No errors since the reboot, feel free to re-open the task if the issue re-appears: ` vgutierrez@lvs2013:~$... [13:23:59] (03PS15) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [13:24:32] I'll silence it for now, let's see how the alert will behave in 15m or so. 
[13:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.119s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:25:43] 🤞 we will be able to remove mw-parsoid soon [13:25:59] (03Abandoned) 10Ayounsi: MR: rollback gNMI [homer/public] - 10https://gerrit.wikimedia.org/r/1133398 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [13:26:36] that CI build is taking longer than expected [13:26:38] :S [13:28:02] wondering if I should stop my scap and let edsanders go first [13:28:06] ah, no, it just finished! [13:28:18] then let’s let that deploy go through :) [13:28:29] (03Merged) 10jenkins-bot: Update searchsuggest message key [extensions/Wikibase] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155244 (https://phabricator.wikimedia.org/T396219) (owner: 10Lucas Werkmeister (WMDE)) [13:28:52] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1155244|Update searchsuggest message key (T396219)]] [13:28:53] (03Abandoned) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [13:28:55] T396219: ScopedTypeaheadSearch: update "searchsuggest" message key - https://phabricator.wikimedia.org/T396219 [13:29:09] akosiaris: that's awesome :D [13:29:34] akosiaris: if the alert doesn't shut off we can add a minimum rps threshold for it to fire [13:29:57] (03CR) 10Xcollazo: [C:03+1] dse-k8s-eqiad: remove deprecated dumps 2 config. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155644 (https://phabricator.wikimedia.org/T396593) (owner: 10Gmodena) [13:31:01] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1155244|Update searchsuggest message key (T396219)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:14] testing… [13:31:31] works! \o/ [13:31:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:33:38] (03CR) 10Herron: [C:03+1] thanos: enable tracing for store [puppet] - 10https://gerrit.wikimedia.org/r/1155153 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:33:51] (03CR) 10CI reject: [V:04-1] Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [13:33:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T395241)', diff saved to https://phabricator.wikimedia.org/P77718 and previous config saved to /var/cache/conftool/dbconfig/20250611-133355-fceratto.json [13:34:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [13:34:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T395241)', diff saved to https://phabricator.wikimedia.org/P77719 and previous config saved to /var/cache/conftool/dbconfig/20250611-133420-fceratto.json [13:34:37] MatmaRex: I’m guessing your two changes can be deployed together (once we get to them)? 
[13:34:53] Lucas_WMDE: yep [13:34:59] ack [13:36:51] (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2025-06-10-203235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155359 (https://phabricator.wikimedia.org/T374695) (owner: 10KartikMistry) [13:37:43] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10904642 (10Joe) There's a few open questions here: * In terms of pure traffic control, which is what SRE want, only running detection on cache mis... [13:38:23] (03Merged) 10jenkins-bot: Update recommendation-api to 2025-06-10-203235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155359 (https://phabricator.wikimedia.org/T374695) (owner: 10KartikMistry) [13:38:49] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155244|Update searchsuggest message key (T396219)]] (duration: 09m 57s) [13:38:52] T396219: ScopedTypeaheadSearch: update "searchsuggest" message key - https://phabricator.wikimedia.org/T396219 [13:38:59] Now witness the firepower of this fully armed and operational SpiderPig! [13:39:01] deploy at will, edsanders [13:39:47] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:39:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) (owner: 10Esanders) [13:40:04] “Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm [snip]' on 4 host(s)” o_O are we still running php7.4? 
[13:40:34] (03PS3) 10Ayounsi: Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) [13:40:46] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements everywhere except 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155295 (https://phabricator.wikimedia.org/T392121) (owner: 10Esanders) [13:40:59] (03CR) 10CI reject: [V:04-1] Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:41:08] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1155295|Enable DiscussionTools visual enhancements everywhere except 12 wikis (T392121)]] [13:41:12] T392121: Phase 4: Offer Usability Improvements as default-on feature at wikis - https://phabricator.wikimedia.org/T392121 [13:41:58] (03PS4) 10Ayounsi: Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) [13:42:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T395241)', diff saved to https://phabricator.wikimedia.org/P77720 and previous config saved to /var/cache/conftool/dbconfig/20250611-134230-fceratto.json [13:42:38] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:43:08] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:43:17] !log esanders@deploy1003 esanders: Backport for [[gerrit:1155295|Enable DiscussionTools visual enhancements everywhere except 12 wikis (T392121)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:43:59] (03CR) 10Hnowlan: [C:03+2] trafficserver: restbaseless reading lists API for all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1149625 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [13:45:07] !log migrating reading lists out of restbase for all wikis [13:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:41] (03CR) 10Alexandros Kosiaris: [C:03+1] calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [13:46:45] !log esanders@deploy1003 esanders: Continuing with sync [13:47:11] \o/ [13:47:54] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:48:55] !log Updated Recommendation-API to 2025-06-10-203235-production (T374695) [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:58] T374695: Community-defined Translation Collections: Support collections with multiple sub-collections - https://phabricator.wikimedia.org/T374695 [13:49:24] (03PS1) 10Aklapper: Split an if clause [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155677 [13:50:18] !log upload varnish 7.1.1-2~bpo11+wmf2 to apt.wm.o (bullseye-wikimedia) - T396581 [13:50:19] (03CR) 10Aklapper: [V:03+2 C:03+2] Split an if clause [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155677 (owner: 10Aklapper) [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:21] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [13:53:45] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155295|Enable DiscussionTools visual enhancements everywhere except 12 wikis (T392121)]] (duration: 12m 36s) [13:53:48] T392121: Phase 4: Offer Usability Improvements as default-on 
feature at wikis - https://phabricator.wikimedia.org/T392121 [13:55:25] alright, I’ll finish with MatmaRex’ changes then :) [13:55:41] thanks [13:55:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155299 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [13:55:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155303 (https://phabricator.wikimedia.org/T393963) (owner: 10Bartosz Dziewoński) [13:56:34] (03Merged) 10jenkins-bot: Set $wgPHPSessionHandling to 'disable' on testwiki and beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155299 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [13:56:35] (03PS5) 10Ayounsi: Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) [13:56:37] (03Merged) 10jenkins-bot: Stop logging $wgPHPSessionHandling warnings for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155303 (https://phabricator.wikimedia.org/T393963) (owner: 10Bartosz Dziewoński) [13:56:49] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:56:58] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1155299|Set $wgPHPSessionHandling to 'disable' on testwiki and beta cluster (T362324)]], [[gerrit:1155303|Stop logging $wgPHPSessionHandling warnings for now (T393963)]] [13:57:03] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:57:04] T393963: PHP Deprecated: Use of $_SESSION was deprecated in MediaWiki 1.27. 
[Called from session_write_close in (internal function)] - https://phabricator.wikimedia.org/T393963 [13:57:25] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 10310 [13:57:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P77721 and previous config saved to /var/cache/conftool/dbconfig/20250611-135736-fceratto.json [13:58:39] (03CR) 10Ssingh: [C:03+1] hiera: Switch lvs7002 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1155610 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:59:08] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1155299|Set $wgPHPSessionHandling to 'disable' on testwiki and beta cluster (T362324)]], [[gerrit:1155303|Stop logging $wgPHPSessionHandling warnings for now (T393963)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:59:29] looking [13:59:38] (03PS1) 10Vgutierrez: hiera: Switch eqsin to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1155681 (https://phabricator.wikimedia.org/T395131) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1400) [14:00:20] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 10310 [14:00:35] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 10310 [14:00:48] Lucas_WMDE: seems good [14:01:04] (03CR) 10Andrew-WMDE: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155643 (https://phabricator.wikimedia.org/T396002) (owner: 10Andrew-WMDE) [14:01:04] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10904763 (10cmelo) Hi @elukey, thanks, and yes I would like some help with the process, I also would like to request: **analytics-privatedata-users** access, because I usually need to read some f... [14:01:11] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Continuing with sync [14:01:12] ok! [14:01:50] oh, I didn’t realize we’re already over time :S [14:02:16] Lucas_WMDE: It's fine, we don't clash in practice. 
[14:02:39] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155643 (https://phabricator.wikimedia.org/T396002) (owner: 10Andrew-WMDE) [14:02:42] (03CR) 10Vgutierrez: [C:03+1] "thanks for working on this" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [14:04:21] !log andrew-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [14:04:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 10310 [14:06:00] (03PS2) 10Jforrester: wikifunctions: Update evaluators from 2025-06-03-205630 to 2025-06-09-163022 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155661 (https://phabricator.wikimedia.org/T390753) [14:06:00] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) [14:06:00] (03PS4) 10Jforrester: wikifunctions: Configure memcachedUri for the function-orchestrator and enable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) [14:06:33] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10904773 (10cmelo) @elukey, Here are the public keys: PROD public key: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICB5a7B3Lik8aSZpI3TOgV6uBExCmrkmn8FE/3PHmClG claudiomelo@wmf3041 ` WMCS public key:... 
[14:06:40] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Update evaluators from 2025-06-03-205630 to 2025-06-09-163022 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155661 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:06:42] !log andrew-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [14:08:08] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-06-03-205630 to 2025-06-09-163022 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155661 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:08:12] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155299|Set $wgPHPSessionHandling to 'disable' on testwiki and beta cluster (T362324)]], [[gerrit:1155303|Stop logging $wgPHPSessionHandling warnings for now (T393963)]] (duration: 11m 14s) [14:08:18] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [14:08:18] T393963: PHP Deprecated: Use of $_SESSION was deprecated in MediaWiki 1.27. [Called from session_write_close in (internal function)] - https://phabricator.wikimedia.org/T393963 [14:09:45] Lucas_WMDE: thanks! [14:10:20] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:27] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcontrol2010-dev service implementation - https://phabricator.wikimedia.org/T396064#10904809 (10Andrew) 05Open→03Resolved [14:10:42] MatmaRex: np! 
I’m excited to see those deprecation warnings go down in logspam-watch ^^ [14:10:51] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:59] !log andrew-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [14:11:20] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:11:37] !log UTC afternoon backport+config window done [14:11:39] (03PS1) 10Tchanders: temp accounts: Enable temp account creation on three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) [14:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] (03PS1) 10Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) [14:11:49] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:57] (03CR) 10Cory Massaro: wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:12:23] !log andrew-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [14:12:36] (03CR) 10Tchanders: [C:04-2] "Planned for 24 June, 2025. Requires go-ahead from comms." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) (owner: 10Tchanders) [14:12:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P77722 and previous config saved to /var/cache/conftool/dbconfig/20250611-141243-fceratto.json [14:12:52] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:00] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:06] (03CR) 10Tchanders: [C:04-2] "Planned for 17 June, 2025. Requires go-ahead from comms." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders) [14:13:41] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:14:12] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:15:18] !log andrew-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [14:15:43] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-06-04-185118 to 2025-06-10-144243 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155662 (https://phabricator.wikimedia.org/T390753) (owner: 10Jforrester) [14:15:44] (03PS1) 10Jforrester: WikiLambda: Set repo-only config only in repo mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155686 [14:15:44] (03PS1) 10Jforrester: WikiLambda: Enable orchestrator cache updates on edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155687 (https://phabricator.wikimedia.org/T390746) [14:15:52] !log andrew-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [14:15:55] !log elukey@cumin1003 START - Cookbook 
sre.hosts.provision for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:16:32] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:53] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:17:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:18:09] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:18:28] 06SRE, 06Traffic: haproxy is able to load the same GeoIP & IP-to-ASN data as Varnish does - https://phabricator.wikimedia.org/T329849#10904877 (10Fabfur) 05Open→03Resolved p:05Triage→03Medium a:03Fabfur [14:18:41] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:18:49] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:19:18] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:21:01] (03CR) 10Ssingh: [C:03+2] Release 9.2.10-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1155237 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [14:21:07] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Configure memcachedUri for the function-orchestrator and enable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) (owner: 10Jforrester) [14:21:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:22:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:22:43] (03Merged) 10jenkins-bot: wikifunctions: Configure memcachedUri for the function-orchestrator and enable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155241 (https://phabricator.wikimedia.org/T390746) (owner: 10Jforrester) [14:23:37] !log 
apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:24:42] (03CR) 10Ssingh: "NOOP on all DNS hosts: https://puppet-compiler.wmflabs.org/output/1052109/5917/" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:26:20] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:27:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T395241)', diff saved to https://phabricator.wikimedia.org/P77723 and previous config saved to /var/cache/conftool/dbconfig/20250611-142750-fceratto.json [14:28:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [14:28:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77724 and previous config saved to /var/cache/conftool/dbconfig/20250611-142816-fceratto.json [14:28:48] (03PS1) 10Cory Massaro: wikifunctions: make the JSON good with commas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155690 [14:28:53] (03CR) 10Jforrester: [C:03+2] wikifunctions: make the JSON good with commas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155690 (owner: 10Cory Massaro) [14:30:19] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir7001 from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1155649 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:30:28] (03Merged) 10jenkins-bot: wikifunctions: make the JSON good with commas. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155690 (owner: 10Cory Massaro) [14:31:45] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:31:55] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:34:37] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [14:34:57] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [14:36:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77726 and previous config saved to /var/cache/conftool/dbconfig/20250611-143633-fceratto.json [14:37:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155681 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [14:39:15] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [14:40:24] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet [14:40:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [14:41:14] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127188 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [14:44:52] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts relforge[1003-1004].eqiad.wmnet [14:45:18] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts relforge[1003-1004].eqiad.wmnet [14:46:47] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [14:51:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P77727 and previous config saved to /var/cache/conftool/dbconfig/20250611-145140-fceratto.json [14:53:18] (03PS1) 10Elukey: sre.hosts.provision: improve Supermicro's PXE configs and logs [cookbooks] - 10https://gerrit.wikimedia.org/r/1155697 [14:54:39] (03PS1) 10Bking: cirrus-streaming-updater (staging): remove references to defunct host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155698 (https://phabricator.wikimedia.org/T390565) [14:54:46] (03CR) 10Volans: [C:03+1] "LGTM, question inline for more future-proofing" [cookbooks] - 10https://gerrit.wikimedia.org/r/1155697 (owner: 10Elukey) [14:55:36] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491#10905094 (10Tarrow) 05Resolved→03Open I was just trying to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1155643 and discovered that I seem to... 
[14:56:11] (03PS8) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [14:56:11] (03PS3) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) [14:56:44] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1052109/5918/" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:57:18] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [14:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:57:54] (03PS2) 10AOkoth: wmnet: switch active doc host [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) [14:57:57] (03CR) 10Elukey: sre.hosts.provision: improve Supermicro's PXE configs and logs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1155697 (owner: 10Elukey) [14:58:27] (03CR) 10Bking: [C:03+2] cirrus-streaming-updater (staging): remove references to defunct host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155698 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [14:58:38] (03PS1) 10Alexandros Kosiaris: wikifunctions: Enable staging access to memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155702 (https://phabricator.wikimedia.org/T391986) [14:58:44] (03CR) 10Bking: [C:03+2] "self-merging, as this only affects a staging environment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155698 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [14:59:30] (03CR) 10Ssingh: [C:03+1] "Looking at the PCC diffs, +1 for the Monday deployment." [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:59:54] (03CR) 10Volans: [C:03+1] sre.hosts.provision: improve Supermicro's PXE configs and logs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1155697 (owner: 10Elukey) [15:00:12] (03Merged) 10jenkins-bot: cirrus-streaming-updater (staging): remove references to defunct host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155698 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [15:00:18] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: improve Supermicro's PXE configs and logs [cookbooks] - 10https://gerrit.wikimedia.org/r/1155697 (owner: 10Elukey) [15:00:53] (03PS9) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) [15:00:53] (03PS2) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [15:01:53] !log bking@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:02:13] !log bking@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:02:25] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155703 [15:02:54] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10905110 (10elukey) 05Resolved→03Open [15:03:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:03:08] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:03:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:03:23] (03Abandoned) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [15:03:26] (03CR) 10Clément Goubert: [C:03+1] wikifunctions: Enable staging access to memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155702 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:04:38] (03CR) 10Ssingh: [C:03+1] hiera: Switch eqsin to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1155681 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [15:04:51] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Enable staging access to memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155702 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:06:15] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [15:06:35] (03Merged) 10jenkins-bot: wikifunctions: Enable staging access to memcached [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155702 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:45] (03PS3) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [15:06:48] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P77729 and previous config saved to /var/cache/conftool/dbconfig/20250611-150647-fceratto.json [15:06:58] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts relforge[1003-1004].eqiad.wmnet [15:08:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:08:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:09:04] (03PS4) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [15:09:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:09:13] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:39] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:10:42] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.10-1wm2_amd64.changes: T390912 [15:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:45] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [15:13:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491#10905150 (10Addshore) Apr 20, 2023 Removed @Tarrow (1620) @thcipriani (2321) https://gerrit.wikimedia.org/r/admin/groups/3fdcf8fd0d569e90a3e9b39788a29f2c50d33be9,audit... 
[15:13:29] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [reason: testing 9.2.10 upgrade] [15:14:29] !log depool cp4037 to test ATS 9.2.10 upgrade: T390912 [15:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:45] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dse-k8s-worker1012.eqiad.wmnet [15:15:10] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:15:20] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp4037*} and A:cp - 9.2.10 upgrade (T390912) [15:15:25] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:45] (03PS1) 10Clare Ming: xLab: Deploy v0.6.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155708 (https://phabricator.wikimedia.org/T396457) [15:16:53] (03PS1) 10Bking: relforge: remove decomm'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1155709 (https://phabricator.wikimedia.org/T390565) [15:18:20] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:18:28] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:18:46] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:18:56] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:18:58] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:12] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:14] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:30] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:38] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:39] (03PS1) 10Cwhite: add ecs.version checking [software/ecs] - 10https://gerrit.wikimedia.org/r/1155710 (https://phabricator.wikimedia.org/T395819) [15:19:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp4037*} and A:cp - 9.2.10 upgrade (T390912) [15:19:46] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [15:19:46] RECOVERY - nova-compute proc minimum on 
cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:56] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:20:14] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:20:15] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:20:21] (03CR) 10Bking: [C:03+2] relforge: remove decomm'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1155709 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [15:20:30] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:20:34] (03CR) 10Cwhite: [C:03+2] add ecs.version checking [software/ecs] - 10https://gerrit.wikimedia.org/r/1155710 (https://phabricator.wikimedia.org/T395819) (owner: 10Cwhite) [15:20:38] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:20:53] bking@cumin2002 decommission (PID 3555323) is awaiting input [15:20:56] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:20:58] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.6.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155708 
(https://phabricator.wikimedia.org/T396457) (owner: 10Clare Ming) [15:20:58] (03Merged) 10jenkins-bot: add ecs.version checking [software/ecs] - 10https://gerrit.wikimedia.org/r/1155710 (https://phabricator.wikimedia.org/T395819) (owner: 10Cwhite) [15:21:06] (03CR) 10Bking: [C:03+2] "Self-merging, as these hosts are already decommed." [puppet] - 10https://gerrit.wikimedia.org/r/1155709 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [15:21:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491#10905187 (10thcipriani) 05Open→03Resolved >>! In T208491#10905149, @Addshore wrote: > Apr 20, 2023 Removed @Tarrow (1620) @thcipriani (2321) > > https://gerrit... [15:21:17] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:21:50] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [reason: repooling after testing 9.2.10 upgrade: T390912] [15:21:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T395241)', diff saved to https://phabricator.wikimedia.org/P77731 and previous config saved to /var/cache/conftool/dbconfig/20250611-152155-fceratto.json [15:22:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491#10905205 (10Tarrow) Thanks for the super fast response! 
[15:22:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1232.eqiad.wmnet with reason: Maintenance [15:22:19] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch eqsin to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1155681 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [15:22:20] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:20] PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T395241)', diff saved to https://phabricator.wikimedia.org/P77732 and previous config saved to /var/cache/conftool/dbconfig/20250611-152220-fceratto.json [15:22:33] (03Merged) 10jenkins-bot: xLab: Deploy v0.6.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155708 (https://phabricator.wikimedia.org/T396457) (owner: 10Clare Ming) [15:22:47] PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:56] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:56] PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:58] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:23:20] RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:23:28] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:23:46] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [15:23:47] RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:23:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491#10905211 (10Addshore) {meme, src="seal-of-approval", above="Such speed", below="much fast"} [15:23:56] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:24:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1012.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" 
[15:24:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dse-k8s-worker1012.eqiad.wmnet [15:24:16] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: relforge[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [15:24:20] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:24:22] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: relforge[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [15:24:22] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts relforge[1003-1004].eqiad.wmnet [15:24:56] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:25:12] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:25:15] (03PS1) 10Alexandros Kosiaris: wikifunctions: Fix the syntax of memcachdUri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155711 (https://phabricator.wikimedia.org/T391986) [15:25:20] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:25:31] (03CR) 10Ilias Sarantopoulos: [C:04-1] "the list of models and threshold definitions for simplewiki and trwikfrom https://gerrit.wikimedia.org/r/q/Ifac4768d27eebab0cbd749ae8e5f06" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [15:25:56] RECOVERY - nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:26:18] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:26:50] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:27:58] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Fix the syntax of memcachdUri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155711 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:28:01] (03CR) 10Jforrester: [C:03+1] wikifunctions: Fix the syntax of memcachdUri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155711 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:28:03] (03CR) 10Clément Goubert: [C:03+1] wikifunctions: Fix the syntax of memcachdUri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155711 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:28:25] bking@cumin2002 decommission (PID 3568090) is awaiting input [15:29:01] !log use Google Trust Services (GTS) unified TLS certificate on eqsin - T395131 [15:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:04] T395131: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131 [15:29:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1232 (T395241)', diff saved to https://phabricator.wikimedia.org/P77733 and previous config saved to /var/cache/conftool/dbconfig/20250611-152923-fceratto.json [15:29:29] (03Merged) 10jenkins-bot: wikifunctions: Fix the syntax of memcachdUri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155711 (https://phabricator.wikimedia.org/T391986) (owner: 10Alexandros Kosiaris) [15:30:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:32:04] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): decommission relforge100[34] - https://phabricator.wikimedia.org/T390565#10905248 (10bking) a:05bking→03None [15:32:37] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): decommission relforge100[34] - https://phabricator.wikimedia.org/T390565#10905254 (10bking) Hello DC Ops, I think these hosts are ready for y'all. If that's not the case, ping me here on in IRC (inflatador) and... [15:34:06] (03PS2) 10Bking: search: Return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154828 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [15:34:08] (03PS1) 10Cwhite: refactor ecs.version testing [software/ecs] - 10https://gerrit.wikimedia.org/r/1155712 [15:34:15] (03PS8) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [15:34:19] (03PS3) 10Bking: search: Return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154828 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [15:34:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:35:00] (03CR) 10Cwhite: [C:03+2] refactor ecs.version testing [software/ecs] - 10https://gerrit.wikimedia.org/r/1155712 (owner: 10Cwhite) [15:35:28] (03Merged) 10jenkins-bot: refactor ecs.version testing [software/ecs] - 10https://gerrit.wikimedia.org/r/1155712 (owner: 10Cwhite) [15:36:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:36:57] (03CR) 10Bking: [C:03+2] search: Return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154828 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [15:37:31] (03PS1) 10Bking: Revert "search: Return traffic to all DCs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155713 [15:37:39] (03CR) 10Bking: [V:03+2 C:03+2] Revert "search: Return traffic to all DCs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155713 (owner: 10Bking) [15:38:17] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 264525 [15:38:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 264525 [15:38:53] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:39:28] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:40:08] (03CR) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [15:40:39] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10905329 (10elukey) >>! In T395966#10904763, @cmelo wrote: > Hi @elukey, thanks, and yes I would like some help with the process, I also would like to request: **analytics-privatedata-users** acc... 
[15:42:50] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: After moving dse-k8s-worker1012 vlan - btullis@cumin1002" [15:42:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: After moving dse-k8s-worker1012 vlan - btullis@cumin1002" [15:42:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:33] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:44:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P77734 and previous config saved to /var/cache/conftool/dbconfig/20250611-154430-fceratto.json [15:44:51] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1012 [15:46:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1012 [15:46:38] (03PS1) 10Elukey: admin: move cmelo to ssh user [puppet] - 10https://gerrit.wikimedia.org/r/1155717 (https://phabricator.wikimedia.org/T395966) [15:47:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [15:50:15] (03CR) 10Scott French: [C:03+2] shellbox: align image version to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127188 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [15:50:33] (03CR) 10Cathal Mooney: [C:03+1] Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [15:52:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:52:01] (03Merged) 10jenkins-bot: shellbox: align image version to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127188 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [15:53:30] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:31] (03PS1) 10Hnowlan: services_proxy: change mobileapps port [puppet] - 10https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) [15:54:31] huh [15:54:52] (03PS2) 10Hnowlan: services_proxy: change mobileapps port [puppet] - 10https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) [15:54:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:01] !incidents [15:55:01] 6341 (UNACKED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [15:55:05] uh oh [15:55:07] !ack 6341 [15:55:08] 6341 (ACKED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [15:55:09] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [15:55:30] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:55:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1155717 (https://phabricator.wikimedia.org/T395966) (owner: 10Elukey) [15:55:37] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:55:52] !log swfrench@deploy1003 
helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:55:59] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:56:10] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:56:17] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [15:56:20] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:26] ttfb is way up in the last 25 minutes or so [15:56:30] yep [15:56:31] <_joe_> sukhe: fwiw things are ok here [15:56:34] moving to -security [15:56:35] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [15:56:39] <_joe_> like the site is blazing fast [15:56:40] _joe_: only eqsin is impacted [15:56:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:56:42] eqsin feels slow [15:56:44] even on ssh [15:57:09] that also explains the purged alerts [15:57:22] !log dancy@deploy1003 Installing scap version "4.173.0" for 2 host(s) [15:57:41] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:48] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:57:51] (03PS1) 10Btullis: Remove obsolete analytics_cluster::postgresql role and profile 
[puppet] - 10https://gerrit.wikimedia.org/r/1155720 (https://phabricator.wikimedia.org/T395557) [15:58:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:58:56] !incidents [15:58:57] 6341 (ACKED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [15:58:57] 6342 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [15:59:02] !ack 6342 [15:59:03] 6342 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [15:59:03] !ack 6342 [15:59:04] 6342 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [15:59:12] !log dancy@deploy1003 Installation of scap version "4.173.0" completed for 2 hosts [15:59:19] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P77735 and previous config saved to /var/cache/conftool/dbconfig/20250611-155937-fceratto.json [15:59:47] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:59:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:13] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:00:22] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [16:00:23] (03PS5) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [16:01:27] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:01:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:01:50] (03CR) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [16:02:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [16:02:41] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:06:41] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:06:41] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [16:09:53] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts dse-k8s-worker1013.eqiad.wmnet [16:10:11] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@b0517a4]: Deploy to pickup T385112#10905490. [16:10:15] T385112: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112 [16:10:49] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@b0517a4]: Deploy to pickup T385112#10905490. (duration: 02m 14s) [16:11:50] (03CR) 10Ilias Sarantopoulos: "Just a small nit regarding sorting the entries in the array, other than that LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [16:14:05] (03PS1) 10Tchanders: WIP Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [16:14:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T395241)', diff saved to https://phabricator.wikimedia.org/P77736 and previous config saved to /var/cache/conftool/dbconfig/20250611-161444-fceratto.json [16:15:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1234.eqiad.wmnet with reason: Maintenance [16:15:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T395241)', diff saved to https://phabricator.wikimedia.org/P77737 and previous config saved to /var/cache/conftool/dbconfig/20250611-161509-fceratto.json [16:16:41] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[16:18:16] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [16:18:35] (03CR) 10Ilias Sarantopoulos: "Just a comment regarding sorting alphabetically the wikis in the arrays. Other than that it looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [16:18:43] (03CR) 10Ahmon Dancy: [C:03+1] "This looks reasonable to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136044 (https://phabricator.wikimedia.org/T364694) (owner: 10Aklapper) [16:20:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:22:17] btullis@cumin1003 reimage (PID 1157910) is awaiting input [16:22:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T395241)', diff saved to https://phabricator.wikimedia.org/P77738 and previous config saved to /var/cache/conftool/dbconfig/20250611-162335-fceratto.json [16:23:57] btullis@cumin1002 decommission (PID 1236153) is awaiting input [16:24:42] (03CR) 10Dzahn: "I have a side question. Should I not expect to see newer versions on https://docker-registry.wikimedia.org/buildkitd/tags/ ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:24:53] (03CR) 10Dzahn: [C:03+2] gitlab runners: update buildkitd to v0.22.0 [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:25:09] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [16:25:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [16:25:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dse-k8s-worker1013.eqiad.wmnet [16:25:58] (03CR) 10Ahmon Dancy: [C:03+1] "https://docker-registry.wikimedia.org/repos/releng/buildkit/tags/ is where you want to look" [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:27:05] (03PS1) 10Cwhite: Change message when no errors found [software/ecs] - 10https://gerrit.wikimedia.org/r/1155726 [16:27:12] (03CR) 10Ahmon Dancy: [C:03+1] "In fact, if you know of a way to delete https://docker-registry.wikimedia.org/buildkitd, that would be nice." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:27:55] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [16:28:30] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:53] (03PS1) 10Aklapper: Penalize on linking a Pholio Mock [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155727 (https://phabricator.wikimedia.org/T396609) [16:29:25] (03CR) 10Brennen Bearnes: "Last I knew, deleting images from the registry was considered a no-go, but maybe (hopefully) that has changed..." [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:29:56] (03CR) 10Dzahn: [C:03+2] "ah, thanks, gotcha!" [puppet] - 10https://gerrit.wikimedia.org/r/1155324 (https://phabricator.wikimedia.org/T394931) (owner: 10Brennen Bearnes) [16:30:06] (03CR) 10Aklapper: [V:03+2 C:03+2] Penalize on linking a Pholio Mock [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1155727 (https://phabricator.wikimedia.org/T396609) (owner: 10Aklapper) [16:30:15] (03CR) 10Dzahn: [C:03+1] "What is the status of doc2003 regarding backups now? Last comment I saw was about adding it to the ignore list. 
But if that is production " [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [16:30:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:50] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: After moving dse-k8s-worker1013 vlan - btullis@cumin1002" [16:31:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: After moving dse-k8s-worker1013 vlan - btullis@cumin1002" [16:31:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:05] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1013 [16:33:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1013 [16:34:24] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#10905611 (10DSantamaria) [16:34:28] !log btullis@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:34:42] (03CR) 10Dzahn: [V:03+1 C:03+1] ";; ANSWER SECTION:" [puppet] - 10https://gerrit.wikimedia.org/r/1154855 (owner: 10BCornwall) [16:36:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:36:58] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#10905621 (10DSantamaria) [16:37:03] !incidents [16:37:04] 6343 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [16:37:04] 6342 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [16:37:04] 6341 (RESOLVED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [16:37:13] !ack 6343 [16:37:13] 6343 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [16:38:40] 10ops-codfw, 06DC-Ops: Alert for device lsw1-b3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396635 (10phaultfinder) 03NEW [16:38:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P77739 and previous config saved to /var/cache/conftool/dbconfig/20250611-163842-fceratto.json [16:40:22] btullis@cumin1002 provision (PID 1261705) is awaiting input [16:41:13] (03CR) 10Aleksandar Mastilovic: "What does this "experimental check failed" mean? Is there a change required to the source code?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [16:41:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:42:41] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:30] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:41] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:33] 10ops-codfw, 06DC-Ops: Alert for device lsw1-b5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396638 (10phaultfinder) 03NEW [16:49:40] (03PS1) 10AOkoth: doc: make doc2003 the active host [puppet] - 
10https://gerrit.wikimedia.org/r/1155733 (https://phabricator.wikimedia.org/T392130) [16:53:35] 10ops-codfw, 06DC-Ops: Alert for device lsw1-c1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396639 (10phaultfinder) 03NEW [16:53:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P77740 and previous config saved to /var/cache/conftool/dbconfig/20250611-165350-fceratto.json [16:54:28] (03PS6) 10Gkyziridis: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) [16:56:41] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155735 [16:58:19] 10SRE-SLO, 06Abstract Wikipedia team, 06SRE Observability, 07Essential-Work: create new SLO dashboard via Pyrra - https://phabricator.wikimedia.org/T394057#10905756 (10Jdforrester-WMF) [16:58:21] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#10905757 (10Jdforrester-... [16:58:23] 10SRE-SLO, 06Abstract Wikipedia team, 06SRE Observability, 07Essential-Work: create new SLO dashboard via Pyrra - https://phabricator.wikimedia.org/T394057#10905759 (10Jdforrester-WMF) [16:58:32] (03CR) 10AOkoth: "Ack. I've made `doc2003` the active_host on Puppet. That will automatically disable backups on it. 
I should probably merge that before thi" [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [16:58:34] 10ops-codfw, 06DC-Ops: Alert for device lsw1-d1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396641 (10phaultfinder) 03NEW [17:00:05] swfrench-wmf and jasmine_: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1700). [17:00:52] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:01:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:06:31] (03PS9) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [17:08:38] 10ops-codfw, 06DC-Ops: Alert for device lsw1-d5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396642 (10phaultfinder) 03NEW [17:08:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T395241)', diff saved to https://phabricator.wikimedia.org/P77741 and previous config saved to /var/cache/conftool/dbconfig/20250611-170857-fceratto.json [17:09:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [17:09:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T395241)', diff saved to https://phabricator.wikimedia.org/P77742 and previous config saved to /var/cache/conftool/dbconfig/20250611-170922-fceratto.json [17:09:41] (03PS1) 10Bking: cirrussearch: return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) [17:11:36] btullis@cumin1002 
provision (PID 1261705) is awaiting input [17:12:04] (03CR) 10Majavah: [C:03+1] P:idm Enable API [puppet] - 10https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605) (owner: 10Slyngshede) [17:12:10] (03PS7) 10Ilias Sarantopoulos: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [17:12:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:40] (03CR) 10Ilias Sarantopoulos: [C:03+1] ores-extension: enable oresUI for the second batch of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [17:13:40] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:16:32] (03PS8) 10Ilias Sarantopoulos: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [17:16:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:17:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T395241)', diff saved to https://phabricator.wikimedia.org/P77743 and previous config saved to /var/cache/conftool/dbconfig/20250611-171733-fceratto.json [17:18:30] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [17:19:29] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:19:45] (03PS9) 10Ilias Sarantopoulos: ores-extension: enable 
oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [17:20:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:22:24] (03CR) 10Btullis: [C:03+1] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [17:22:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:35] 10ops-codfw, 06DC-Ops: Alert for device lsw1-d6-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396643 (10phaultfinder) 03NEW [17:29:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [17:32:00] (03CR) 10Herron: [C:03+2] pyrra: update o11y slos to 4w window [puppet] - 10https://gerrit.wikimedia.org/r/1155246 (https://phabricator.wikimedia.org/T395916) (owner: 10Herron) [17:32:13] (03CR) 10Mforns: "One post-merge comment:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155123 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [17:32:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P77744 and previous config saved to /var/cache/conftool/dbconfig/20250611-173240-fceratto.json [17:35:26] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [17:35:50] !log 
btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [17:37:04] (03CR) 10Ryan Kemper: [C:03+2] etcd data for search-{psi,omega} dns discovery [puppet] - 10https://gerrit.wikimedia.org/r/1151308 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:37:56] !log T143553 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151308 (first patch in plan https://phabricator.wikimedia.org/T143553#10861215) [17:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:59] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [17:45:13] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet with reason: host reimage [17:46:00] (03CR) 10Ryan Kemper: [C:03+2] search: Add dnsdisc entries for omega and psi clusters [puppet] - 10https://gerrit.wikimedia.org/r/1151300 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:47:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P77745 and previous config saved to /var/cache/conftool/dbconfig/20250611-174747-fceratto.json [17:48:35] !log T143553 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151300 to add dnsdisc entries for omega/psi clusters (second patch in plan https://phabricator.wikimedia.org/T143553#10861215) [17:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:39] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [17:48:52] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10906008 (10RobH) [17:48:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1012.eqiad.wmnet 
with reason: host reimage [17:50:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 (10RobH) 03NEW [17:50:52] !log running agent on A:dnsbox T143553 [17:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:39] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [17:52:55] (03PS2) 10Ryan Kemper: search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1151316 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:53:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906027 (10RobH) a:05RobH→03None @jcrespo or @Marostegui: Would one of you be the best person to handle this or should I task it over to Kwaku for assignment? Basi... [17:53:38] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1155733/5919/" [puppet] - 10https://gerrit.wikimedia.org/r/1155733 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:53:40] (03PS4) 10Ebernhardson: Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) [17:55:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1013.eqiad.wmnet with reason: host reimage [17:56:08] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search-psi [17:56:10] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649 (10RobH) 03NEW [17:56:25] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search-omega [17:57:02] !log T143553 Pooled `dns-disc=search-(omega|psi)` per plan in https://phabricator.wikimedia.org/T143553#10861215 
[17:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:05] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [17:57:21] (03PS6) 10Ryan Kemper: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:58:01] (03CR) 10Ryan Kemper: [C:03+2] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [17:58:17] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10906050 (10RobH) @Jgreen: Should this turf to you or should I assign it over to Greg for allocation? Basically we need to update the firmware on the affected db hosts db... [17:58:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10906051 (10RobH) [17:59:15] (03CR) 10Dzahn: [C:03+1] "yea, that sounds right (about the order of things)" [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:59:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906058 (10Ladsgroup) I'm on phone give me a second to tell you bow each one is and how we can proceed. Some might be simpler than others [18:00:05] brennen and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T1800). 
[18:00:28] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10906061 (10RobH) [18:02:17] (03PS7) 10Ryan Kemper: search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:02:17] (03PS5) 10Ryan Kemper: Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:02:35] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] search: Add search-{psi,omega} geoip discovery entries [dns] - 10https://gerrit.wikimedia.org/r/1151304 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:02:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T395241)', diff saved to https://phabricator.wikimedia.org/P77746 and previous config saved to /var/cache/conftool/dbconfig/20250611-180254-fceratto.json [18:03:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1251.eqiad.wmnet with reason: Maintenance [18:03:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T395241)', diff saved to https://phabricator.wikimedia.org/P77747 and previous config saved to /var/cache/conftool/dbconfig/20250611-180309-fceratto.json [18:03:12] !log sukhe@dns1004 START - running authdns-update [18:03:44] o/ [18:03:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906076 (10Ladsgroup) Db1250 is master of m1, needs a switchover. Db1251 is a normal s1 replica. We can depool it at any moment Db1252 is also a normal replica of w4. C... 
[18:04:03] !log sukhe@dns1004 END - running authdns-update [18:05:26] (03CR) 10Ryan Kemper: [C:03+2] Add search-{psi,omega}.svc.$dc.wmnet cnames [dns] - 10https://gerrit.wikimedia.org/r/1151303 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:05:47] !log sukhe@dns1004 START - running authdns-update [18:05:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651 (10RobH) 03NEW [18:06:20] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [18:06:39] !log sukhe@dns1004 END - running authdns-update [18:07:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906115 (10Ladsgroup) I can take care of all of them later today except db1250. For that it has to wait a bit [18:07:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10906116 (10RobH) a:03Andrew @andrew, Would you be the best person to handle this or should I task it over to Joanna for assignment? Basically we need... [18:07:56] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10906119 (10RobH) [18:08:25] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10906122 (10RobH) 05Open→03Resolved Thanks! 
[18:08:45] !log sudo cumin 'A:lvs-secondary-eqiad or A:lvs-secondary-codfw' 'run-puppet-agent': T143553 [18:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:49] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [18:09:11] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10906126 (10RobH) The cookbook has been repaired via T394543 (Thank you Riccardo!) and now sub-tasks have been filed and linked from this parent task for all service groups/sre sub-teams... [18:09:25] btullis@cumin1003 reimage (PID 1166575) is awaiting input [18:09:41] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:09:57] !log 1.45.0-wmf.5 train status (392175): no current blockers, logs reasonably clean, rolling to group1 [18:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906129 (10RobH) a:03Ladsgroup >>! In T396648#10906115, @Ladsgroup wrote: > I can take care of all of them later today except db1250. For that it has to wait a bit T... 
[18:10:57] !log sudo cumin 'A:lvs-low-traffic-eqiad or A:lvs-low-traffic-codfw' 'run-puppet-agent': T143553 [18:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:01] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155747 (https://phabricator.wikimedia.org/T392175) [18:11:02] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155747 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:11:51] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155747 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:12:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [18:12:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet with OS bookworm [18:12:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T395241)', diff saved to https://phabricator.wikimedia.org/P77748 and previous config saved to /var/cache/conftool/dbconfig/20250611-181228-fceratto.json [18:13:20] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [18:16:04] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [18:16:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1013.eqiad.wmnet with OS bookworm [18:16:57] (03CR) 10Dzahn: [V:03+1] "the httpbb tests pass, so V+1. 
I just don't have the background if this was confirmed with releng and CI is configured to upload to the ne" [puppet] - 10https://gerrit.wikimedia.org/r/1155733 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [18:19:27] (03CR) 10Ryan Kemper: [C:03+2] search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1151316 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [18:19:28] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#10906166 (10VRiley-WMF) I currently have a case pending with servertech.com the ticket is 00503345 [18:21:19] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.5 refs T392175 [18:21:23] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [18:21:27] brennen: once the dust settles for group 1 and logs look clean, would it be alright if I use the tail of your window to wrap up some shellbox deployments that got preempted earlier? [18:22:27] swfrench-wmf: yeah, give me a few minutes to triage logs and i'll give you a ping? [18:22:46] brennen: that sounds great - take your time! :) [18:23:15] cool. one or two things here i want to take a closer look at. 
[18:23:18] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:26:46] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [18:26:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [18:26:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:27:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P77749 and previous config saved to /var/cache/conftool/dbconfig/20250611-182735-fceratto.json [18:28:37] 10ops-codfw, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396657 (10phaultfinder) 03NEW [18:29:34] 10ops-codfw, 06DC-Ops: Alert for device lsw1-e3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396658 (10phaultfinder) 03NEW [18:30:42] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:31:29] !log truncating restbase mobile-sections table — T395845 [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:32] T395845: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845 [18:31:51] swfrench-wmf: go ahead i'd say. 
[18:32:35] brennen: great, thank you very much [18:33:35] 10ops-codfw, 06DC-Ops: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396659 (10phaultfinder) 03NEW [18:34:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [18:35:32] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [18:35:43] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:36:01] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:36:12] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [18:36:26] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [18:36:26] vriley@cumin1002 netbox (PID 1306732) is awaiting input [18:36:38] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [18:37:02] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [18:37:09] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [18:37:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" [18:37:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:38:39] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393785#10906271 (10phaultfinder) [18:42:12] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:42:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to 
https://phabricator.wikimedia.org/P77750 and previous config saved to /var/cache/conftool/dbconfig/20250611-184242-fceratto.json [18:43:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [18:44:49] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:45:16] vriley@cumin1002 provision (PID 474103) is awaiting input [18:49:11] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:49:19] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:49:35] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [18:49:41] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:50:16] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [18:50:27] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:50:40] !log remove ganeti1047 from Ganeti cluster in eqiad for hardware diagnosis [18:50:42] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:53] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [18:51:06] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [18:51:18] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [18:51:41] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [18:52:11] 
!log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:53:30] FIRING: ProbeDown: Service ganeti1047:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:57] PROBLEM - ganeti-confd running on ganeti1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [18:53:57] PROBLEM - ganeti-noded running on ganeti1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [18:54:49] 10ops-eqiad, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660 (10MoritzMuehlenhoff) 03NEW [18:54:51] ^ expected [18:56:20] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [18:56:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... 
[18:57:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T395241)', diff saved to https://phabricator.wikimedia.org/P77751 and previous config saved to /var/cache/conftool/dbconfig/20250611-185748-fceratto.json [18:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:58:40] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10906345 (10Eevans) That didn't move the needle as much as I'd hoped. Most of the storage volumes are now at abut ~75% (using a combination of `nodetool cleanup` and the truncation of mobile-sections). restba... [18:59:59] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10906346 (10Eevans) 05Open→03Resolved [19:00:36] FYI, I am done with the previously mentioned shellbox updates [19:02:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [19:02:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906351 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... 
[19:10:15] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [19:10:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906360 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [19:12:19] (03PS1) 10Bartosz Dziewoński: Change OutputPage::wrapWikiTextAsInterface() to soft-deprecation [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155749 (https://phabricator.wikimedia.org/T396618) [19:12:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155749 (https://phabricator.wikimedia.org/T396618) (owner: 10Bartosz Dziewoński) [19:13:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [19:14:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... [19:16:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [19:16:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... 
[19:17:12] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:27:24] (03PS2) 10Cwhite: Perform dot expansion per dot_expander.rb [software/ecs] - 10https://gerrit.wikimedia.org/r/1155726 [19:27:28] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10906386 (10Eevans) @VRiley-WMF any eta on this? I don't need to do any actual drive swapping right now, but knowing the number and disposition of drives avail... [19:28:26] (03CR) 10Eevans: [C:03+2] cassandra: reuse preseed for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1152337 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [19:28:48] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10906391 (10VRiley-WMF) Thanks for checking in. 
I will get you a number soon [19:29:10] (03PS3) 10Cwhite: Perform dot expansion per dot_expander.rb [software/ecs] - 10https://gerrit.wikimedia.org/r/1155726 [19:30:12] (03CR) 10Cwhite: [C:03+2] Perform dot expansion per dot_expander.rb [software/ecs] - 10https://gerrit.wikimedia.org/r/1155726 (owner: 10Cwhite) [19:30:32] (03Merged) 10jenkins-bot: Perform dot expansion per dot_expander.rb [software/ecs] - 10https://gerrit.wikimedia.org/r/1155726 (owner: 10Cwhite) [19:30:34] vriley@cumin1002 reimage (PID 1311221) is awaiting input [19:30:50] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [19:30:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906411 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [19:31:06] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:31:28] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:40] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:11] (03CR) 10AOkoth: [C:03+2] doc: make doc2003 the active host [puppet] - 10https://gerrit.wikimedia.org/r/1155733 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:34:35] !log vriley@cumin1002 
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:36:13] (03PS3) 10AOkoth: wmnet: switch active doc host [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) [19:37:38] (03CR) 10AOkoth: [C:03+2] wmnet: switch active doc host [dns] - 10https://gerrit.wikimedia.org/r/1155306 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:38:03] !log aokoth@dns1004 START - running authdns-update [19:38:56] !log aokoth@dns1004 END - running authdns-update [19:44:38] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:45:19] vriley@cumin1002 reimage (PID 1315010) is awaiting input [19:45:49] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [19:45:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... [19:46:03] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [19:46:09] (03PS1) 10Eevans: cassandra-dev2001: configure for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1155756 (https://phabricator.wikimedia.org/T391544) [19:46:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... 
[19:50:41] (03Abandoned) 10Eevans: Use instance `ID=default` when no ID is supplied [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/384055 (https://phabricator.wikimedia.org/T178169) (owner: 10Eevans) [19:51:43] (03Abandoned) 10Eevans: Don't start cassandra on boot or via puppet [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/219503 (https://phabricator.wikimedia.org/T103134) (owner: 10GWicke) [19:52:40] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:52:41] RESOLVED: ProbeDown: Service ganeti1047:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:58:10] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10906453 (10VRiley-WMF) @Eevans We currently 15x 480 drives. [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T2000). [20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] hi [20:01:09] hi ! 
i can deploy [20:01:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155749 (https://phabricator.wikimedia.org/T396618) (owner: 10Bartosz Dziewoński) [20:02:12] vriley@cumin1002 reimage (PID 1315192) is awaiting input [20:02:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:02:25] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [20:02:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [20:02:33] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [20:02:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906469 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... 
[20:05:29] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:05:49] (03Merged) 10jenkins-bot: Change OutputPage::wrapWikiTextAsInterface() to soft-deprecation [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1155749 (https://phabricator.wikimedia.org/T396618) (owner: 10Bartosz Dziewoński) [20:06:04] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:06:12] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1155749|Change OutputPage::wrapWikiTextAsInterface() to soft-deprecation (T396618)]] [20:06:15] T396618: PHP Deprecated: Use of MediaWiki\Output\OutputPage::wrapWikiTextAsInterface was deprecated in MediaWiki 1.45. [Called from MediaWiki\Extension\Translate\Synchronization\ExportTranslationsSpecialPage::execute] - https://phabricator.wikimedia.org/T396618 [20:06:27] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:07:14] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:07:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:07:29] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [20:07:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - 
https://phabricator.wikimedia.org/T377878#10906475 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... [20:07:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [20:08:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906476 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [20:08:27] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1155749|Change OutputPage::wrapWikiTextAsInterface() to soft-deprecation (T396618)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:08:53] MatmaRex: ok to sync? [20:09:00] cjming: yup [20:09:11] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:11:10] (03CR) 10Krinkle: "Tagging with wmf-perf because this changes the cache/expiry handling, and because it moves flamegraph samples. All good and welcome, but I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [20:16:12] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155749|Change OutputPage::wrapWikiTextAsInterface() to soft-deprecation (T396618)]] (duration: 10m 00s) [20:16:16] T396618: PHP Deprecated: Use of MediaWiki\Output\OutputPage::wrapWikiTextAsInterface was deprecated in MediaWiki 1.45. 
[Called from MediaWiki\Extension\Translate\Synchronization\ExportTranslationsSpecialPage::execute] - https://phabricator.wikimedia.org/T396618 [20:23:59] vriley@cumin1002 reimage (PID 1318504) is awaiting input [20:24:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [20:24:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906528 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [20:24:51] vriley@cumin1002 reimage (PID 1318556) is awaiting input [20:24:55] (03PS1) 10Dwisehaupt: Add civi cname to civicrm for new standalone testing [dns] - 10https://gerrit.wikimedia.org/r/1155761 (https://phabricator.wikimedia.org/T261779) [20:25:54] cjming: thanks for deploying [20:26:02] np - yw! [20:27:14] (03CR) 10Jgreen: [C:03+1] Add civi cname to civicrm for new standalone testing [dns] - 10https://gerrit.wikimedia.org/r/1155761 (https://phabricator.wikimedia.org/T261779) (owner: 10Dwisehaupt) [20:27:53] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [20:28:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10906537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... 
[20:30:27] (03CR) 10Dwisehaupt: [C:03+2] Add civi cname to civicrm for new standalone testing [dns] - 10https://gerrit.wikimedia.org/r/1155761 (https://phabricator.wikimedia.org/T261779) (owner: 10Dwisehaupt) [20:30:42] !log dwisehaupt@dns1004 START - running authdns-update [20:31:39] !log dwisehaupt@dns1004 END - running authdns-update [20:59:02] (03CR) 10Cathal Mooney: [C:03+1] "Looks good, one nit in-line." [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T2100) [21:00:27] (03CR) 10Cathal Mooney: [C:03+1] Promote the TransitPeeringIn/OutSaturation alerts to p.aging (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [21:10:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155686 (owner: 10Jforrester) [21:10:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155687 (https://phabricator.wikimedia.org/T390746) (owner: 10Jforrester) [21:11:33] (03Merged) 10jenkins-bot: WikiLambda: Set repo-only config only in repo mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155686 (owner: 10Jforrester) [21:11:39] (03Merged) 10jenkins-bot: WikiLambda: Enable orchestrator cache updates on edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155687 (https://phabricator.wikimedia.org/T390746) (owner: 10Jforrester) [21:12:03] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1155686|WikiLambda: Set repo-only config only in repo mode]], [[gerrit:1155687|WikiLambda: Enable orchestrator cache updates on edit (T390746)]] [21:12:08] T390746: When needing an Object, fetch it from the 
memcached pool not HTTP if so configured - https://phabricator.wikimedia.org/T390746 [21:14:12] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1155686|WikiLambda: Set repo-only config only in repo mode]], [[gerrit:1155687|WikiLambda: Enable orchestrator cache updates on edit (T390746)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:14:55] !log jforrester@deploy1003 jforrester: Continuing with sync [21:17:59] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [21:18:03] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [21:21:49] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155686|WikiLambda: Set repo-only config only in repo mode]], [[gerrit:1155687|WikiLambda: Enable orchestrator cache updates on edit (T390746)]] (duration: 09m 45s) [21:21:52] T390746: When needing an Object, fetch it from the memcached pool not HTTP if so configured - https://phabricator.wikimedia.org/T390746 [21:23:47] 10SRE-Access-Requests: apine is a member of wmf and project-deployment-prep but not spider pig - https://phabricator.wikimedia.org/T396669 (10cmassaro) 03NEW [21:24:32] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [21:24:36] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [21:26:21] 10SRE-Access-Requests: apine is a member of wmf and deployers but not spider pig - https://phabricator.wikimedia.org/T396669#10906693 (10cmassaro) [21:26:43] 10SRE-Access-Requests: apine is a member of wmf and deployers but not spider pig - https://phabricator.wikimedia.org/T396669#10906695 
(10Jdforrester-WMF) [21:26:55] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:27:40] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:27:47] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:28:16] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:28:31] 10SRE-Access-Requests: apine is a member of wmf and deployers but not spider pig - https://phabricator.wikimedia.org/T396669#10906700 (10Dzahn) You need to request membership in groups "deployment" and "spiderpig-access". The "spiderpig-access" part you can request at https://idm.wikimedia.org/permissions/ see... [21:30:49] 06SRE, 10SRE-Access-Requests: apine is a member of wmf and deployers but not spider pig - https://phabricator.wikimedia.org/T396669#10906709 (10Dzahn) update: since the last edit I see you already have "deployment" (not the same as deployment-prep). This means all you need is the spiderpig-access part. Just... 
[21:36:15] (03PS1) 10CDobbins: varnish: add libvmod-wmfuniq to apt-get install packages [cookbooks] - 10https://gerrit.wikimedia.org/r/1155771 [21:37:15] (03CR) 10BCornwall: [C:03+1] varnish: add libvmod-wmfuniq to apt-get install packages [cookbooks] - 10https://gerrit.wikimedia.org/r/1155771 (owner: 10CDobbins) [21:38:47] (03CR) 10BCornwall: [V:03+2 C:03+1] "`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1155771 (owner: 10CDobbins) [21:39:07] (03CR) 10CDobbins: [C:03+2] varnish: add libvmod-wmfuniq to apt-get install packages [cookbooks] - 10https://gerrit.wikimedia.org/r/1155771 (owner: 10CDobbins) [21:43:46] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [21:43:50] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [21:48:57] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp7001.magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [21:49:01] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [21:50:30] (03PS1) 10Cwhite: beta-logs: bump phatality version [puppet] - 10https://gerrit.wikimedia.org/r/1155773 (https://phabricator.wikimedia.org/T387606) [21:50:31] (03PS1) 10Cwhite: logstash: bump phatality version [puppet] - 10https://gerrit.wikimedia.org/r/1155774 (https://phabricator.wikimedia.org/T387606) [21:51:30] (03CR) 10Cwhite: [C:03+2] beta-logs: bump phatality version [puppet] - 10https://gerrit.wikimedia.org/r/1155773 (https://phabricator.wikimedia.org/T387606) (owner: 10Cwhite) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250611T2200) [22:01:15] (03PS1) 10Andrew Bogott: Add radosgw access for members of the new 'object_storage' role. 
[puppet] - 10https://gerrit.wikimedia.org/r/1155775 (https://phabricator.wikimedia.org/T396594) [22:13:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10906820 (10Andrew) Yes -- assuming that the cookbook works reliably for updating the firmware, these should either be managed by me or by the hypothe... [22:35:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10906864 (10RobH) The cookbook worked reliably for updating 4 of the 6 cirrussearch hosts (first couple were used in testing so had issues on the automat... [22:35:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10906865 (10RobH) [22:36:28] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10906866 (10RobH) [22:36:34] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10906867 (10RobH) [22:38:06] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10906869 (10RobH) a:05RobH→03BTullis @btullis, With the successful update of the cookbook, an-coord1004 can now be scheduled for downtime and update. The downtime is about 15 minutes or so... [22:38:58] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10906872 (10RobH) [22:40:02] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10906884 (10RobH) @btullis, I've updated the task description to answer the question on downtime and steps required. Would you like to handle the actual firmware update to these hosts via th...
[22:40:09] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10906885 (10RobH) a:05RobH→03BTullis [22:40:22] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10906886 (10RobH) [22:40:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1253 (T396648)', diff saved to https://phabricator.wikimedia.org/P77754 and previous config saved to /var/cache/conftool/dbconfig/20250611-224035-ladsgroup.json [22:40:40] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 [22:40:52] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10906889 (10RobH) p:05Triage→03Medium [22:40:54] (03CR) 10Samwilson: [C:03+1] IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155665 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [22:43:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1253.eqiad.wmnet with reason: Firmware upgrade (T396648) [22:47:05] !log ladsgroup@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [22:48:00] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [22:48:32] !log ladsgroup@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [22:49:09] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [22:50:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906964 (10Ladsgroup) I'm getting this for db1253: ` db1253.eqiad.wmnet (Gen 
15): starting db1253.eqiad.wmnet (SSD): update db1253.eqiad.wmnet (SSD): current version: 1... [22:53:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1253.eqiad.wmnet with reason: Firmware upgrade (T396648) [22:53:18] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 [22:54:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10906995 (10Ladsgroup) I bumped the downtime to 48 hours, (I have shut down mariadb and ran swap off since it'll need a reboot) so if you need to do it on your own, plea... [22:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:58:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10907003 (10Ladsgroup) a:05Ladsgroup→03RobH [23:38:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155780 [23:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155780 (owner: 10TrainBranchBot) [23:50:27] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1155780 (owner: 10TrainBranchBot)