[00:07:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195426 [00:07:56] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195426 (owner: 10TrainBranchBot) [00:28:57] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1195426 (owner: 10TrainBranchBot) [00:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:14:06] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 25s) [01:32:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:18:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b13-drmrs and cr1-drmrs (185.15.58.148) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:19:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b13-drmrs:et-0/0/50 (Core: cr1-drmrs:et-0/0/2 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:20:03] (03PS5) 10RLazarus: deployment_server: Prefix `helmfile apply` output with "[service env]" [puppet] - 10https://gerrit.wikimedia.org/r/1192282 [02:20:03] (03PS2) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 [02:22:51] (03CR) 10CI reject: [V:04-1] deployment_server: Prefix `helmfile apply` output with "[service env]" [puppet] - 10https://gerrit.wikimedia.org/r/1192282 (owner: 10RLazarus) [02:23:20] (03CR) 10CI reject: [V:04-1] deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 (owner: 10RLazarus) [02:31:25] (03PS6) 10RLazarus: deployment_server: Prefix `helmfile apply` output with "[service env]" [puppet] - 10https://gerrit.wikimedia.org/r/1192282 [02:31:26] (03PS3) 10RLazarus: deployment_server: Refactor charlie to add a Service dataclass [puppet] - 10https://gerrit.wikimedia.org/r/1195352 [02:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:04:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:07:10] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:38:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/2 (Core: asw1-b13-drmrs:et-0/0/50 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:45:10] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:47:04] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30041 bytes in 3.832 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:02:10] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:05:42] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#11267431 (10Joe) >>! 
In T389932#10701585, @jhathaway wrote: > In proposing possible solutions, I would love to understand a bit more why our `site.pp` uses complex regexes.... [04:09:57] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#11267432 (10Joe) >>! In T389932#10701560, @jhathaway wrote: >>>! In T389932#10697436, @Joe wrote: >>>>! In T389932#10694961, @jhathaway wrote: >>> One issue with using just... [04:47:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1027,1050].eqiad.wmnet with reason: Cloning [04:49:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1027.eqiad.wmnet onto es1050.eqiad.wmnet [04:49:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1027 - Depool es1027.eqiad.wmnet to then clone it to es1050.eqiad.wmnet - marostegui@cumin1003 [04:49:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1027 - Depool es1027.eqiad.wmnet to then clone it to es1050.eqiad.wmnet - marostegui@cumin1003 [04:50:48] (03PS1) 10Marostegui: db1241: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195431 (https://phabricator.wikimedia.org/T406541) [04:51:38] (03CR) 10Marostegui: [C:03+2] db1241: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195431 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [04:52:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1241.eqiad.wmnet with reason: Maintenance [04:52:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1241 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83759 and previous config saved to /var/cache/conftool/dbconfig/20251013-045230-marostegui.json [04:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about 
to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:53:33] (03PS1) 10Arnaudb: gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) [05:00:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1241 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83760 and previous config saved to /var/cache/conftool/dbconfig/20251013-050034-root.json [05:01:55] (03PS1) 10Marostegui: mariadb: Productionize es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1195433 (https://phabricator.wikimedia.org/T406488) [05:02:33] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1056 [puppet] - 10https://gerrit.wikimedia.org/r/1195433 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [05:02:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:48] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:06:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1033.eqiad.wmnet onto 
es1056.eqiad.wmnet [05:06:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1033 - Depool es1033.eqiad.wmnet to then clone it to es1056.eqiad.wmnet - marostegui@cumin1003 [05:08:27] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1241 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83762 and previous config saved to /var/cache/conftool/dbconfig/20251013-051540-root.json [05:20:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1033 - Depool es1033.eqiad.wmnet to then clone it to es1056.eqiad.wmnet - marostegui@cumin1003 [05:30:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83763 and previous config saved to /var/cache/conftool/dbconfig/20251013-053045-root.json [05:31:22] (03CR) 10Marostegui: [C:04-1] site.pp: Add es2052, remove from insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [05:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:34] (03PS1) 10Marostegui: db1238: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195435 (https://phabricator.wikimedia.org/T406541) [05:36:16] (03CR) 10Marostegui: [C:03+2] db1238: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195435 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) 
[05:37:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1238.eqiad.wmnet with reason: Maintenance [05:37:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1238 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83764 and previous config saved to /var/cache/conftool/dbconfig/20251013-053723-marostegui.json [05:42:39] (03PS3) 10Arnaudb: gerrit: remove localbackup logic from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1193599 (https://phabricator.wikimedia.org/T387833) [05:45:00] (03PS4) 10Arnaudb: gerrit: add a local backup cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) [05:45:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1238 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83765 and previous config saved to /var/cache/conftool/dbconfig/20251013-054528-root.json [05:45:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83766 and previous config saved to /var/cache/conftool/dbconfig/20251013-054551-root.json [05:47:46] (03PS1) 10Arnaudb: gerrit: add dry run rsync [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) [05:47:51] (03PS3) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) [05:48:53] (03PS4) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) [05:49:23] (03CR) 10Arnaudb: gerrit: local backup on source server only (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [05:53:06] (03PS2) 10Arnaudb: gerrit: add dry run rsync [cookbooks] - 
10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) [06:00:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1238 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83767 and previous config saved to /var/cache/conftool/dbconfig/20251013-060034-root.json [06:15:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83768 and previous config saved to /var/cache/conftool/dbconfig/20251013-061540-root.json [06:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:18:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b13-drmrs and cr1-drmrs (185.15.58.148) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:19:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b13-drmrs:et-0/0/50 (Core: cr1-drmrs:et-0/0/2 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:22:20] 06SRE, 10Hiddenparma: Exclude logged in users from requestctl general filters, create separate scope for it. 
- https://phabricator.wikimedia.org/T407092 (10Joe) 03NEW [06:24:50] (03PS1) 10Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) [06:30:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:30:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83769 and previous config saved to /var/cache/conftool/dbconfig/20251013-063046-root.json [06:31:01] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1195440 [06:31:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193988 (owner: 10DCausse) [06:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:34:12] (03PS1) 10Muehlenhoff: Update access metadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1195441 [06:35:27] (03CR) 10Muehlenhoff: [C:03+2] Update access metadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1195441 (owner: 10Muehlenhoff) [06:36:23] (03PS2) 10Giuseppe 
Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) [06:39:20] (03PS5) 10Revi: kowikisource: Add "해석" namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405) [06:43:53] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:51:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [06:52:35] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1195440 (owner: 10Muehlenhoff) [06:53:53] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T0700). [07:00:05] revi, Msz2001, kostajh, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:08] morning [07:00:12] o/ [07:00:13] well afternoon but still [07:00:31] o/ [07:00:37] Hi [07:00:44] I can deploy [07:00:51] thanks! [07:01:14] dcausse: will you need to verify your patch? [07:01:40] kostajh: no [07:01:47] Oh are we doing bottom to up? 
[07:02:05] No, I am looking to see if I can bundle all the config patches in one deploy [07:02:08] ah lol [07:02:17] * revi was about to go get something to drink [07:02:48] revi: I don’t think you need to verify yours either, as far as I can tell? [07:02:57] kostajh: I'd like to deploy mine myself, I'd like to practice deployments :) [07:02:58] I think post-deploy verification at max? [07:03:25] nothing to double check before deployment for sure [07:03:36] ok [07:03:48] Msz2001: that sounds fine. Do you want to do the config patches together? [07:04:26] I'd better start with a single patch today if that's not a problem [07:04:49] sure, that's fine [07:06:36] Msz2001: do you want to go first? [07:06:43] Okay, I will [07:07:15] revi: I’ll let you know after your patch is live [07:07:21] ping me when it's my turn - I'll... [07:07:21] good [07:07:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883) (owner: 10Mszwarc) [07:08:05] (03Merged) 10jenkins-bot: arbcom_plwiki: Change favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883) (owner: 10Mszwarc) [07:08:48] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1194975|arbcom_plwiki: Change favicon (T406883)]] [07:08:52] T406883: Change favicon of arbcom_plwiki - https://phabricator.wikimedia.org/T406883 [07:08:54] (03PS6) 10Revi: kowikisource: Add "해석" namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405) [07:08:56] preventing auto-rebase... 
[07:10:11] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:11:01] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30029 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:11:45] (03PS1) 10Marostegui: db1199: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195540 (https://phabricator.wikimedia.org/T406541) [07:14:33] (03CR) 10Marostegui: [C:03+2] db1199: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195540 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [07:15:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:15:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1199 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83770 and previous config saved to /var/cache/conftool/dbconfig/20251013-071521-marostegui.json [07:17:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:17:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:19:16] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7255/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [07:20:45] (03CR) 10Elukey: [V:03+1 C:03+2] profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [07:22:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:23:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1199 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83771 and previous config saved to /var/cache/conftool/dbconfig/20251013-072320-root.json [07:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:26:24] this is me sorry --^ [07:26:32] pcc didn't highlight the issue sigh [07:27:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:22] (03PS2) 10Arnaudb: Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1194932 (https://phabricator.wikimedia.org/T387833) [07:28:25] (03PS2) 10Arnaudb: Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 
10https://gerrit.wikimedia.org/r/1194931 (https://phabricator.wikimedia.org/T387833) [07:28:41] (03PS1) 10Elukey: admin: add amd-nodelabeller group [puppet] - 10https://gerrit.wikimedia.org/r/1195544 (https://phabricator.wikimedia.org/T373806) [07:29:45] FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:29:50] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1195202 (https://phabricator.wikimedia.org/T405557) (owner: 10Muehlenhoff) [07:30:02] (03CR) 10Elukey: [C:03+2] admin: add amd-nodelabeller group [puppet] - 10https://gerrit.wikimedia.org/r/1195544 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [07:30:31] should be fixed during the next puppet run, sorry folks [07:32:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:08] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1194975|arbcom_plwiki: Change favicon (T406883)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:33:12] T406883: Change favicon of arbcom_plwiki - https://phabricator.wikimedia.org/T406883 [07:33:28] !log mszwarc@deploy2002 mszwarc: Continuing with sync [07:33:32] (03CR) 10Jelto: "I'm a bit surprised about the non-production Gerrits hosts backup. 
At least for GitLab only the production host creates backups and has a " [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:36:03] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1194931 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:36:53] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1194932 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:37:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [07:37:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:48] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:38:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1199 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83772 and previous config saved to /var/cache/conftool/dbconfig/20251013-073825-root.json [07:38:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/2 (Core: asw1-b13-drmrs:et-0/0/50 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[07:41:00] (03PS1) 10Muehlenhoff: thumbor: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195551 [07:42:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:44:33] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [07:44:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11267581 (10elukey) Another test with the Kartotherian difftesting tool: ` | | ssim | |-----:|---------:| | 0.05 | 0.962884 | | 0.1 | 0.979971 | | 0.2 | 0.990422 | | 0.25... [07:44:56] (03CR) 10Elukey: [V:03+2 C:03+2] Add the node labeller binary to the package. (034 comments) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [07:45:21] (03CR) 10Elukey: [C:03+1] thumbor: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195551 (owner: 10Muehlenhoff) [07:45:45] any news? 
[07:46:09] I'll start the remaining config backports soon [07:46:20] ACK [07:46:34] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194975|arbcom_plwiki: Change favicon (T406883)]] (duration: 37m 46s) [07:46:38] T406883: Change favicon of arbcom_plwiki - https://phabricator.wikimedia.org/T406883 [07:46:43] I'm done with my deployment [07:46:47] thanks [07:46:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405) (owner: 10Revi) [07:46:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849) (owner: 10Revi) [07:46:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 (owner: 10Kosta Harlan) [07:46:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195408 (https://phabricator.wikimedia.org/T402366) (owner: 10Kosta Harlan) [07:46:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193988 (owner: 10DCausse) [07:47:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:48:40] (03Merged) 10jenkins-bot: kowikisource: Add "해석" namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405) (owner: 10Revi) [07:48:43] (03Merged) 10jenkins-bot: kowiki: Restrict move 
ratelimit for non-extendedconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849) (owner: 10Revi) [07:48:45] (03Merged) 10jenkins-bot: wmgMonologChannels: Set CheckUser to info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 (owner: 10Kosta Harlan) [07:48:47] (03Merged) 10jenkins-bot: hCaptcha: Enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195408 (https://phabricator.wikimedia.org/T402366) (owner: 10Kosta Harlan) [07:48:49] (03Merged) 10jenkins-bot: NetworkSession: enable only for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193988 (owner: 10DCausse) [07:49:07] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1193521|kowikisource: Add "해석" namespace (T406405)]], [[gerrit:1194916|kowiki: Restrict move ratelimit for non-extendedconfirmed users (T406849)]], [[gerrit:1195212|wmgMonologChannels: Set CheckUser to info level]], [[gerrit:1195408|hCaptcha: Enable on testwiki (T402366)]], [[gerrit:1193988|NetworkSession: enable only for private wikis]] [07:49:14] T406405: Create "해석" namespace @ kowikisource - https://phabricator.wikimedia.org/T406405 [07:49:14] T406849: Restrict kowiki move ratelimit for non-extendedconfirmed users - https://phabricator.wikimedia.org/T406849 [07:49:15] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:52:43] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:48] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:53:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1199 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83773 and previous config saved to /var/cache/conftool/dbconfig/20251013-075331-root.json [07:54:03] (03PS1) 10Elukey: Fix the amd-nodelabeller's sysuser config [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195598 (https://phabricator.wikimedia.org/T373806) [07:55:03] !log kharlan@deploy2002 revi, kharlan, dcausse: Backport for [[gerrit:1193521|kowikisource: Add "해석" namespace (T406405)]], [[gerrit:1194916|kowiki: Restrict move ratelimit for non-extendedconfirmed users (T406849)]], [[gerrit:1195212|wmgMonologChannels: Set CheckUser to info level]], [[gerrit:1195408|hCaptcha: Enable on testwiki (T402366)]], [[gerrit:1193988|NetworkSession: enable only for private wikis]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:55:09] T406405: Create "해석" namespace @ kowikisource - https://phabricator.wikimedia.org/T406405 [07:55:10] T406849: Restrict kowiki move ratelimit for non-extendedconfirmed users - https://phabricator.wikimedia.org/T406849 [07:55:10] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:55:14] revi: you can verify your change now, if you want [07:55:26] ~ing [07:55:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:55:45] kowikisource looks good [07:55:57] dcausse: same for you, if you want [07:56:42] kostajh: mine is tricky to verify on debug servers so please go ahead [07:57:38] !log kharlan@deploy2002 revi, kharlan, dcausse: Continuing with sync [07:57:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:57:48] FIRING: [14x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:58:15] it's already syncing but kowiki one also look good [07:58:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ConfirmEdit] 
(wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195410 (https://phabricator.wikimedia.org/T406615) (owner: 10Kosta Harlan) [07:58:37] revi: thanks [07:59:00] (03CR) 10Kosta Harlan: [C:03+2] Fix locally failing QUnit tests [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195410 (https://phabricator.wikimedia.org/T406615) (owner: 10Kosta Harlan) [07:59:10] (03CR) 10Kosta Harlan: [C:03+2] Apply temporary account creation limit to /64 range for IPv6 IPs [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195400 (https://phabricator.wikimedia.org/T406710) (owner: 10Kosta Harlan) [07:59:12] (03CR) 10Kosta Harlan: [C:03+2] Add a short-term rate limit to temp account creation [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) (owner: 10Kosta Harlan) [07:59:45] FIRING: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:00:15] (03Merged) 10jenkins-bot: Fix locally failing QUnit tests [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195410 (https://phabricator.wikimedia.org/T406615) (owner: 10Kosta Harlan) [08:02:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:43] RESOLVED: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:04:01] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193521|kowikisource: Add "해석" namespace (T406405)]], [[gerrit:1194916|kowiki: Restrict move ratelimit for non-extendedconfirmed users (T406849)]], [[gerrit:1195212|wmgMonologChannels: Set CheckUser to info level]], [[gerrit:1195408|hCaptcha: Enable on testwiki (T402366)]], [[gerrit:1193988|NetworkSession: enable only for private wikis]] (duration: 14m 54s) [08:04:08] T406405: Create "해석" namespace @ kowikisource - https://phabricator.wikimedia.org/T406405 [08:04:08] T406849: Restrict kowiki move ratelimit for non-extendedconfirmed users - https://phabricator.wikimedia.org/T406849 [08:04:09] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [08:04:45] RESOLVED: [7x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:04:46] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1195410|Fix locally failing QUnit tests (T406615)]] [08:04:49] T406615: hCaptcha: Get QUnit tests to pass locally - https://phabricator.wikimedia.org/T406615 [08:08:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1199 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83776 and previous config saved to /var/cache/conftool/dbconfig/20251013-080837-root.json [08:09:01] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1195410|Fix locally failing QUnit tests (T406615)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:09:41] (03Merged) 10jenkins-bot: Apply temporary account creation limit to /64 range for IPv6 IPs [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195400 (https://phabricator.wikimedia.org/T406710) (owner: 10Kosta Harlan) [08:09:45] (03CR) 10CI reject: [V:04-1] Add a short-term rate limit to temp account creation [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) (owner: 10Kosta Harlan) [08:09:52] (03PS1) 10Jelto: apt: remove gitlab-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1195623 (https://phabricator.wikimedia.org/T406823) [08:09:54] (03PS1) 10Jelto: apt: remove gitlab-runner from buster and bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1195624 (https://phabricator.wikimedia.org/T406823) [08:10:15] !log kharlan@deploy2002 kharlan: Continuing with sync [08:10:41] (03CR) 10Jcrespo: [C:03+1] "I trust you, I can monitor after restart." [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:11:52] (03CR) 10Arnaudb: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1195623 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [08:14:23] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195410|Fix locally failing QUnit tests (T406615)]] (duration: 09m 38s) [08:14:55] still have two patches to go [08:15:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: 10Giuseppe Lavagetto) [08:17:28] (03PS1) 10Slyngshede: P:idp add the Trixie hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1195625 (https://phabricator.wikimedia.org/T406455) [08:17:46] (03PS2) 10Kosta Harlan: Add a short-term rate limit to temp account creation [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) [08:18:00] (03CR) 10Arnaudb: [C:03+1] 
"lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1195624 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [08:18:05] (03PS3) 10Kosta Harlan: Add a short-term rate limit to temp account creation [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) [08:18:20] kostajh: thanks for the deploy! [08:18:32] dcausse: yw! [08:19:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) (owner: 10Kosta Harlan) [08:21:15] (03CR) 10Arnaudb: "I am also curious about the reason behind why that backup would need to be performed on all instances of the cluster. maybe @dzahn@wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/1195432 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:25:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1195623 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [08:27:04] (03PS1) 10Marostegui: db1190: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195626 (https://phabricator.wikimedia.org/T406541) [08:27:35] (03CR) 10Marostegui: [C:03+2] db1190: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195626 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [08:27:38] (03CR) 10Arnaudb: [C:03+2] gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:28:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1190.eqiad.wmnet with reason: Maintenance [08:28:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1190 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83777 and previous config saved to 
/var/cache/conftool/dbconfig/20251013-082818-marostegui.json [08:28:21] (03CR) 10Muehlenhoff: "Let's also update profile::mariadb::ferm_misc, even if currently unused, otherwise this goes out-of-sync and causes subtle errors later." [puppet] - 10https://gerrit.wikimedia.org/r/1195625 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:29:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094 (10Nahid) 03NEW [08:31:54] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11267691 (10elukey) Example of diff: codfw: {F66747115} eqiad: {F66747117} diff: {F66747119} As far as I can see the main differences are not related to the data itself (the kanji... [08:33:16] (03Merged) 10jenkins-bot: Add a short-term rate limit to temp account creation [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195399 (https://phabricator.wikimedia.org/T405565) (owner: 10Kosta Harlan) [08:33:37] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1195399|Add a short-term rate limit to temp account creation (T405565)]], [[gerrit:1195400|Apply temporary account creation limit to /64 range for IPv6 IPs (T406710)]] [08:33:59] (03PS2) 10Slyngshede: P:idp add the Trixie hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1195625 (https://phabricator.wikimedia.org/T406455) [08:34:03] (03Merged) 10jenkins-bot: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:36:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83779 and previous config saved to /var/cache/conftool/dbconfig/20251013-083635-root.json [08:37:32] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1195399|Add a 
short-term rate limit to temp account creation (T405565)]], [[gerrit:1195400|Apply temporary account creation limit to /64 range for IPv6 IPs (T406710)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:39:34] (03CR) 10Elukey: [V:03+2 C:03+2] "Trivial fix, merging :)" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195598 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [08:41:32] !log kharlan@deploy2002 kharlan: Continuing with sync [08:43:25] (03PS3) 10Dreamy Jazz: ext.confirmEdit.hCaptcha.utils: Track hCaptcha execution rejections [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195628 (https://phabricator.wikimedia.org/T406925) (owner: 10Kosta Harlan) [08:45:41] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195399|Add a short-term rate limit to temp account creation (T405565)]], [[gerrit:1195400|Apply temporary account creation limit to /64 range for IPv6 IPs (T406710)]] (duration: 12m 04s) [08:46:19] last patch [08:46:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195628 (https://phabricator.wikimedia.org/T406925) (owner: 10Kosta Harlan) [08:51:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83781 and previous config saved to /var/cache/conftool/dbconfig/20251013-085141-root.json [08:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:53:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [08:57:40] (03PS1) 10Elukey: Create the 
amd-k8s-node-labeller binary package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195630 (https://phabricator.wikimedia.org/T373806) [08:59:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [09:00:45] (03Merged) 10jenkins-bot: ext.confirmEdit.hCaptcha.utils: Track hCaptcha execution rejections [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195628 (https://phabricator.wikimedia.org/T406925) (owner: 10Kosta Harlan) [09:01:08] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1195628|ext.confirmEdit.hCaptcha.utils: Track hCaptcha execution rejections (T406925)]] [09:01:11] T406925: hCaptcha: Fix execute duration timings and execution error logging - https://phabricator.wikimedia.org/T406925 [09:05:17] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1195628|ext.confirmEdit.hCaptcha.utils: Track hCaptcha execution rejections (T406925)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:06:19] !log kharlan@deploy2002 kharlan: Continuing with sync [09:06:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83782 and previous config saved to /var/cache/conftool/dbconfig/20251013-090647-root.json [09:10:27] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195628|ext.confirmEdit.hCaptcha.utils: Track hCaptcha execution rejections (T406925)]] (duration: 09m 19s) [09:10:31] T406925: hCaptcha: Fix execute duration timings and execution error logging - https://phabricator.wikimedia.org/T406925 [09:10:51] ok, done [09:11:19] !log UTC morning deploys done [09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:30] (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache: exclude logged-in users from requestctl logged_in_filters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: 10Giuseppe Lavagetto) [09:15:54] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [09:16:06] (03PS3) 10Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) [09:21:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83783 and previous config saved to /var/cache/conftool/dbconfig/20251013-092152-root.json [09:26:01] (03PS1) 10Marostegui: db1160: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195634 [09:26:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11267872 (10BTullis) 05Open→03Resolved This is now complete. ` root@druid1011:~# cat /proc/mdstat Personalities : [raid10] [linear... 
[09:26:30] (03CR) 10CI reject: [V:04-1] db1160: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195634 (owner: 10Marostegui) [09:27:24] (03PS2) 10Marostegui: db1160: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195634 (https://phabricator.wikimedia.org/T406541) [09:28:05] (03CR) 10Marostegui: [C:03+2] db1160: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195634 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [09:29:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1160 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83784 and previous config saved to /var/cache/conftool/dbconfig/20251013-092903-marostegui.json [09:29:48] (03PS4) 10Vgutierrez: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: 10Giuseppe Lavagetto) [09:30:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: Cloning [09:31:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1160.eqiad.wmnet with reason: Maintenance [09:31:29] (03PS3) 10Vgutierrez: haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) [09:31:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11267890 (10elukey) Codfw repooled! I also added a note to https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Maps_backend_upgraded_to_a_new_stack cc: @TheDJ for awar... 
[09:31:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez) [09:33:07] (03CR) 10Marostegui: "@fceratto@wikimedia.org can you give the testing some priority, I'd like to switch s4 eqiad master very soon and I'd need db1247 to be run" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:34:11] (03PS1) 10Phuedx: Port Java Pageview definition to bot detection [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195636 (https://phabricator.wikimedia.org/T406359) [09:34:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195636 (https://phabricator.wikimedia.org/T406359) (owner: 10Phuedx) [09:39:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83785 and previous config saved to /var/cache/conftool/dbconfig/20251013-093910-root.json [09:44:01] (03CR) 10Marostegui: [C:03+1] Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [09:45:38] (03CR) 10Vgutierrez: [C:03+2] haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez) [09:49:50] (03CR) 10Hnowlan: [C:03+1] thumbor: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195551 (owner: 10Muehlenhoff) [09:52:23] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194610 (https://phabricator.wikimedia.org/T406318) [09:53:04] 
(03CR) 10Federico Ceratto: [C:03+2] Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [09:53:33] 06SRE, 06Traffic, 06MediaWiki-Platform-Team (Radar), 13Patch-For-Review: Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11267952 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [09:54:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83786 and previous config saved to /var/cache/conftool/dbconfig/20251013-095416-root.json [09:58:36] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/2 (Core: asw1-b13-drmrs:et-0/0/50 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1000) [10:01:59] (03PS1) 10Jelto: apt: add thirdparty/gitlab-runner to bullseye-wikimedia-again [puppet] - 10https://gerrit.wikimedia.org/r/1195639 (https://phabricator.wikimedia.org/T406823) [10:02:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1195639 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [10:03:13] !log installing Linux 5.10.244 on Bullseye hosts [10:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:26] (03CR) 10Jelto: [C:03+2] apt: add thirdparty/gitlab-runner to bullseye-wikimedia-again [puppet] - 10https://gerrit.wikimedia.org/r/1195639 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [10:06:45] (03CR) 10Hashar: [C:03+2] Fix link to 
task in the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [10:07:46] (03Merged) 10jenkins-bot: Fix link to task in the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [10:08:17] !log hashar@deploy2002 Started deploy [gerrit/gerrit@93bde2a]: Fix link to task in the motd banner [10:08:30] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@93bde2a]: Fix link to task in the motd banner (duration: 00m 13s) [10:09:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83787 and previous config saved to /var/cache/conftool/dbconfig/20251013-100923-root.json [10:10:05] (03PS1) 10Marostegui: es1049: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195640 (https://phabricator.wikimedia.org/T406488) [10:10:53] (03CR) 10Marostegui: [C:03+2] es1049: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1195640 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [10:13:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83788 and previous config saved to /var/cache/conftool/dbconfig/20251013-101339-root.json [10:13:44] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [10:15:05] (03CR) 10Elukey: profile::thanos: fix xlab SLI's recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [10:16:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 1%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83789 and previous config saved to /var/cache/conftool/dbconfig/20251013-101645-root.json [10:16:50] (03PS2) 10Jelto: apt: remove gitlab-runner from buster and 
bullseye updates [puppet] - 10https://gerrit.wikimedia.org/r/1195624 (https://phabricator.wikimedia.org/T406823) [10:16:55] (03CR) 10Elukey: Introduce v1 xLab / MPIC SLOs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [10:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:18:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b13-drmrs and cr1-drmrs (185.15.58.148) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:19:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b13-drmrs:et-0/0/50 (Core: cr1-drmrs:et-0/0/2 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:22:38] uh... `Core BGP session down between asw1-b13-drmrs and cr1-drmrs (185.15.58.148)` expected topranks | XioNoX ? 
[10:24:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83790 and previous config saved to /var/cache/conftool/dbconfig/20251013-102428-root.json [10:25:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:28:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83791 and previous config saved to /var/cache/conftool/dbconfig/20251013-102845-root.json [10:28:50] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [10:29:19] vgutierrez: no that is not good.... 
sorry was afk just checking now [10:29:59] thx [10:31:03] and shit it went down 2 days ago now [10:31:16] :| [10:31:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 5%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83792 and previous config saved to /var/cache/conftool/dbconfig/20251013-103151-root.json [10:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:35:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1195624 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [10:37:10] (03PS1) 10Marostegui: db1247: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195642 (https://phabricator.wikimedia.org/T406541) [10:40:27] (03CR) 10Marostegui: [C:03+2] db1247: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195642 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [10:40:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:41:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1247.eqiad.wmnet with reason: Maintenance [10:41:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1247 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83793 and previous config saved to 
/var/cache/conftool/dbconfig/20251013-104131-marostegui.json [10:43:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83794 and previous config saved to /var/cache/conftool/dbconfig/20251013-104351-root.json [10:43:56] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [10:45:30] (03Restored) 10Tiziano Fogli: monitoring services: add migration task T357099 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155145 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:45:51] 06SRE, 06Infrastructure-Foundations, 10netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107 (10cmooney) 03NEW p:05Triage→03High [10:46:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 7%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83795 and previous config saved to /var/cache/conftool/dbconfig/20251013-104657-root.json [10:49:38] !log installing systemd bugfix updates on bullseye [10:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1247 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83796 and previous config saved to /var/cache/conftool/dbconfig/20251013-104952-root.json [10:55:18] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194610 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:56:20] (03CR) 10Vgutierrez: [C:03+1] profile::thanos: fix xlab SLI's recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [10:56:54] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 100% [puppet] - 
10https://gerrit.wikimedia.org/r/1194611 (https://phabricator.wikimedia.org/T406318) [10:58:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83797 and previous config saved to /var/cache/conftool/dbconfig/20251013-105857-root.json [10:59:02] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [11:02:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 10%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83798 and previous config saved to /var/cache/conftool/dbconfig/20251013-110203-root.json [11:04:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1247 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83799 and previous config saved to /var/cache/conftool/dbconfig/20251013-110458-root.json [11:08:09] (03PS5) 10Tiziano Fogli: monitoring services: add migration task T357099 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155145 (https://phabricator.wikimedia.org/T395443) [11:08:11] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T357099 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155145 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [11:10:22] (03PS5) 10Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) [11:10:44] (03CR) 10Giuseppe Lavagetto: cache: exclude logged-in users from requestctl logged_in_filters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: 10Giuseppe Lavagetto) [11:14:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83800 and previous config saved to 
/var/cache/conftool/dbconfig/20251013-111403-root.json [11:14:08] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [11:14:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11268153 (10JanWMF) approved [11:17:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 20%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83801 and previous config saved to /var/cache/conftool/dbconfig/20251013-111708-root.json [11:19:25] FIRING: [2x] SystemdUnitFailed: prometheus-debian-version-textfile.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1247 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83802 and previous config saved to /var/cache/conftool/dbconfig/20251013-112004-root.json [11:20:43] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:29:10] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83803 and previous config saved to /var/cache/conftool/dbconfig/20251013-112909-root.json [11:29:14] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [11:29:52] 06SRE, 06Infrastructure-Foundations, 10netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107#11268170 (10cmooney) Remote hands request id CS3321949 [11:30:14] > PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:33] ^^^ my bad ... I'll fix asap [11:32:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 25%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83804 and previous config saved to /var/cache/conftool/dbconfig/20251013-113214-root.json [11:32:43] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:32:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:32:59] !log installing openssl security updates on Bullseye [11:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:14] !log restarting blazegraph on wdqs1014 (BlazegraphFreeAllocatorsDecreasingRapidly) - `sudo depool && sleep 30 && sudo systemctl restart wdqs-blazegraph.service && sleep 30 && sudo pool` [11:33:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1247 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83805 and previous config saved to /var/cache/conftool/dbconfig/20251013-113510-root.json [11:39:25] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83806 and previous config saved to /var/cache/conftool/dbconfig/20251013-114415-root.json [11:44:20] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [11:44:46] (03PS1) 10Btullis: Update the permissions for the dse-k8s-csi user in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1195649 (https://phabricator.wikimedia.org/T404576) [11:45:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.major-upgrade [11:45:46] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7257/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195649 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [11:47:15] (03CR) 10Btullis: [V:03+1 C:03+2] Update the permissions for the dse-k8s-csi user in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1195649 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [11:47:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 30%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83807 and previous config saved to /var/cache/conftool/dbconfig/20251013-114720-root.json [11:48:17] (03PS1) 10Federico Ceratto: db2230.yaml: major 
MariaDB version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1195650 (https://phabricator.wikimedia.org/T406469) [11:48:17] (03CR) 10Federico Ceratto: "Upgrade test host" [puppet] - 10https://gerrit.wikimedia.org/r/1195650 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [11:48:17] fceratto@cumin1002 major-upgrade (PID 1694194) is awaiting input [11:51:27] (03CR) 10Marostegui: [C:03+1] db2230.yaml: major MariaDB version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1195650 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [11:57:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195630 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [11:58:27] (03CR) 10Jelto: [C:03+2] apt: remove gitlab-runner from buster and bullseye updates [puppet] - 10https://gerrit.wikimedia.org/r/1195624 (https://phabricator.wikimedia.org/T406823) (owner: 10Jelto) [11:59:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 50%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83808 and previous config saved to /var/cache/conftool/dbconfig/20251013-115921-root.json [11:59:25] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [12:02:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:13] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [12:02:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 50%: Host provisioned T406488', diff saved to 
https://phabricator.wikimedia.org/P83809 and previous config saved to /var/cache/conftool/dbconfig/20251013-120226-root.json [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:09] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 6.811 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:04:17] (03PS2) 10Tiziano Fogli: mailman3-web: revert to previous thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1195648 (https://phabricator.wikimedia.org/T395443) [12:05:26] (03CR) 10Federico Ceratto: [C:03+2] db2230.yaml: major MariaDB version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1195650 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [12:07:13] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [12:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:09:22] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:11:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org [12:14:05] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.518 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:14:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83810 and previous config saved to /var/cache/conftool/dbconfig/20251013-121427-root.json [12:14:32] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [12:16:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:16:52] (03PS1) 10Slyngshede: P:idp update CAS configuration for 7.2.X [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) [12:17:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 60%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83811 and previous config saved to /var/cache/conftool/dbconfig/20251013-121732-root.json [12:18:27] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:18:41] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Log all requests [puppet] - 10https://gerrit.wikimedia.org/r/1195656 (https://phabricator.wikimedia.org/T284558) [12:18:43] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:43] (03PS1) 10Majavah: 
P:toolforge::k8s::haproxy: Include host header in access log [puppet] - 10https://gerrit.wikimedia.org/r/1195657 (https://phabricator.wikimedia.org/T284558) [12:19:41] (03PS2) 10Slyngshede: P:idp update CAS configuration for 7.2.X [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) [12:20:16] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:20:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org [12:20:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki1002.eqiad.wmnet [12:23:02] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Log all requests [puppet] - 10https://gerrit.wikimedia.org/r/1195656 (https://phabricator.wikimedia.org/T284558) [12:23:02] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Include host header in access log [puppet] - 10https://gerrit.wikimedia.org/r/1195657 (https://phabricator.wikimedia.org/T284558) [12:25:46] (03PS3) 10Slyngshede: P:idp update CAS configuration for 7.2.X [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) [12:27:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki1002.eqiad.wmnet [12:28:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195664 [12:29:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83812 and previous config saved to /var/cache/conftool/dbconfig/20251013-122933-root.json [12:29:37] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [12:30:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [12:32:39] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'es1051 (re)pooling @ 75%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83813 and previous config saved to /var/cache/conftool/dbconfig/20251013-123238-root.json [12:34:49] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 5 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195656 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [12:35:26] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet [12:35:47] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet [12:37:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [12:37:27] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:39:22] FIRING: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:20] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet [12:40:30] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache1002.eqiad.wmnet [12:41:01] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet [12:41:06] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T384571 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1195675 (https://phabricator.wikimedia.org/T395443) [12:41:13] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet [12:41:22] (03PS4) 10Slyngshede: P:idp update 
CAS configuration for 7.2.X [puppet] - 10https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) [12:41:52] RESOLVED: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:51] (03CR) 10Elukey: [C:03+2] profile::thanos: fix xlab SLI's recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [12:44:22] FIRING: [7x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1049 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83814 and previous config saved to /var/cache/conftool/dbconfig/20251013-124439-root.json [12:44:43] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [12:45:11] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T384438 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1195676 (https://phabricator.wikimedia.org/T395443) [12:45:22] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1002.eqiad.wmnet [12:45:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:28] (03CR) 10Elukey: [V:03+2 C:03+2] Create the amd-k8s-node-labeller binary package. 
[debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195630 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [12:45:41] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache1003.eqiad.wmnet [12:46:09] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2002.codfw.wmnet [12:46:16] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet [12:46:52] RESOLVED: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1051 (re)pooling @ 100%: Host provisioned T406488', diff saved to https://phabricator.wikimedia.org/P83815 and previous config saved to /var/cache/conftool/dbconfig/20251013-124744-root.json [12:49:22] FIRING: [11x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:25] FIRING: [3x] SystemdUnitFailed: prometheus-debian-version-textfile.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:34] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1003.eqiad.wmnet [12:51:12] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet [12:51:34] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet [12:51:52] FIRING: [12x] ProbeDown: Service 
ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:53:57] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet [12:54:07] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet [12:54:22] RESOLVED: [12x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:25] FIRING: [4x] SystemdUnitFailed: prometheus-debian-version-textfile.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11268400 (10Marostegui) [12:56:34] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet [12:56:43] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet [12:57:48] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [12:57:54] !log dropped flaggedrevs tables on lawikisource (fT406424) [12:57:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:07] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet [12:59:22] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [12:59:46] (03PS1) 10Hnowlan: trafficserver: remove gateway-check group-specific routes for rest.php [puppet] - 10https://gerrit.wikimedia.org/r/1195679 (https://phabricator.wikimedia.org/T406318) [12:59:55] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: enwiki 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194611 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1300). [13:00:05] xSavitar and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:26] o/ [13:00:42] o/ [13:01:03] (03PS1) 10Hnowlan: Revert "trafficserver: rest-gateway routes for rest.php: enwiki 100%" [puppet] - 10https://gerrit.wikimedia.org/r/1195680 [13:01:09] I can self service my patch today [13:01:46] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [13:01:54] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [13:01:59] phuedx, once I'm done, I'll poke you. Sounds okay? 
[13:02:08] xSavitar: Sounds good [13:02:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187781 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:03:23] (03Merged) 10jenkins-bot: session: Enable MultiBackendSessionStore on `group2` wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187781 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:03:45] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1187781|session: Enable MultiBackendSessionStore on `group2` wikis (T402808)]] [13:03:49] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:04:16] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [13:04:25] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [13:04:28] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet [13:04:39] (03CR) 10Hnowlan: [C:03+2] Revert "trafficserver: rest-gateway routes for rest.php: enwiki 100%" [puppet] - 10https://gerrit.wikimedia.org/r/1195680 (owner: 10Hnowlan) [13:04:50] (03PS1) 10Elukey: Fix amd-node-labeller's postinst [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) [13:05:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [13:06:46] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet [13:06:48] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [13:06:54] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet [13:08:03] !log 
derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1187781|session: Enable MultiBackendSessionStore on `group2` wikis (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:34] * xSavitar testing... [13:09:13] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet [13:09:20] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet [13:10:25] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:00] Testing on debug hosts, all seems fine [13:11:03] syncing now [13:11:12] !log derick@deploy2002 derick, d3r1ck01: Continuing with sync [13:11:14] 06SRE, 06Infrastructure-Foundations, 10netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107#11268439 (10cmooney) They checked the fibres and reseated but no change. ` We have reseated the fiber and SFPs on both sides. However, QSFP port 2 on the cr1-drm... [13:11:41] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet [13:15:23] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1187781|session: Enable MultiBackendSessionStore on `group2` wikis (T402808)]] (duration: 11m 39s) [13:15:27] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:16:14] phuedx, done! I'll stay here for a while and look at Grafana visualizations and other places to make sure everything is working nicely. [13:16:20] But you can carry on, over to you. [13:16:48] Thanks! 
[13:17:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11268471 (10Marostegui) a:05Kappakayala→03Marostegui I will come up with a plan for the db*, pc*, dbproxy*, es* [13:17:04] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195551 (owner: 10Muehlenhoff) [13:18:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195636 (https://phabricator.wikimedia.org/T406359) (owner: 10Phuedx) [13:19:58] (03Merged) 10jenkins-bot: Port Java Pageview definition to bot detection [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195636 (https://phabricator.wikimedia.org/T406359) (owner: 10Phuedx) [13:20:19] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1195636|Port Java Pageview definition to bot detection (T406359)]] [13:20:22] T406359: Work on client-side Bot Detection - https://phabricator.wikimedia.org/T406359 [13:21:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11268499 (10Marostegui) @RobH - do you want me to add the plan in the task description or in comments like: ` dbproxy1024 C6 Can be done anytime - just need... [13:21:33] (03PS2) 10Elukey: Fix amd-node-labeller's postinst [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) [13:23:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11268506 (10Eevans) Provided that the moves happen one at a time (probably goes without saying), then the Cassandra hosts can be done at any time, and withou... 
[13:24:25] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1195636|Port Java Pageview definition to bot detection (T406359)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:26] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [13:24:41] Testing now w/ milimetric [13:25:02] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384571 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1195675 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:25:08] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384438 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1195676 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:25:29] (03CR) 10Muehlenhoff: Fix amd-node-labeller's postinst (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:25:37] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T407117 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1195685 (https://phabricator.wikimedia.org/T395443) [13:26:04] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:29:55] (03PS3) 10Elukey: Fix amd-node-labeller's install and postinst configs [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) [13:30:13] (03PS6) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [13:30:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11268531 (10jcrespo) For backup* and ms-backup* hosts, I would prefer to **stop the backup process (mediabackups)**. 
Nothing would be lost if they fail (th... [13:30:25] RESOLVED: SystemdUnitFailed: amd-k8s-node-labeller.service on ml-staging2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.major-upgrade [13:31:05] (CR) Elukey: Fix amd-node-labeller's install and postinst configs (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:31:38] (PS4) Elukey: Fix amd-node-labeller's install and postinst configs [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) [13:31:40] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [13:31:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.major-upgrade [13:32:18] (PS1) Btullis: Fix the definition of @dse_kubepods_networks to include codfw [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) [13:33:17] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7259/console" [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: Btullis) [13:33:23] (CR) Federico Ceratto: "tested with" [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto) [13:33:23] (PS1) Tiziano Fogli: monitoring services: add migration task T395448 to instances [puppet] - https://gerrit.wikimedia.org/r/1195693 (https://phabricator.wikimedia.org/T395443) [13:33:36] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [13:33:43] (PS2) Btullis: Fix the
definition of @dse_kubepods_networks to include codfw [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) [13:33:45] Confirmed that the instrument is updated [13:33:48] Continuing [13:33:53] !log phuedx@deploy2002 phuedx: Continuing with sync [13:34:14] !log btullis@cumin1003 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch [13:34:59] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7260/console" [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: Btullis) [13:36:35] (CR) Muehlenhoff: [C:+1] "Looks good" [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:37:57] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195636|Port Java Pageview definition to bot detection (T406359)]] (duration: 17m 39s) [13:38:01] T406359: Work on client-side Bot Detection - https://phabricator.wikimedia.org/T406359 [13:39:43] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T395448 to instances [puppet] - https://gerrit.wikimedia.org/r/1195693 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [13:40:22] There aren't any more patches to deploy [13:40:22] So [13:40:24] !log UTC afternoon backport window done [13:40:24] (CR) Vgutierrez: [C:+1] cache: exclude logged-in users from requestctl logged_in_filters [puppet] - https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: Giuseppe Lavagetto) [13:40:25] (PS5) Elukey: Fix amd-node-labeller's install and postinst configs [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) [13:40:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:42:10] (PS7) Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [13:43:08] (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1195439 (https://phabricator.wikimedia.org/T407092) (owner: Giuseppe Lavagetto) [13:43:14] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:43:48] (PS8) Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [13:45:16] (PS9) Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [13:46:01] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:46:07] (CR) Federico Ceratto: "Added small reliability improvement" [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto) [13:48:47] (CR) Elukey: Fix amd-node-labeller's install and postinst configs (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:49:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch [13:50:16] (CR) Elukey: Fix amd-node-labeller's install and postinst configs (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:51:20] (CR) CI reject: [V:-1] major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] -
https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto) [13:52:41] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11268583 (elukey) Open→Declined Tentatively closing this after few days, please re-open if needed! [13:53:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/2 (Core: asw1-b13-drmrs:et-0/0/50 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:53:38] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11268586 (elukey) [13:53:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between asw1-b13-drmrs and cr1-drmrs (185.15.58.148) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:53:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b13-drmrs:et-0/0/50 (Core: cr1-drmrs:et-0/0/2 {#D0101}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b13-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:54:37] (CR) Muehlenhoff: Fix amd-node-labeller's install and postinst configs (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:57:38] (PS1) Elukey: profile::amd_gpu: add the node-labeller package when needed [puppet] -
https://gerrit.wikimedia.org/r/1195703 (https://phabricator.wikimedia.org/T373806) [13:58:24] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:59:07] (CR) Elukey: Fix amd-node-labeller's install and postinst configs (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [13:59:12] (CR) Elukey: [V:+2 C:+2] Fix amd-node-labeller's install and postinst configs [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1195681 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [14:01:47] (PS3) Btullis: Update the definition of @dse_kubepods_networks [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) [14:03:11] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7261/co" [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: Btullis) [14:03:28] !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [14:04:30] (CR) Btullis: [C:+1] opensearch on k8s: add service definitions [puppet] - https://gerrit.wikimedia.org/r/1195342 (https://phabricator.wikimedia.org/T357753) (owner: Bking) [14:04:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [14:04:54] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be2057.codfw.wmnet [14:05:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11268645 (SKaram-WMF) [14:06:11] !log fceratto@cumin1002 START - Cookbook sre.mysql.major-upgrade [14:06:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host
ms-be2057.codfw.wmnet [14:06:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11268646 (SKaram-WMF) Added the public key. Thank you all! [14:06:58] !incidents [14:06:58] No incidents occurred in the past 24 hours for team SRE [14:07:23] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [14:09:24] (CR) Elukey: [C:+2] profile::amd_gpu: add the node-labeller package when needed [puppet] - https://gerrit.wikimedia.org/r/1195703 (https://phabricator.wikimedia.org/T373806) (owner: Elukey) [14:09:26] fceratto@cumin1002 major-upgrade (PID 1905662) is awaiting input [14:12:22] (PS1) Hnowlan: Revert^2 "trafficserver: rest-gateway routes for rest.php: enwiki 100%" [puppet] - https://gerrit.wikimedia.org/r/1195705 [14:12:46] (PS1) Federico Ceratto: db1176.yaml: major MariaDB version upgrade [puppet] - https://gerrit.wikimedia.org/r/1195706 (https://phabricator.wikimedia.org/T406469) [14:13:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [14:13:18] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [14:13:58] (PS4) Btullis: Update the definition of @dse_kubepods_networks [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) [14:14:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [14:14:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [14:15:09] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7263/co" [puppet] - https://gerrit.wikimedia.org/r/1195694 (https://phabricator.wikimedia.org/T404576) (owner: Btullis) [14:15:14] !log
eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-eqiad [14:16:03] (CR) Hnowlan: [C:+2] Revert^2 "trafficserver: rest-gateway routes for rest.php: enwiki 100%" [puppet] - https://gerrit.wikimedia.org/r/1195705 (owner: Hnowlan) [14:16:23] fceratto@cumin1002 major-upgrade (PID 1905662) is awaiting input [14:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:18:21] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11268703 (elukey) p:Triage→Medium [14:19:22] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:22] FIRING: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [14:20:48] !log rest.php on rest-gateway at 100% for enwiki (and all other wikis) [14:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [14:23:22] (PS2) Tiziano Fogli: monitoring services: add migration task T407120 to instances [puppet] - https://gerrit.wikimedia.org/r/1195701 (https://phabricator.wikimedia.org/T395443) [14:23:29] (CR)
Tiziano Fogli: [C:+2] monitoring services: add migration task T407120 to instances [puppet] - https://gerrit.wikimedia.org/r/1195701 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [14:24:22] RESOLVED: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:22] FIRING: [7x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1430) [14:31:52] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:03] ^ might actually be a result of restbase1032-b having issues?
[14:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:34:22] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:36:13] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [14:36:52] RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:05] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.725 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:37:44] (PS1) Elukey: profile::amd_gpu: apply the node labeller to all k8s nodes with a GPU [puppet] - https://gerrit.wikimedia.org/r/1195708 (https://phabricator.wikimedia.org/T373806) [14:39:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet [14:39:59] SRE, Infrastructure-Foundations, netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107#11268789 (cmooney) The QSFP replacement in cr1-drmrs did the trick: ` Oct 13 13:49:58 asw1-b13-drmrs mib2d[9179]: SNMP_TRAP_LINK_UP: ifIndex 574, ifAdminStatu...
[14:41:17] SRE, Infrastructure-Foundations, netops: drmrs: cr1-drmrs <-> asw1-b13-drmrs link down [Oct 2025] - https://phabricator.wikimedia.org/T407107#11268792 (cmooney) p:High→Low [14:41:52] FIRING: [18x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:37] (PS1) Elukey: admin_ng: deploy the cluster role for the GPU node labeller to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1195709 (https://phabricator.wikimedia.org/T373806) [14:43:20] (PS1) Tiziano Fogli: monitoring services: add migration task T384425 to instances [puppet] - https://gerrit.wikimedia.org/r/1195707 (https://phabricator.wikimedia.org/T395443) [14:43:22] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T384425 to instances [puppet] - https://gerrit.wikimedia.org/r/1195707 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [14:43:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet [14:46:52] RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [14:49:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [14:49:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1003.eqiad.wmnet [14:53:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1003.eqiad.wmnet [14:54:22] FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has
failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:12] (PS1) Muehlenhoff: Remove obsolete Hiera config [puppet] - https://gerrit.wikimedia.org/r/1195713 [14:57:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [14:57:31] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [14:59:22] RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:17] (CR) Elukey: [C:+1] Remove obsolete Hiera config [puppet] - https://gerrit.wikimedia.org/r/1195713 (owner: Muehlenhoff) [15:04:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [15:05:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [15:06:22] (PS1) Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) [15:06:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1004.eqiad.wmnet [15:06:52] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:27] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [15:09:12] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [15:10:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1004.eqiad.wmnet [15:11:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-etcd1005.eqiad.wmnet [15:11:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:11:52] RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [15:14:35] (PS1) Tiziano Fogli: monitoring services: add migration task T350694 to instances [puppet] - https://gerrit.wikimedia.org/r/1195712 (https://phabricator.wikimedia.org/T395443) [15:14:38] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T350694 to instances [puppet] - https://gerrit.wikimedia.org/r/1195712 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [15:15:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-etcd1005.eqiad.wmnet [15:16:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [15:16:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [15:17:10] !log mvernon@cumin2002 START -
Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [15:17:14] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [15:19:22] FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [15:23:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [15:24:22] RESOLVED: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:43] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [15:24:59] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [15:29:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [15:29:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1530). 
nyaa~ [15:31:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [15:31:18] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [15:31:52] FIRING: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:27] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:18] (PS3) Federico Ceratto: site.pp: Remove es2052 from insetup [puppet] - https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) [15:36:52] FIRING: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:46] SRE, SRE-SLO, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11269050 (elukey) @Mvolz while reviewing the mesh logs for citoid I noticed that there is a constant amount of HTTP 501s returned by zotero: [[...
[15:39:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [15:39:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [15:39:22] RESOLVED: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [15:42:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [15:44:22] FIRING: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:49] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1195655 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede) [15:46:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [15:46:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [15:47:34] (CR) Filippo Giunchedi: [C:+1] P:toolforge::k8s::haproxy: Log all requests [puppet] - https://gerrit.wikimedia.org/r/1195656 (https://phabricator.wikimedia.org/T284558) (owner: Majavah) [15:47:41] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [15:49:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [15:49:22] FIRING: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [15:51:18] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [15:51:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:51:52] RESOLVED: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:54:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [15:55:13] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:52] FIRING: [9x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [15:59:22] FIRING: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:58] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [16:02:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:22] RESOLVED: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:56] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [16:06:28] FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:09:00] (PS1) Tiziano Fogli: monitoring services: add migration task T407130 to instances [puppet] - https://gerrit.wikimedia.org/r/1195736 (https://phabricator.wikimedia.org/T395443) [16:09:03] (CR) Tiziano Fogli: [C:+2] monitoring services:
add migration task T407130 to instances [puppet] - https://gerrit.wikimedia.org/r/1195736 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [16:09:10] (PS1) Tiziano Fogli: monitoring services: add migration task T350694 to instances [puppet] - https://gerrit.wikimedia.org/r/1195735 (https://phabricator.wikimedia.org/T395443) [16:09:13] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T350694 to instances [puppet] - https://gerrit.wikimedia.org/r/1195735 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [16:09:22] FIRING: [7x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:34] (PS1) Tiziano Fogli: monitoring services: add migration task T407137 to instances [puppet] - https://gerrit.wikimedia.org/r/1195737 (https://phabricator.wikimedia.org/T395443) [16:09:36] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T407137 to instances [puppet] - https://gerrit.wikimedia.org/r/1195737 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [16:09:49] (PS1) Tiziano Fogli: monitoring services: add migration task T407138 to instances [puppet] - https://gerrit.wikimedia.org/r/1195738 (https://phabricator.wikimedia.org/T395443) [16:10:02] (PS1) Tiziano Fogli: monitoring services: add migration task T407141 to instances [puppet] - https://gerrit.wikimedia.org/r/1195739 (https://phabricator.wikimedia.org/T395443) [16:10:03] (CR) Tiziano Fogli: [C:+2] monitoring services: add migration task T407141 to instances [puppet] - https://gerrit.wikimedia.org/r/1195739 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli) [16:11:24] SRE, SRE-SLO, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it
to Citoid - https://phabricator.wikimedia.org/T391852#11269205 (10elukey) @Mvolz I added two new panels to https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid? in the traffic section, to compare the t... [16:11:52] FIRING: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:52] RESOLVED: [12x] ProbeDown: Service restbase1039-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:22] FIRING: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:22] RESOLVED: [12x] ProbeDown: Service restbase1040-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:52] FIRING: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:52] RESOLVED: [12x] ProbeDown: Service restbase1041-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:22] FIRING: [12x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:52] FIRING: [12x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:54:22] RESOLVED: [12x] ProbeDown: Service restbase1042-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:22] FIRING: [16x] ProbeDown: Service restbase1042-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:25] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test1001.eqiad.wmnet [16:59:27] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1700) [17:00:06] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T1700). 
[17:02:16] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:04:22] RESOLVED: [12x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:04:27] fceratto@cumin1002 makevm (PID 2225715) is awaiting input [17:05:16] fceratto@cumin1002 makevm (PID 2222611) is awaiting input [17:08:01] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [17:09:22] FIRING: [16x] ProbeDown: Service restbase1043-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:45] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:10:46] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host db-test1001.eqiad.wmnet [17:11:19] !log fceratto@cumin1002 START - Cookbook sre.ganeti.makevm for new host db-test2001.codfw.wmnet [17:11:21] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [17:11:45] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-eqiad [17:11:52] FIRING: [12x] ProbeDown: Service restbase1044-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:02] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:14:20] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [17:14:22] RESOLVED: [12x] ProbeDown: Service restbase1044-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:49] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1002" [17:18:44] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2001.codfw.wmnet - fceratto@cumin1002" [17:18:44] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:44] !log fceratto@cumin1002 START - Cookbook sre.dns.wipe-cache db-test2001.codfw.wmnet on all recursors [17:18:47] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test2001.codfw.wmnet on all recursors [17:19:23] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1002" [17:19:28] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2001.codfw.wmnet - fceratto@cumin1002" [17:19:58] !log fceratto@cumin1002 START - Cookbook sre.hosts.reimage for host db-test2001.codfw.wmnet with OS trixie [17:34:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:37:50] !log fceratto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage [17:43:27] !log 
fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test2001.codfw.wmnet with reason: host reimage [17:53:36] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:59:22] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test2001.codfw.wmnet with OS trixie [17:59:22] !log fceratto@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test2001.codfw.wmnet [18:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:39:32] marostegui@cumin1003 clone_es (PID 2506120) is awaiting input [18:39:45] (03PS1) 10Joely Rooke WMDE: Implement new usage types for statement with qualifiers and references [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) [18:54:07] (03PS1) 10MusikAnimal: Add 'accepted' status [extensions/CommunityRequests] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195756 (https://phabricator.wikimedia.org/T406674) [18:55:30] (03PS2) 10Joely Rooke WMDE: Implement new usage types for statement with 
qualifiers and references [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) [18:57:26] (03PS3) 10Joely Rooke WMDE: Implement new usage types for statement with qualifiers and references [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) [18:58:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) (owner: 10Joely Rooke WMDE) [18:59:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning [19:32:28] 06SRE, 10DNS, 10Domains: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156 (10SCampos-WMF) 03NEW [19:33:27] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:44:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning [19:44:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1027.eqiad.wmnet onto es1050.eqiad.wmnet [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T2000). 
[20:00:05] _Gerges, danisztls, and joelyrookewmde: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] hi! [20:02:10] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:47] 06SRE, 10DNS, 10Domains: Request to create the 25.wikipedia.org domain + 301 redirect to the org site - https://phabricator.wikimedia.org/T407156#11269579 (10SCampos-WMF) [20:06:28] FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:06:29] Is anyone around to deploy? [20:08:13] sorry, I'm late [20:09:37] If Gerges isn't available I can start [20:11:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:12:01] that would be lovely if you can! 
[20:12:47] (03Merged) 10jenkins-bot: Undeploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:13:05] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1191688|Undeploy Design Research participant recruitment survey on jawiki (T405577)]] [20:13:09] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:17:25] !log dani@deploy2002 dani: Backport for [[gerrit:1191688|Undeploy Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:18:01] !log dani@deploy2002 dani: Continuing with sync [20:19:56] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [20:22:06] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191688|Undeploy Design Research participant recruitment survey on jawiki (T405577)]] (duration: 09m 01s) [20:22:11] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:22:30] joelyrookewmde: all yours [20:22:47] hey hey! I actually can't deploy myself [20:23:01] I would need someone else to deploy, idk if you could do that? [20:25:45] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [20:27:08] joelyrookewmde: Probably not, sorry. I can deploy via spiderpig, trivial patches. [20:27:23] ahh ok, no worries then [20:27:55] I'll just try again in the morning deployment window. 
[20:29:57] i think it'll be required to avoid some temporary errors during the train deployment tomorrow [20:31:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Wikibase] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195755 (https://phabricator.wikimedia.org/T401290) (owner: 10Joely Rooke WMDE) [20:39:27] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [20:45:25] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [20:47:47] marostegui@cumin1003 clone_es (PID 2508773) is awaiting input [20:52:00] !log btullis@cumin1003 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [20:52:32] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:52:51] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [20:56:12] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet [20:57:35] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [20:58:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... 
[20:58:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [21:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T2100) [21:03:38] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [21:05:09] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet [21:05:30] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [21:08:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [21:08:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [21:09:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:10:34] PROBLEM - Host an-druid1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:10:52] RECOVERY - Host an-druid1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [21:23:16] (03PS1) 10Andrew Bogott: designate_sink: use shlex.quote() rather than the now-obsolete pipes.quote() [puppet] - 10https://gerrit.wikimedia.org/r/1195765 
(https://phabricator.wikimedia.org/T406516) [21:25:36] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:26:48] (03CR) 10Andrew Bogott: [C:03+2] designate_sink: use shlex.quote() rather than the now-obsolete pipes.quote() [puppet] - 10https://gerrit.wikimedia.org/r/1195765 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:35:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:38:16] (03PS1) 10Btullis: Cephosd: stop the csi plugins watching the namespace used for tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195766 (https://phabricator.wikimedia.org/T404576) [21:48:54] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [21:50:28] (03PS1) 10Btullis: analytics::launcher: Fix the ensure parameter on the drop_event timer [puppet] - 10https://gerrit.wikimedia.org/r/1195767 (https://phabricator.wikimedia.org/T402943) [21:50:55] (03CR) 10Btullis: [C:03+2] Cephosd: stop the csi plugins watching the namespace used for tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195766 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [21:51:56] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/1195767 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:52:51] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1003.eqiad.wmnet [21:53:36] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:53:54] (03CR) 10Btullis: [V:03+1 C:03+2] analytics::launcher: Fix the ensure parameter on the drop_event timer [puppet] - 10https://gerrit.wikimedia.org/r/1195767 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [21:58:26] (03Merged) 10jenkins-bot: Cephosd: stop the csi plugins watching the namespace used for tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195766 (https://phabricator.wikimedia.org/T404576) (owner: 10Btullis) [22:00:59] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:01:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [22:01:30] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[22:01:40] !log btullis@cumin1003 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes [22:11:59] (03PS1) 10Andrew Bogott: prometheus-mysqld-exporter: specify path to config file in $ARGS [puppet] - 10https://gerrit.wikimedia.org/r/1195769 [22:16:19] (03PS1) 10Superpes15: [enwikibooks] Set $wgAutoConfirmAge to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195770 (https://phabricator.wikimedia.org/T407080) [22:17:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:30:15] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet [22:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:34:02] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet [22:38:23] (03PS1) 10Btullis: Add the opensearch namespaces to the list of tenents for rbd in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195771 (https://phabricator.wikimedia.org/T397246) [22:44:06] (03CR) 10Btullis: [C:03+1] admin_ng: deploy the cluster role for the GPU node labeller to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195709 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [22:46:11] (03PS1) 10Andrew Bogott: reprepro: add trixie component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1195773 [23:00:04] Deploy window Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251013T2300) [23:04:06] 06SRE, 07CSS, 13Patch-For-Review: Update the errorpage template to use flex - https://phabricator.wikimedia.org/T392692#11269909 (10Ladsgroup) I will test and deploy this tomorrow if noone objects. [23:06:37] (03PS1) 10Btullis: Enable notifications for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195775 (https://phabricator.wikimedia.org/T402943) [23:09:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195756 (https://phabricator.wikimedia.org/T406674) (owner: 10MusikAnimal) [23:10:28] (03Merged) 10jenkins-bot: Add 'accepted' status [extensions/CommunityRequests] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195756 (https://phabricator.wikimedia.org/T406674) (owner: 10MusikAnimal) [23:10:49] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1195756|Add 'accepted' status (T406674)]] [23:10:53] T406674: Add a new vote-able status that is above "Under review" but before a triaged status - https://phabricator.wikimedia.org/T406674 [23:21:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:22] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:26:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:28:27] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes [23:33:27] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:34:22] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:36:04] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1195756|Add 'accepted' status (T406674)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:36:08] T406674: Add a new vote-able status that is above "Under review" but before a triaged status - https://phabricator.wikimedia.org/T406674 [23:36:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:16] !log musikanimal@deploy2002 musikanimal: Continuing with sync [23:50:50] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195756|Add 'accepted' status (T406674)]] (duration: 40m 01s) [23:50:54] T406674: Add a new vote-able status that is above "Under review" but before a triaged status - https://phabricator.wikimedia.org/T406674 [23:57:19] (03PS1) 10Btullis: Enable canary events on an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) [23:57:21] (03PS1) 10Btullis: Remove stray hiera value for migrated refinery job [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [23:57:23] (03PS1) 10Btullis: Migrate data_check refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [23:57:25] (03PS1) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [23:57:27] (03PS1) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [23:57:29] (03PS1) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [23:57:31] (03PS1) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 
10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [23:57:35] (03PS1) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [23:57:39] (03PS1) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [23:58:11] (03PS2) 10Btullis: Remove stray hiera value for migrated refinery job [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [23:58:11] (03PS2) 10Btullis: Enable notifications for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195775 (https://phabricator.wikimedia.org/T402943) [23:58:11] (03PS2) 10Btullis: Enable canary events on an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) [23:58:11] (03PS2) 10Btullis: Migrate data_check refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [23:58:12] (03PS2) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [23:58:15] (03PS2) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [23:58:19] (03PS2) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [23:58:23] (03PS2) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [23:58:27] (03PS2) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 
(https://phabricator.wikimedia.org/T402943) [23:58:31] (03PS2) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [23:59:52] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)