[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166943 (owner: TrainBranchBot)
[00:07:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313
[00:07:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[00:11:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:11:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:12:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:15:47] (CR) Xcollazo: "PPC looks good."
[puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo)
[00:27:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:30:22] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167313 (owner: TrainBranchBot)
[01:01:26] ops-eqiad, DC-Ops: Unresponsive management for thanos-be1006.mgmt:22 - https://phabricator.wikimedia.org/T399052 (phaultfinder) NEW
[01:13:33] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:18:33] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:19:36] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:26:47] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10986891 (Andrew) >>! In T394333#10986464, @Jclark-ctr wrote: > @dcaro @Andrew @cmooney @ayounsi I need some assistance.
I need to open a block of 4x...
[03:18:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[04:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:13:53] (PS1) Clare Ming: xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363)
[04:14:37] (CR) Clare Ming: [C:+2] xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[04:16:06] (Merged) jenkins-bot: xLab: Deploy v0.7.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167322 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[04:16:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:17:10]
RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:23:25] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[04:23:57] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[04:40:00] SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#10987045 (Joe) Quite a bit of the rationalization will depend upon the results of another hypothesis, the one about trusted bots. What we can however build while that's still being designed. What should...
[05:54:02] Deploying MinT on staging (staging only change)
[05:54:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:04] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.049 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:21] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0600)
[06:06:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:07:56] (PS1) KartikMistry: machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491)
[06:12:44] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987083 (Marostegui) Thank you I can reach them finely.
[06:13:23] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:14:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818
[06:14:36] T398818: Switchover m3 (phabricator) master db1213 -> db1250 - https://phabricator.wikimedia.org/T398818
[06:14:52] (CR) Volans: "I think you could take advantage of existing implementations in wmflib and spicerack. I've suggested them inline." [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[06:15:54] (PS1) Marostegui: m3 proxies: Add db1250 [puppet] - https://gerrit.wikimedia.org/r/1167329 (https://phabricator.wikimedia.org/T398818)
[06:16:24] (CR) Fabfur: [C:+2] varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur)
[06:17:04] (CR) Marostegui: [C:+2] m3 proxies: Add db1250 [puppet] - https://gerrit.wikimedia.org/r/1167329 (https://phabricator.wikimedia.org/T398818) (owner: Marostegui)
[06:18:23] (CR) KartikMistry: [C:+2] machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[06:19:59] (Merged) jenkins-bot: machinetranslation: Remove extra / from s3 URL [deployment-charts] - https://gerrit.wikimedia.org/r/1167328 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[06:20:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818
[06:20:29] T398818: Switchover m3 (phabricator) master db1213 -> db1250 -
https://phabricator.wikimedia.org/T398818
[06:21:01] (PS1) Marostegui: Revert "m3 proxies: Add db1250" [puppet] - https://gerrit.wikimedia.org/r/1167330
[06:21:04] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:21:32] (CR) Marostegui: [C:+2] Revert "m3 proxies: Add db1250" [puppet] - https://gerrit.wikimedia.org/r/1167330 (owner: Marostegui)
[06:23:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:27:00] (PS1) Marostegui: mariadb: Promote db1250 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1167375 (https://phabricator.wikimedia.org/T398818)
[06:27:54] (CR) Marostegui: [C:+2] mariadb: Promote db1250 to m3 master [puppet] - https://gerrit.wikimedia.org/r/1167375 (https://phabricator.wikimedia.org/T398818) (owner: Marostegui)
[06:28:51] I am going to switch phabricator master, expect around 1 minute of RO time https://phabricator.wikimedia.org/T398818
[06:29:19] !log Failover m3 from db1213 to db1250 - T398818
[06:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:22] T398818: Switchover m3 (phabricator) master db1213 -> db1250 - https://phabricator.wikimedia.org/T398818
[06:31:58] Done, RO was 30 seconds
[06:36:04] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:36:28] (PS1) Marostegui: db1213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167435 (https://phabricator.wikimedia.org/T398805)
[06:37:07] (CR) Marostegui: [C:+2] db1213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167435 (https://phabricator.wikimedia.org/T398805) (owner: Marostegui)
[06:38:34] (CR) Volans: "My 2.5
cents inline" [cookbooks] - https://gerrit.wikimedia.org/r/1164147 (owner: Ayounsi)
[06:43:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:44:34] SRE, LDAP-Access-Requests: Grant Access to for  - https://phabricator.wikimedia.org/T399020#10987164 (Aklapper) Open→Invalid Hi, per https://phabricator.wikimedia.org/tag/ldap-access-requests/ , `wmf` membership needs to be requested via IDM nowadays
[06:46:52] (PS1) Marostegui: db2232: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167437 (https://phabricator.wikimedia.org/T399060)
[06:47:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2232].codfw.wmnet,db[1207,1217].eqiad.wmnet with reason: migration to mariadb 10.11
[06:49:06] (CR) Marostegui: [C:+2] db2232: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1167437 (https://phabricator.wikimedia.org/T399060) (owner: Marostegui)
[06:53:04] (PS1) Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565)
[06:53:59] (CR) Elukey: services: configure tegola in codfw to use maps-test (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[06:54:08] kart_, o/
[06:54:48] here
[07:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0700). Please do the needful.
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:07] (PS1) Volans: cookbook API: expand argument_task_required docs [software/spicerack] - https://gerrit.wikimedia.org/r/1167442
[07:03:12] kart_, shall we start?
[07:03:28] sure. Give me a minute.
[07:04:34] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:04:45] abijeet: You can watch deployment at, https://spiderpig.wikimedia.org/jobs/288
[07:05:27] (Merged) jenkins-bot: CX: Add virtual-cx-shared DatabaseVirtualDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1152065 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:05:28] kart_, cool
[07:05:31] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: sync
[07:05:58] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]]
[07:06:01] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[07:08:08] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:10:12] abijeet: Testing please..
[07:10:19] kart_, I can test with the debug tool?
[07:10:25] Yes
[07:10:56] kart_, ok thanks. We just need to check that CX can still save drafts and publish, even if it's to the user namespace.
On it
[07:11:04] (CR) Vgutierrez: [C:+2] hiera: Issue dedicated certs for probenet endpoints [puppet] - https://gerrit.wikimedia.org/r/1167143 (https://phabricator.wikimedia.org/T398596) (owner: Vgutierrez)
[07:17:15] abijeet: I can save and load article
[07:17:22] (CR) Jgiannelos: "For what it's worth PCS *can* render output for files:" [deployment-charts] - https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: Hnowlan)
[07:17:33] kart_, testing on the mobile editor once
[07:20:29] (having to wait for the 10min "Review translation" dialog)
[07:20:37] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync
[07:20:49] Publishing on username space worked.
[07:22:52] kart_, looks good. let's keep an eye on the logs as well
[07:23:10] !log upload python3-docker-report to bookworm-wikimedia
[07:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:19] !log upload python3-docker-report 0.0.16 to bookworm-wikimedia
[07:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:56] (PS1) Jgiannelos: pcs: Disable staging profiler, set log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453
[07:26:03] !log kartik@deploy1003 kartik, abi: Continuing with sync
[07:26:09] abijeet: sure
[07:26:39] (CR) Filippo Giunchedi: [C:+1] pcs: Disable staging profiler, set log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:27:52] (PS2) Jgiannelos: pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453
[07:28:57] (PS1) Brouberol: airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907)
[07:28:58] (PS1) Brouberol: airflow-ml: enable the hadoop shell [deployment-charts] -
https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907)
[07:29:23] (CR) Jgiannelos: [C:+2] pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:31:09] !log installing nginx security updates
[07:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:12] (Merged) jenkins-bot: pcs: Set staging log level to info [deployment-charts] - https://gerrit.wikimedia.org/r/1167453 (owner: Jgiannelos)
[07:31:19] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152065|CX: Add virtual-cx-shared DatabaseVirtualDomains (T348513)]] (duration: 25m 21s)
[07:31:23] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[07:31:36] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:31:41] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:32:14] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:32:35] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:34:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1036', diff saved to https://phabricator.wikimedia.org/P78817 and previous config saved to /var/cache/conftool/dbconfig/20250709-073458-marostegui.json
[07:38:11] (PS1) Marostegui: mariadb: Productionize es1047 [puppet] - https://gerrit.wikimedia.org/r/1167520 (https://phabricator.wikimedia.org/T395771)
[07:39:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[07:42:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache
[07:42:10] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[07:42:17] !log fceratto@cumin1002 START - Cookbook
sre.mysql.parsercache
[07:42:17] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[07:42:20] (CR) Marostegui: [C:+2] mariadb: Productionize es1047 [puppet] - https://gerrit.wikimedia.org/r/1167520 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[07:46:29] (CR) Kevin Bazira: [C:+1] airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:50:38] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache
[07:50:39] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:50:59] (PS1) Jgiannelos: pcs: Use purge only requests for staging [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:52:28] (PS2) Jgiannelos: pcs: Use purge only requests for staging mobile-html transcludes [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:53:27] (CR) Kevin Bazira: [C:+1] airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:53:53] (CR) Brouberol: [C:+2] airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:53:55] (CR) Brouberol: [C:+2] airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:54:31] (CR) Hashar: "recheck after having deployed the CI config (If4e694a76891f65fa159b4e3c0aca26c996ffe6c and I426d3370f3d290f938bdefecc92b1e31e6300e3f)" [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[07:54:37] (PS3) Jgiannelos: pcs: Use purge only requests for staging
mobile-html transcludes [deployment-charts] - https://gerrit.wikimedia.org/r/1167524
[07:54:48] (CR) CI reject: [V:-1] rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[07:55:27] (Merged) jenkins-bot: airflow-ml: update the principal primary to analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1167508 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:55:35] (Merged) jenkins-bot: airflow-ml: enable the hadoop shell [deployment-charts] - https://gerrit.wikimedia.org/r/1167509 (https://phabricator.wikimedia.org/T398907) (owner: Brouberol)
[07:58:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[07:58:42] (PS3) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[07:58:51] (CR) Jgiannelos: "@hnowlan@wikimedia.org Happy to hear if you have a better to idea to only override the header using YAML anchors, but i didn't find a bett" [deployment-charts] - https://gerrit.wikimedia.org/r/1167524 (owner: Jgiannelos)
[07:58:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[07:58:58] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0
[08:00:04] andre and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T0800).
[08:00:12] o/
[08:01:02] (PS4) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:01:31] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]]
[08:01:36] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925
[08:02:45] (CR) Volans: [C:+1] "I'm not too familiar with this puppettization but the change looks ok to me." [puppet] - https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[08:03:57] !log aklapper@deploy1003 zabe, aklapper: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:04:46] !log aklapper@deploy1003 zabe, aklapper: Continuing with sync
[08:09:53] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167296|Remove stdClass type hint from ApiFeedContributions::feedItem() for now (T398925)]] (duration: 08m 21s)
[08:09:58] T398925: TypeError: MediaWiki\Api\ApiFeedContributions::feedItem(): Argument #1 ($row) must be of type stdClass, Flow\Formatter\ContributionsRow given, called in /srv/mediawiki/php-1.45.0-wmf.9/includes/api/ApiFeedContributions.php on l - https://phabricator.wikimedia.org/T398925
[08:11:07] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179)
[08:11:13] (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:59] (Merged) jenkins-bot: group1 to 1.45.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/1167529 (https://phabricator.wikimedia.org/T392179) (owner: TrainBranchBot)
[08:14:10] I will be rolling out a minor Netbox update in a few minutes.
See: https://phabricator.wikimedia.org/T397300
[08:15:27] (PS1) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:15:39] (PS2) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:16:49] (PS3) Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1167530
[08:17:07] !log slyngshede@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:17:35] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:18:26] !log Deploying Netbox v4.0.11 to production T397300
[08:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:29] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300
[08:18:54] (PS5) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:19:14] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167530 (owner: Fabfur)
[08:20:33] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.9 refs T392179
[08:20:37] T392179: 1.45.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T392179
[08:21:01] (PS6) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:21:06] (PS2) Hashar: rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:21:09] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production -
slyngshede@cumin1002
[08:21:53] (PS7) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[08:24:38] (CR) Hashar: "I have added:" [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:24:45] (CR) Hashar: [C:+1] rename build pipelines for sourcebot [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[08:26:48] (PS1) Brouberol: kafka-jumbo: enable ingress traffic from cumin masters [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005)
[08:27:19] (PS2) Brouberol: kafka-jumbo: enable ingress traffic from cumin masters [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005)
[08:27:31] slyngshede@cumin1002 python-code (PID 3995723) is awaiting input
[08:27:53] (CR) Marostegui: Add parsercache pooling/depooling cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[08:28:19] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6203/co" [puppet] - https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: Brouberol)
[08:28:48] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1002
[08:29:12] !log slyngshede@cumin1003 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1003
[08:33:26] slyngshede@cumin1003 python-code (PID
954055) is awaiting input
[08:34:30] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cephosd2001.codfw.wmnet
[08:35:16] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cephosd2001.codfw.wmnet
[08:38:06] (CR) Btullis: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: Xcollazo)
[08:40:27] (CR) Marostegui: Add parsercache pooling/depooling cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[08:40:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:42:34] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.11 to production - slyngshede@cumin1003
[08:46:14] SRE, Observability-Alerting: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423#10987361 (fgiunchedi) Open→Invalid I don't think we've seen a recorrence of this problem, and we fixed the host-related recoveries in {T264016}
[08:46:52] SRE, Observability-Alerting: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570#10987365 (fgiunchedi) Open→Invalid Related tasks have been resolved, resolving this one too
[08:47:39] SRE, Observability-Alerting: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624#10987367 (fgiunchedi) Open→Declined I'm boldly declining this task as part of the icinga/am migration
[08:50:39] RESOLVED: [2x] TransitBGPDown:
Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:55:27] 06SRE, 06SRE-OnFire, 10Observability-Alerting, 07I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896#10987381 (10fgiunchedi) 05Open→03Declined It looks like this is technically possible, though we'll need a subscription to L... [08:56:38] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: icinga login case mismatch - https://phabricator.wikimedia.org/T275920#10987388 (10fgiunchedi) 05Open→03Declined Given that Icinga is on its way out I'm boldly declining the task [08:57:36] 06SRE, 10Observability-Alerting: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662#10987392 (10fgiunchedi) 05Open→03Declined We are reworking metamonitoring to use alertmanager/prometheus instead, and icinga is on its way out thus declining the task [08:58:06] (03PS2) 10Elukey: profile::docker::reporter: move to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) [08:58:48] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407#10987401 (10fgiunchedi) 05Open→03Invalid Puppet now runs every 5m on the alert hosts, and AFAIK we haven't seen a reoccurrence of this? R... 
[09:01:41] !log Upgrade completed Netbox v4.0.11 T397300 [09:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300 [09:03:46] (03PS3) 10Elukey: profile::docker::reporter: move to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) [09:07:00] (03PS1) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:07:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987432 (10Marostegui) 05Resolved→03Open There's something wrong with these hosts' RAIDs ` root@es1048:~# pvs PV VG Fmt Attr PSize PFree /dev/sda3 tank lvm2 a-... [09:08:17] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:08:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10987435 (10Marostegui) Both hosts, es1047 and es1048 are showing the same issue. 
[09:10:45] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6205/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:11:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1001.eqiad.wmnet [09:13:37] (03PS1) 10Brouberol: deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) [09:14:02] (03PS2) 10Brouberol: deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) [09:15:17] (03CR) 10Elukey: "@rcoccioli@wikimedia.org I added a couple of changes, I realized that the Exec command was wrong :( We need something like:" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:15:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1001.eqiad.wmnet [09:16:16] (03PS2) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:16:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6206/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:17:11] (03CR) 10Btullis: [C:03+1] deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:17:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6207/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:17:41] (03CR) 10Btullis: [C:03+1] kafka-jumbo: enable ingress traffic from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [09:17:44] (03PS1) 10Slyngshede: data.yaml: Offboarding chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1167546 [09:18:19] (03CR) 10Brouberol: [V:03+1 C:03+2] kafka-jumbo: enable ingress traffic from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1167533 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [09:18:45] (03CR) 10Brouberol: [V:03+1 C:03+2] deployment_server: group chown all airflow kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1167543 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:19:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:19:40] (03CR) 10Brouberol: [C:03+1] Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:19:54] (03PS3) 10Btullis: Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) [09:20:30] (03PS1) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:20:57] (03CR) 10Btullis: [C:03+2] Ceph: configure the ceph::osd::excluded_slots per cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:21:13] (03CR) 
10Btullis: [V:03+1 C:03+2] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6208/" [puppet] - 10https://gerrit.wikimedia.org/r/1167542 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [09:21:26] (03PS2) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:21:35] (03PS3) 10Clément Goubert: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) [09:21:57] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:23:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host cephosd2001.codfw.wmnet [09:27:06] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10987462 (10ABran-WMF) >>! In T372804#10985583, @Dzahn wrote: > "determine if this is resolved once it's a warm standby host or if we switch production to... 
[09:29:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10987465 (10MoritzMuehlenhoff) [09:29:16] (03CR) 10Volans: [C:03+1] "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:29:39] (03PS1) 10Clément Goubert: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) [09:29:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10987466 (10MoritzMuehlenhoff) [09:29:40] (03PS1) 10Clément Goubert: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) [09:29:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10987467 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [09:30:34] (03CR) 10CI reject: [V:04-1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [09:34:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987482 (10elukey) I checked the BIOS configs via Redfish and they are different from what we expect, the cookbook fails since we expect `BootModeSelect` to be present... [09:35:16] (03CR) 10Arnaudb: [C:03+2] "thanks for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:35:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987484 (10klausman) >>! 
In T393948#10985332, @Jclark-ctr wrote: > @klausman Will this be legacy or uefi? it is reachable We don't have a particular preference for... [09:39:28] (03PS1) 10David Caro: aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) [09:41:29] (03Merged) 10jenkins-bot: gerrit: standardize expected rc on systemctl check [cookbooks] - 10https://gerrit.wikimedia.org/r/1167226 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:41:30] (03CR) 10Majavah: [C:03+1] aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) (owner: 10David Caro) [09:43:07] 10SRE-tools, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): More frequent Puppet runs on the alert hosts? - https://phabricator.wikimedia.org/T398444#10987493 (10Volans) I wonder if the prometheus servers have a similar behavior of applying changes from puppet exported resources. FYI th... 
[09:44:25] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) [09:45:34] (03PS1) 10Fabfur: varnish: remove X-Known-Client netmapper [puppet] - 10https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) [09:45:35] !log installing Zookeeper security updates on zk-flink [09:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:12] (03CR) 10David Caro: [C:03+2] aptly: add arm64 arch support [puppet] - 10https://gerrit.wikimedia.org/r/1167551 (https://phabricator.wikimedia.org/T398016) (owner: 10David Caro) [09:48:20] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [09:48:23] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [09:50:13] (03CR) 10Filippo Giunchedi: [C:03+1] pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:40] !log elukey@cumin1003 START - Cookbook 
sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:57:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 [09:58:00] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 (owner: 10PipelineBot) [09:58:05] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [09:58:08] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [09:59:11] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: add unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167554 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [09:59:57] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167557 (owner: 10PipelineBot) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1000) [10:03:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:03:11] (03CR) 10Clément Goubert: [C:03+2] mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:03:28] (03PS2) 10Clément Goubert: mwaint: Remove from scap [puppet] - 10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) [10:04:06] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:04:26] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:05:15] (03CR) 10Clément Goubert: [C:03+2] mwaint: Remove from scap [puppet] - 
10https://gerrit.wikimedia.org/r/1167196 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:05:36] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10987531 (10MoritzMuehlenhoff) [10:05:58] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10987532 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [10:06:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:06:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:06:47] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4045].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [10:06:50] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [10:11:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:13:25] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cephosd2001.codfw.wmnet [10:13:59] !log btullis@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host cephosd2001.codfw.wmnet [10:14:25] !log Cutting off access to mwmaint servers - T397017 [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:28] T397017: Turn down mwmaint production servers - https://phabricator.wikimedia.org/T397017 [10:14:33] (03CR) 10Clément Goubert: [C:03+2] mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:14:41] (03PS4) 10Clément Goubert: mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) [10:15:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1167546 (owner: 10Slyngshede) [10:15:58] (03CR) 10Clément Goubert: [C:03+2] mwmaint: deprecate mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1167197 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:18:59] (03PS1) 10FNegri: offboard-user: remove WMCS-related LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1167561 (https://phabricator.wikimedia.org/T398215) [10:19:27] (03CR) 10Ladsgroup: Catalog newsletter tables (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:32] (03PS5) 10Pppery: Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) [10:19:33] (03CR) 10Ladsgroup: [C:03+2] Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:35] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Catalog newsletter tables [puppet] - 10https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: 10Pppery) [10:19:42] (03PS3) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 
10https://gerrit.wikimedia.org/r/1167530 [10:20:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2001.codfw.wmnet [10:20:42] (03Abandoned) 10FNegri: offboard-user: remove WMCS-related LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1167561 (https://phabricator.wikimedia.org/T398215) (owner: 10FNegri) [10:20:51] (03PS1) 10Cathal Mooney: WMF Plugin: do not process disabled ports for block speed setting [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1167564 (https://phabricator.wikimedia.org/T394333) [10:24:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2001.codfw.wmnet [10:25:32] (03PS1) 10AikoChou: ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) [10:29:03] (03CR) 10Clément Goubert: [C:03+2] Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 (owner: 10Clément Goubert) [10:29:56] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10987593 (10Jhancock.wm) @elukey the only settings i can change on this boss card is to create and delete the virtual disk. I didn't see any other settings. 
[10:30:32] (03Merged) 10jenkins-bot: Revert "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167177 (owner: 10Clément Goubert) [10:30:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd2001.codfw.wmnet [10:33:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:06] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987612 (10Jhancock.wm) @Marostegui it does have a hardware raid. Feel free to change it and reimage it to your liking. [10:34:20] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1167546 (owner: 10Slyngshede) [10:36:20] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:37:04] !log Restoring memory limits on mw-cron - T395436 - T395465 [10:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:09] T395436: Limit CPU usage for mw-on-k8s cli deployments - https://phabricator.wikimedia.org/T395436 [10:37:09] T395465: Investigate EQIAD daily completion suggester rebuild failure - https://phabricator.wikimedia.org/T395465 [10:38:35] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:38:42] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cephosd1001.eqiad.wmnet [10:39:14] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cephosd1001.eqiad.wmnet [10:39:30] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: Use a separate site for port 80 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:42:07] PROBLEM - Host cephosd2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:44:00] (03CR) 10Vgutierrez: pyrra: remove multi-dc for istio-based SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:44:16] (03CR) 10Hnowlan: pcs: Use purge only requests for staging mobile-html transcludes (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (owner: 10Jgiannelos) [10:45:28] (03PS1) 10Zabe: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) [10:45:39] (03PS1) 10Zabe: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) [10:45:48] (03CR) 10Vgutierrez: "even with this refactor we are looking at a class with 520 lines already, could it make sense to split this per service?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [10:47:13] (03PS4) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 [10:47:34] (03CR) 10Hnowlan: [C:03+1] PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:47:48] (03PS5) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 [10:48:14] (03CR) 10Hnowlan: [C:03+1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:48:38] (03CR) 10Hnowlan: [C:03+1] PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:49:43] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:50:24] (03CR) 10Elukey: [C:03+2] admin_ng: update knative's queue proxy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165850 (owner: 10Elukey) [10:50:24] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update edit-check image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167565 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [10:51:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:51:20] (03PS1) 10Aqu: data-engineering: Refine switch over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) 
[10:51:37] !log elukey@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:51:38] (03CR) 10Fabfur: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:51:57] !log elukey@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:52:08] (03Merged) 10jenkins-bot: PS.php: Disable secondary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167548 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [10:52:31] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167548|PS.php: Disable secondary poolcounters for reboot (T395240)]] [10:52:53] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@3a0cdd4]: bump image suggestions to v1.8.0 [10:53:17] 10SRE-tools, 06Data-Platform-SRE, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10987669 (10brouberol) [10:53:30] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@3a0cdd4]: bump image suggestions to v1.8.0 (duration: 00m 48s) [10:53:37] (03PS4) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 [10:53:46] (03PS1) 10Zabe: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) [10:53:51] RECOVERY - Host cephosd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [10:53:56] (03PS1) 10Zabe: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) [10:54:31] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:54:48] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167548|PS.php: Disable 
secondary poolcounters for reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:54:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:55:35] PROBLEM - Bird Internet Routing Daemon on cephosd2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:55:36] !log cgoubert@deploy1003 cgoubert: Continuing with sync [10:56:35] RECOVERY - Bird Internet Routing Daemon on cephosd2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:57:28] (03PS1) 10Majavah: dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 [10:58:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10987673 (10elukey) Note for myself: ` BIOS - Found a NIC device: P1_AIOMAOC_AG_i2LAN1OPROM Set PXE to the NIC P1_AIOMAOC_AG_i2LAN1OPROM BIOS: P1_AIOMAOC... [10:59:01] (03CR) 10Fabfur: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [10:59:21] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [11:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1100) [11:01:04] (03PS1) 10Ladsgroup: mariadb: Remove tables that are not cataloged from filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) [11:01:34] (03CR) 10LD: "@dreamyjazzwikipedia@gmail.com As noted in CorePermissions, using 'ukwiki' (without the +) might fully override the default configuration " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [11:02:02] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167548|PS.php: Disable secondary poolcounters for reboot (T395240)]] (duration: 09m 30s) [11:02:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:50] (03PS6) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:04:11] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (owner: 10Fabfur) [11:04:38] (03CR) 10Ladsgroup: mariadb: Remove tables that are not cataloged from filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [11:04:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10987687 (10elukey) I tried to reimage after two run of provision with uefi, and this is what I get: ` ┌────────────────────┤ [!!] Configure the network... 
[11:05:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987688 (10cmooney) >>! In T394333#10964951, @Andrew wrote: > That should be possible as long as I can get support with refactoring... [11:05:13] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1053.eqiad.wmnet with OS bookworm [11:05:24] (03PS7) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:06:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2006.codfw.wmnet [11:07:04] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10987690 (10brouberol) To illustrate the proposal, this is one of many things you can do with an admin client: `lang=python >>> from ka... [11:07:42] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:08:32] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987693 (10Marostegui) >>! In T393042#10987612, @Jhancock.wm wrote: > @Marostegui it does have a hardware raid. Feel free to change it and reimage it to your liking. Would you be... 
[11:08:35] (03PS8) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:09:37] !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167530 [11:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:54] (03PS9) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:09:55] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2006.codfw.wmnet [11:10:14] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1007.eqiad.wmnet [11:11:32] (03PS10) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:11:52] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:12:07] (03PS5) 10Fabfur: cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (https://phabricator.wikimedia.org/T399071) [11:13:25] (03CR) 10LD: "As I can't edit the patch, I suggested this here: https://phabricator.wikimedia.org/F63617907" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [11:13:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1007.eqiad.wmnet [11:14:49] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp[4037,4045].ulsfo.wmnet} 
and A:cp - 2.8.15 upgrade (T398720) [11:14:52] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:15:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:15:47] (03CR) 10Fabfur: [C:03+2] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1167530 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:17:14] (03CR) 10CI reject: [V:04-1] PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:18:14] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987728 (10Marostegui) @Jhancock.wm this host isn't accessible, so I cannot even do anything with it. Do you think, if I provide you with a hostname we can go ahead and "treat it li... [11:18:28] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:18:46] (03PS2) 10Clément Goubert: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) [11:18:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10987729 (10Marostegui) >>! In T393042#10987728, @Marostegui wrote: > @Jhancock.wm this host isn't accessible, so I cannot even do anything with it. Do you think, if I provide you wi... 
[11:19:20] (03PS1) 10Fabfur: cache::haproxy: rename http frontend [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) [11:20:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:20:44] (03PS11) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:20:52] (03PS8) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:20:55] (03Merged) 10jenkins-bot: PS.php: Disable primary poolcounters for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167549 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:21:21] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] [11:22:23] (03PS12) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:23:28] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [11:23:29] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:23:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [11:24:16] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:26:32] (03PS13) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:28:08] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:28:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:29:40] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167549|PS.php: Disable primary poolcounters for reboot (T395240)]] (duration: 08m 19s) [11:31:03] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [11:31:06] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [11:31:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2005.codfw.wmnet [11:32:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:32:08] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [11:33:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2002.codfw.wmnet [11:33:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'depool pc2011', diff saved to https://phabricator.wikimedia.org/P78821 and previous config saved to /var/cache/conftool/dbconfig/20250709-113322-marostegui.json [11:33:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:33:41] (03PS1) 10Btullis: ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) [11:34:16] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'pool pc2011', diff saved to https://phabricator.wikimedia.org/P78823 and previous config saved to /var/cache/conftool/dbconfig/20250709-113413-marostegui.json [11:34:41] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter1006.eqiad.wmnet [11:34:48] (03PS1) 10Fabfur: cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) [11:35:04] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6210/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:35:38] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2005.codfw.wmnet [11:35:48] (03PS1) 10Michael Große: Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) [11:35:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [11:37:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2002.codfw.wmnet [11:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'depool pc1', diff saved to https://phabricator.wikimedia.org/P78824 and previous config saved to /var/cache/conftool/dbconfig/20250709-113717-marostegui.json [11:38:03] (03PS2) 10Fabfur: cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) [11:38:32] !log marostegui@cumin1002 dbctl commit (dc=all): 
'pool pc1', diff saved to https://phabricator.wikimedia.org/P78826 and previous config saved to /var/cache/conftool/dbconfig/20250709-113831-marostegui.json [11:38:36] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter1006.eqiad.wmnet [11:39:04] (03PS3) 10Clément Goubert: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) [11:39:36] (03PS9) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:40:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:40:43] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [11:40:51] (03Merged) 10jenkins-bot: PS.php: Restore poolcounter config post-reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167550 (https://phabricator.wikimedia.org/T395240) (owner: 10Clément Goubert) [11:41:03] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:41:17] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] [11:41:33] (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename backend httpreqrate too [puppet] - 10https://gerrit.wikimedia.org/r/1167586 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:41:41] !log cmooney@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host cloudvirt1073 [11:42:00] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1073 [11:43:24] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:43:26] (03CR) 10Hnowlan: [C:04-1] pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:43:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2003.codfw.wmnet [11:44:21] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:45:14] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [11:47:06] (03PS14) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) [11:47:20] (03CR) 10Jgiannelos: pcs: Use purge only requests for mobile-html transcludes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:48:00] (03CR) 10Hnowlan: [C:03+1] pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:48:29] !log puppet enabled again on A:cp (T399071) [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] T399071: Split haproxy configuration in different files - https://phabricator.wikimedia.org/T399071 [11:48:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2003.codfw.wmnet 
[11:49:01] (03CR) 10Brouberol: [C:03+1] ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:49:12] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:49:41] (03PS2) 10Fabfur: cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) [11:49:56] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167550|PS.php: Restore poolcounter config post-reboot (T395240)]] (duration: 08m 39s) [11:50:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:50:43] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:50:52] (03PS10) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [11:51:03] (03CR) 10Jgiannelos: [C:03+2] pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:51:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987796 (10cmooney) 1050 and 1051 are now connected and ports up too. ` cmooney@cloudsw1-f4-eqiad> show interfaces descriptions | ma... 
[11:51:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:51:57] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:52:44] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:52:51] (03Merged) 10jenkins-bot: pcs: Use purge only requests for mobile-html transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167524 (https://phabricator.wikimedia.org/T397750) (owner: 10Jgiannelos) [11:52:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:53:39] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:54:02] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [11:54:50] (03PS1) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [11:55:06] (03CR) 10Btullis: [V:03+1 C:03+2] ceph::osds - Use sdparm instead of hdparm to disable the write cache [puppet] - 10https://gerrit.wikimedia.org/r/1167585 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [11:55:44] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:55:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:55:59] (03PS2) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [11:56:04] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:56:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:56:49] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:57:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:57:14] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudcephosd1048,49 - jclark@cumin1002" [11:57:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns cloudcephosd1048,49 - jclark@cumin1002" [11:57:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:57:52] jelto@cumin1003 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:57:59] (03PS1) 10Mhorsey: Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) [11:58:24] (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename http frontend and backend to pristine name [puppet] - 10https://gerrit.wikimedia.org/r/1167578 (https://phabricator.wikimedia.org/T399071) (owner: 10Fabfur) [11:58:36] heads up, i am deploying some changes in changeprop [11:59:35] (03CR) 10Sergio Gimeno: [C:03+1] Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [12:00:36] (03CR) 10David Caro: "Allowed puppet to continue running in tools (expected warning message I think):" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:00:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [12:01:53] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [12:02:17] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad 
[12:02:26] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:02:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:03:59] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:05:33] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:05:44] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [12:06:35] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 116757 bytes in 1.455 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:07:02] (03CR) 10Majavah: [C:03+1] "I'm fine with removing the `ensure => latest` bit, but also wonder whether this definition should stay in `profile::toolforge::bastion` or" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:07:24] !log brouberol@cumin1003 END (FAIL) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=1) rolling restart_daemons on A:kafka-jumbo-eqiad [12:07:25] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [12:08:12] !log installing nginx security updates [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:20] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=97) rolling restart_daemons on A:kafka-jumbo-eqiad [12:09:19] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [12:09:33] ^ these are logged by 
test-cookbook, which performs no action [12:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:11:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:12:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:20] !log installing openjdk-17 security updates [12:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:15:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:16:15] (03PS3) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:16:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [12:17:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye 
[12:17:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.... [12:17:18] (03PS3) 10David Caro: toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 [12:17:19] (03PS1) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:17:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.... [12:18:00] (03CR) 10CI reject: [V:04-1] toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:19:06] (03PS2) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:21:18] (03CR) 10CI reject: [V:04-1] toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:22:15] (03CR) 10CI reject: [V:04-1] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:25:52] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [12:25:55] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:27:22] (03CR) 10David Caro: "It's a package part of toolforge, that installs tools that you need 
for toolforge, I would even be tempted to rename the package `toolforg" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:30:19] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[4052].ulsfo.wmnet} and A:cp - 2.8.15 upgrade (T398720) [12:30:26] (03CR) 10Majavah: "Yeah, the difference being is that misctools is specific to operations that need to happen on the bastion (`take` for example is specifica" [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:30:45] (03CR) 10Majavah: [C:04-1] "As explained on -cloud-admin, I do not think this is a good idea." [puppet] - 10https://gerrit.wikimedia.org/r/1167594 (owner: 10David Caro) [12:33:56] (03PS4) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:34:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [12:34:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:36:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:36:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:27] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.15 upgrade (T398720) [12:37:30] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [12:37:57] (03PS1) 10Hashar: Add readonly pugin [software/gerrit] 
(deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) [12:38:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [12:38:36] (03CR) 10Volans: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:39:15] !log slyngshede@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.15 upgrade (T398720) [12:39:17] (03PS2) 10Aqu: data-engineering: Refine switch over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [12:39:17] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:39:23] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1050 [12:39:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1050 [12:39:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1051 [12:39:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1051 [12:40:00] (03CR) 10Hashar: [C:03+2] Add readonly pugin [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [12:40:16] (03PS3) 10David Caro: toolforge: skip toolforge clis from unattended upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1167594 [12:40:40] (03Merged) 10jenkins-bot: Add readonly pugin [software/gerrit] 
(deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1167605 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [12:41:05] (03CR) 10Aqu: "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:41:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:41:47] (03PS5) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [12:41:55] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:42:38] (03CR) 10Xcollazo: "Thanks Ben. Could you please +2 when you have a min?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167286 (https://phabricator.wikimedia.org/T399013) (owner: 10Xcollazo) [12:44:01] (03CR) 10David Caro: [C:03+2] toolforge: install misctools as any other toolforge package [puppet] - 10https://gerrit.wikimedia.org/r/1167590 (owner: 10David Caro) [12:44:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1051 [12:44:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1051 [12:47:29] (03CR) 10Daimona Eaytoy: [C:03+1] Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [12:48:20] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:48:33] !log hashar@deploy1003 Started deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 [12:48:39] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833 [12:48:44] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 (duration: 00m 11s) [12:48:45] jclark@cumin1002 reimage (PID 149651) is awaiting input [12:49:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987962 (10Jclark-ctr) @elukey i am having issues with 2 servers both fail to reimage after switching to 25g dac . cloudcephosd... 
[12:49:14] (03CR) 10Jgiannelos: [C:03+1] "Needs rebase and bump in version but lets try this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [12:50:35] !log hashar@deploy1003 Started deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 [12:50:46] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@9666238]: Add readonly plugin - T387833 (duration: 00m 10s) [12:53:40] (03CR) 10Btullis: [C:03+1] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:54:11] !log installing jetty9 security updates [12:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:19] (03PS1) 10Brouberol: spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) [12:54:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet [12:54:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:55:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:55:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [12:55:18] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6214/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:55:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team 
(Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmne... [12:55:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987971 (10Jclark-ctr) [12:56:17] (03CR) 10Volans: "LGTM, couple of nits and a question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:11] (03CR) 10Btullis: [C:03+1] spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:15] (03CR) 10Federico Ceratto: "(replied few comments)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [12:57:22] (03PS1) 10KartikMistry: machinetranslation: staging: Update MinT to 2025-07-09-124154-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167608 (https://phabricator.wikimedia.org/T335491) [12:57:25] (03CR) 10Volans: [C:03+1] "This is great, thanks for adding it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [12:57:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:58:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [12:58:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:58:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye [12:58:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet [12:58:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmne... [12:58:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987977 (10Jclark-ctr) [12:59:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [12:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... 
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300). [13:00:05] MichaelG_WMF and houseofm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] * MichaelG_WMF is here [13:00:15] (03PS3) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [13:00:21] o/ [13:00:58] (03CR) 10Marostegui: Add parsercache pooling/depooling cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [13:00:58] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:01:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:01:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10987985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... 
[13:02:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:49] (03PS6) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [13:02:51] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:03:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:03:48] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:04:52] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-07-02-122843 to 2025-07-08-183416 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167613 (https://phabricator.wikimedia.org/T397355) [13:04:56] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-07-02-123323 to 2025-07-09-124522 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167614 (https://phabricator.wikimedia.org/T397355) [13:04:59] (03CR) 10Muehlenhoff: [C:03+2] New structure for sshd_config starting with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:05:04] @MichaelG_WMF @HouseOfM I guess I can deploy [13:05:22] sergi0: thank you <3 [13:05:44] (03PS1) 10Vgutierrez: hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 
(https://phabricator.wikimedia.org/T394484) [13:05:54] (03CR) 10Brouberol: [V:03+1 C:03+2] spicerack: add kafka-test-eqiad to spicerack/kafka/config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1167607 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:06:08] ty @sergi0 [13:06:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:07:45] I think we can do both together [13:07:47] @HouseOfM you'll run the script after deployment? [13:08:29] Daimona will be running it, but not committing the results yet [13:08:39] ack [13:08:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [13:08:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [13:09:36] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [13:09:47] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for thanos-be1006.mgmt:22 - https://phabricator.wikimedia.org/T399052#10988001 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated cable and reset idrac [13:09:49] (03Merged) 10jenkins-bot: Growth: Enable limiting Add Link for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167587 (https://phabricator.wikimedia.org/T396382) (owner: 10Michael Große) [13:10:05] (03Merged) 10jenkins-bot: Add new script to update old freetext country data new schema [extensions/CampaignEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167592 (https://phabricator.wikimedia.org/T397270) (owner: 10Mhorsey) [13:10:27] !log sgimeno@deploy1003 
Started scap sync-world: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] [13:10:32] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:10:32] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:11:02] (03CR) 10Brouberol: "That seems to be working! I tested it on kafka-test-eqiad, which has the following brokers:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:11:47] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=97) rolling restart_daemons on A:kafka-test-eqiad [13:12:04] (03PS2) 10Hnowlan: changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) [13:12:31] !log sgimeno@deploy1003 mhorsey, sgimeno, migr: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:14:01] I can see it working with the debug extension [13:14:01] (03CR) 10Marostegui: [C:03+1] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [13:14:29] (03PS2) 10Vgutierrez: hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) [13:14:43] @HouseOfM should I sync already? Or is Daimona giving a try now? 
[13:14:49] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:14:51] Please sync [13:14:55] alright [13:14:56] (03CR) 10Andrew Bogott: [C:03+1] dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 (owner: 10Majavah) [13:15:01] from my side, we would be good to move forward too [13:15:03] Yep you can go ahead, thank you! [13:15:12] !log sgimeno@deploy1003 mhorsey, sgimeno, migr: Continuing with sync [13:15:16] (03CR) 10Marostegui: "Question, for the filtered tables, we have nothing to do regarding sanitarium right, this is transparent to any of it, correct?" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [13:16:33] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: fix unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167619 (https://phabricator.wikimedia.org/T395442) [13:16:50] (03CR) 10Majavah: [C:03+2] dynamicproxy: Allow normal users to delete deprecated proxies [puppet] - 10https://gerrit.wikimedia.org/r/1167575 (owner: 10Majavah) [13:17:00] (03CR) 10Volans: "Nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:17:47] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: fix unaudited tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1167619 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [13:18:08] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:18:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:18:45] (03CR) 10Volans: kafka.roll-restart-reboot-broker: perform action on controller last (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:19:45] (03CR) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:20:27] (03CR) 10Hnowlan: [C:03+2] changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [13:20:35] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167592|Add new script to update old freetext country data new schema (T397270)]], [[gerrit:1167587|Growth: Enable limiting Add Link for dewiki (T396382)]] (duration: 10m 07s) [13:20:40] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:20:40] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:21:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1041', diff saved to https://phabricator.wikimedia.org/P78828 and previous config saved to /var/cache/conftool/dbconfig/20250709-132111-marostegui.json [13:21:23] changes are live! [13:21:51] (03PS7) 10Brouberol: kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) [13:21:56] thanks! 
[13:21:57] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1041.eqiad.wmnet with reason: Maintenance [13:22:12] (03Merged) 10jenkins-bot: changeprop: don't process File: pages for mobile html pages in PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167249 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [13:22:32] thanks! [13:24:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1041.eqiad.wmnet [13:25:19] (03PS9) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [13:25:33] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.15 upgrade (T398720) [13:25:37] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720 [13:26:35] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.15 upgrade (T398720) [13:27:06] (03CR) 10Ssingh: [C:03+1] "Verified per-site hieras." [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:28:11] Is that all for the backport window? If so, I'll run the script [13:29:04] (03CR) 10Volans: [C:03+1] "LGTM, thanks a lot!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:30:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet [13:31:06] (03CR) 10Ssingh: [C:03+2] team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:31:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1041.eqiad.wmnet [13:31:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1041.eqiad.wmnet with reason: Maintenance [13:31:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1041.eqiad.wmnet [13:32:09] (03CR) 10Brouberol: [C:03+2] kafka.roll-restart-reboot-broker: perform action on controller last [cookbooks] - 10https://gerrit.wikimedia.org/r/1167593 (https://phabricator.wikimedia.org/T399005) (owner: 10Brouberol) [13:32:57] (03Merged) 10jenkins-bot: team-traffic: add dnsbox alert for service status mismatch [alerts] - 10https://gerrit.wikimedia.org/r/1166225 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:33:49] (03CR) 10Elukey: [C:03+1] cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:34:06] (03CR) 10Volans: [C:03+2] cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:34:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet [13:34:37] jouncebot: nowandnext [13:34:38] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300) [13:34:38] In 0 hour(s) and 25 minute(s): Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [13:35:29] (03CR) 10Vgutierrez: [C:03+2] hiera: Deploy and enable measure cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1167616 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:35:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:36:00] (03PS1) 10DDesouza: Pre-deploy Readers Use Cases Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) [13:36:30] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki metawiki --exceptions countryExceptionMappings.csv [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:33] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:36:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1044.eqiad.wmnet with reason: Maintenance [13:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1044 for upgrade', diff saved to https://phabricator.wikimedia.org/P78829 and previous config saved to /var/cache/conftool/dbconfig/20250709-133639-marostegui.json [13:38:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1044.eqiad.wmnet [13:38:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet [13:39:01] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki officewiki --exceptions countryExceptionMappings.csv [13:39:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:07] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki testwiki --exceptions countryExceptionMappings.csv [13:40:09] jouncebot: nowandnext [13:40:09] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1300) [13:40:09] In 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:17] (03CR) 10Zabe: [C:03+2] ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:40:19] (03CR) 10Zabe: [C:03+2] ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:40:20] (03CR) 10Zabe: [C:03+2] Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:40:22] (03CR) 10Zabe: [C:03+2] Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:40:37] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#10988166 (10Volans) An immediate workaround was implemented in 
https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1167593 that giv... [13:40:59] (03PS11) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [13:41:31] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki test2wiki --exceptions countryExceptionMappings.csv [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:41:55] (03PS1) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:42:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [13:42:08] I am done. 
[13:42:17] (03Merged) 10jenkins-bot: cookbook API: expand argument_task_required docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167442 (owner: 10Volans) [13:42:23] (03CR) 10Jgiannelos: [C:03+1] Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 (owner: 10Hnowlan) [13:42:26] (03PS2) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:42:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:42:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:42:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/intersection] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:42:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/intersection] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: 10Zabe) [13:42:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet [13:43:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1041.eqiad.wmnet [13:43:20] PROBLEM - Host cephosd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:51] (03PS3) 10Hnowlan: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167624 [13:44:54] !log 
tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:44:56] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:44:56] PROBLEM - Host cephosd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:56] PROBLEM - Host cephosd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:09] (03CR) 10Dreamy Jazz: [C:04-1] "As a compromise, could we consider grouping the wikis which don't use `+` somewhere closer to the top of the list? It then makes it cleare" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [13:45:19] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:45:22] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:45:53] !log Depooling chartmuseum in codfw [13:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:04] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=codfw [13:46:28] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [13:46:34] !log deploy measure/measure-goog certs in the upload CDN cluster - T394484 [13:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] T394484: Consider using a dedicated TLS certificate for upload.w.o - https://phabricator.wikimedia.org/T394484 [13:46:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [13:47:57] looks like one pod misbehaving [13:47:58] RECOVERY - Host cephosd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [13:48:13] looking [13:48:24] RECOVERY - Host 
cephosd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [13:48:24] RECOVERY - Host cephosd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [13:48:30] yeah [13:48:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [13:48:38] PROBLEM - Bird Internet Routing Daemon on cephosd2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:48:41] I am going to test a new DNS box alert by stopping the VIP advertisement but keeping the service pooled. no impact is expected. will keep an eye out. [13:48:53] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations: sre.hosts.decommission often leaves dangling things in netbox - https://phabricator.wikimedia.org/T398052#10988206 (10taavi) →14Duplicate dup:03T398412 [13:48:56] 06SRE, 06Infrastructure-Foundations, 10netbox, 10netops: Decom cookbook: delete virtual interfaces from device - https://phabricator.wikimedia.org/T398412#10988208 (10taavi) [13:49:04] hnowlan: delete it? 
[13:49:24] PROBLEM - Bird Internet Routing Daemon on cephosd2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:49:32] claime: just checking it out first [13:49:42] hnowlan: ack, all yours [13:49:55] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host es1044.eqiad.wmnet [13:50:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [13:50:26] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts.*,name=codfw [13:50:47] !log Depooling chartmuseum in eqiad [13:50:47] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7002.wikimedia.org,service=authdns-update [reason: testing alert] [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=helm-charts.*,name=eqiad [13:51:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [13:51:43] (03PS1) 10Marostegui: reboot_es.sh: Reboot standalone external store [software] - 10https://gerrit.wikimedia.org/r/1167626 [13:51:57] (03Merged) 10jenkins-bot: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1167574 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:52:03] (03Merged) 10jenkins-bot: ApiQueryCategoryMembers: Try stop forcing index in read new code [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1167573 (https://phabricator.wikimedia.org/T399037) (owner: 10Zabe) [13:52:05] jclark@cumin1002 reimage (PID 174304) is awaiting input [13:52:05] (03Merged) 10jenkins-bot: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.9) - 
https://gerrit.wikimedia.org/r/1167570 (https://phabricator.wikimedia.org/T398861) (owner: Zabe) [13:52:08] (Merged) jenkins-bot: Fix categorylinks read new code for excluding categories [extensions/intersection] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1167569 (https://phabricator.wikimedia.org/T398861) (owner: Zabe) [13:52:15] yeah, wedged for a while [13:52:19] deleted the pod [13:52:26] recurrence of https://phabricator.wikimedia.org/T374350 [13:52:27] ack [13:52:35] (CR) Marostegui: "Federico, FYI, you can use this to reboot the pending RO hosts in external store." [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:52:37] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167574|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167573|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167570|Fix categorylinks read new code for excluding categories (T398861 T398939)]], [[gerrit:1167569|Fix categorylinks read new code for excluding categories (T39886 [13:52:37] 1 T398939)]] [13:52:38] (CR) Marostegui: [C:+2] reboot_es.sh: Reboot standalone external store [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:52:47] T399037: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T399037 [13:52:47] T398861: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398861 [13:52:48] T398939: DynamicPageList with notcategory producing duplicates - https://phabricator.wikimedia.org/T398939 [13:52:48] T39886: action=mobileview & page=Main_Page & sections=references returns HTTP 500 error - https://phabricator.wikimedia.org/T39886 [13:52:58] (PS2) 
Muehlenhoff: Move docker-report from build2001 to build2002 [puppet] - https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) [13:53:10] (Merged) jenkins-bot: reboot_es.sh: Reboot standalone external store [software] - https://gerrit.wikimedia.org/r/1167626 (owner: Marostegui) [13:53:14] jclark@cumin1002 reimage (PID 172340) is awaiting input [13:53:30] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [13:53:42] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [13:53:43] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=helm-charts.*,name=eqiad [13:53:47] (CR) Hnowlan: [C:+2] Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167624 (owner: Hnowlan) [13:54:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:54:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [13:54:14] !log delete three wedged thumbor pods showing signs of T374350 [13:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:18] T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error - https://phabricator.wikimedia.org/T374350 [13:54:23] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988272 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne... 
[13:54:27] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988273 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [13:54:45] (PS4) Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [13:54:47] !log zabe@deploy1003 zabe: Backport for [[gerrit:1167574|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167573|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167570|Fix categorylinks read new code for excluding categories (T398861 T398939)]], [[gerrit:1167569|Fix categorylinks read new code for excluding categories (T398861 T398939)]] synced [13:54:47] to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:54:51] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org,service=authdns-update [reason: testing alert] [13:55:04] !log sukhe@dns1004 START - running authdns-update [13:55:11] (CR) Elukey: [C:+2] Move docker-report from build2001 to build2002 [puppet] - https://gerrit.wikimedia.org/r/1164219 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff) [13:55:16] (CR) Elukey: [C:+2] profile::docker::reporter: move to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: Elukey) [13:55:39] (Merged) jenkins-bot: Revert "changeprop: don't process File: pages for mobile html pages in PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/1167624 (owner: Hnowlan) [13:55:48] !log sukhe@dns1004 END - running authdns-update [13:55:50] !log zabe@deploy1003 zabe: Continuing with sync [13:56:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [13:56:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [13:56:50] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [13:56:56] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:57:06] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:57:13] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:57:21] (CR) Jforrester: [C:+1] "Design are happy, thank you! 
Let's get this deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: Jon Harald Søby) [13:57:27] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:57:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78835 and previous config saved to /var/cache/conftool/dbconfig/20250709-135732-root.json [13:57:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: Jon Harald Søby) [13:57:39] (PS1) Vgutierrez: hiera: Switch esams/eqsin/drmrs to Let's Encrypt certs [puppet] - https://gerrit.wikimedia.org/r/1167630 (https://phabricator.wikimedia.org/T398596) [13:57:57] PROBLEM - Bird Internet Routing Daemon on cephosd2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:58:06] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:58:37] RECOVERY - Bird Internet Routing Daemon on cephosd2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:58:51] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:58:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78836 and previous config saved to /var/cache/conftool/dbconfig/20250709-135923-root.json [13:59:39] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply 
[13:59:43] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1167630 (https://phabricator.wikimedia.org/T398596) (owner: Vgutierrez) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [14:01:17] (PS1) Elukey: profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 [14:01:20] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167574|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167573|ApiQueryCategoryMembers: Try stop forcing index in read new code (T399037)]], [[gerrit:1167570|Fix categorylinks read new code for excluding categories (T398861 T398939)]], [[gerrit:1167569|Fix categorylinks read new code for excluding categories (T3988 [14:01:20] 61 T398939)]] (duration: 08m 42s) [14:01:28] T399037: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T399037 [14:01:28] T398861: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398861 [14:01:28] T398939: DynamicPageList with notcategory producing duplicates - https://phabricator.wikimedia.org/T398939 [14:01:29] T3988: Search ignores numbers - https://phabricator.wikimedia.org/T3988 [14:01:32] (CR) Ecarg: [C:+2] wikifunctions: Upgrade evaluators from 2025-07-02-122843 to 2025-07-08-183416 [deployment-charts] - https://gerrit.wikimedia.org/r/1167613 (https://phabricator.wikimedia.org/T397355) (owner: Jforrester) [14:01:45] (CR) CI reject: [V:-1] profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:02:38] (CR) Elukey: [V:+1] "PCC SUCCESS 
(NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6215/console" [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:03:03] (PS1) Slyngshede: Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 [14:03:34] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-07-02-122843 to 2025-07-08-183416 [deployment-charts] - https://gerrit.wikimedia.org/r/1167613 (https://phabricator.wikimedia.org/T397355) (owner: Jforrester) [14:04:03] (PS2) Elukey: profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 [14:04:23] RECOVERY - Bird Internet Routing Daemon on cephosd2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:04:29] (CR) CI reject: [V:-1] profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:05:10] (PS10) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [14:05:14] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6216/console" [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:07:26] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [14:08:28] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:06] (PS3) Elukey: profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 [14:09:32] (CR) 
CI reject: [V:-1] profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:09:33] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:30] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:10:49] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:11:37] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:12:18] (CR) Ecarg: [C:+2] wikifunctions: Upgrade orchestrator from 2025-07-02-123323 to 2025-07-09-124522 [deployment-charts] - https://gerrit.wikimedia.org/r/1167614 (https://phabricator.wikimedia.org/T397355) (owner: Jforrester) [14:12:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78837 and previous config saved to /var/cache/conftool/dbconfig/20250709-141238-root.json [14:13:54] (CR) Vgutierrez: [C:+2] hiera,cirrus: Enable IPIP on search*@codfw services [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez) [14:14:03] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-07-02-123323 to 2025-07-09-124522 [deployment-charts] - https://gerrit.wikimedia.org/r/1167614 (https://phabricator.wikimedia.org/T397355) (owner: Jforrester) [14:14:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78838 and previous config saved to /var/cache/conftool/dbconfig/20250709-141428-root.json [14:14:50] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:58] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [14:14:59] (PS1) Zabe: Revert^2 "Enable categorylinks 
read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167637 [14:15:00] (PS4) Elukey: profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 [14:15:31] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:16:20] (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6217/console" [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:16:27] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch-codfw-psi@codfw [14:16:39] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:17:10] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:17:19] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:17:47] (PS1) TChin: [eventstreams] Bump version 0.16.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1167638 (https://phabricator.wikimedia.org/T390140) [14:17:51] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:17:57] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [14:19:27] (CR) CDanis: [C:+1] varnish: remove X-Known-Client netmapper [puppet] - https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur) [14:19:44] (PS11) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [14:21:53] (Abandoned) Elukey: profile::docker::reporter: correctly propagate ensure [puppet] - https://gerrit.wikimedia.org/r/1167632 (owner: Elukey) [14:23:52] !log installing bash 
updates from Bookworm point release [14:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: cirrussearch-codfw-psi@codfw [14:24:08] ^^ FAIL expected :) [14:24:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch-codfw-omega@codfw [14:26:55] (CR) Fabfur: [C:+2] varnish: remove X-Known-Client netmapper [puppet] - https://gerrit.wikimedia.org/r/1167555 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur) [14:27:19] SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10988465 (MoritzMuehlenhoff) [14:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78839 and previous config saved to /var/cache/conftool/dbconfig/20250709-142744-root.json [14:27:56] (CR) JHathaway: [C:+1] reimage: don't stop if FQDN is used instead of hostname (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164147 (owner: Ayounsi) [14:29:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78840 and previous config saved to /var/cache/conftool/dbconfig/20250709-142934-root.json [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1430) [14:30:06] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: cirrussearch-codfw-omega@codfw [14:30:14] ^^ FAIL expected [14:30:17] SRE, Infrastructure-Foundations: Integrate Bookworm 12.11 point update - 
https://phabricator.wikimedia.org/T394489#10988488 (MoritzMuehlenhoff) [14:30:37] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch@codfw [14:33:32] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [14:34:19] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988500 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... [14:34:21] (CR) Filippo Giunchedi: [C:+1] Netbox: add limit to rate [alerts] - https://gerrit.wikimedia.org/r/1167633 (owner: Slyngshede) [14:34:48] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [14:35:50] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [14:37:27] (PS5) Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [14:37:28] (PS3) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [14:37:28] (PS4) Elukey: pyrra: remove multi-dc for wdqs [puppet] - https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) [14:37:50] (CR) Elukey: pyrra: remove multi-dc for istio-based SLOs (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey) [14:38:50] (PS1) Muehlenhoff: Add library hint for abseil [puppet] - https://gerrit.wikimedia.org/r/1167647 [14:38:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [14:39:04] (PS1) Btullis: ceph: Enable the RGW anycast address for the cephosd cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/1167648 (https://phabricator.wikimedia.org/T374923) [14:39:11] (PS2) Muehlenhoff: Add library hint for abseil [puppet] - https://gerrit.wikimedia.org/r/1167647 [14:40:14] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:40:28] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1167648 (https://phabricator.wikimedia.org/T374923) (owner: Btullis) [14:41:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:41:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [14:41:26] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: cirrussearch@codfw [14:41:57] RECOVERY - Bird Internet Routing Daemon on cephosd2003 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:42:12] (CR) Elukey: pyrra: remove multi-dc for istio-based SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey) [14:42:31] (PS3) Vgutierrez: hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) [14:42:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78841 and previous config saved to /var/cache/conftool/dbconfig/20250709-144250-root.json [14:43:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1050.eqiad.wmnet with OS bullseye [14:43:17] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:22] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988523 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne... 
[14:43:42] (PS6) Elukey: pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) [14:43:42] (PS4) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [14:43:42] (PS5) Elukey: pyrra: remove multi-dc for wdqs [puppet] - https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) [14:44:20] (CR) Muehlenhoff: [C:+2] Add library hint for abseil [puppet] - https://gerrit.wikimedia.org/r/1167647 (owner: Muehlenhoff) [14:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78842 and previous config saved to /var/cache/conftool/dbconfig/20250709-144440-root.json [14:44:45] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6220/co" [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey) [14:45:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [14:45:58] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:46:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:42] (CR) Btullis: [V:+1 C:+2] ceph: Enable the RGW anycast address for the cephosd cluster in codfw [puppet] - https://gerrit.wikimedia.org/r/1167648 (https://phabricator.wikimedia.org/T374923) (owner: Btullis) [14:47:31] (CR) Elukey: [V:+1] pyrra: remove 
multi-dc for istio-based SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey) [14:47:55] (CR) Elukey: "To keep archives happy - we agreed to do it after this patch." [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: Elukey) [14:49:13] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003" [14:49:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for rgw.codfw.dpe.anycast.wmnet - cmooney@cumin1003" [14:49:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:21] (PS12) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [14:52:57] PROBLEM - Bird Internet Routing Daemon on cephosd2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:53:23] PROBLEM - Bird Internet Routing Daemon on cephosd2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:53:38] (CR) Mforns: "Code looks good to me!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1148322 (owner: Abban Dunne) [14:54:39] (CR) Vgutierrez: [C:+2] hiera,cirrus: Enable IPIP on search*@eqiad services [puppet] - https://gerrit.wikimedia.org/r/1123653 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez) [14:55:23] RECOVERY - Bird Internet Routing Daemon on cephosd2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:55:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [14:55:57] RECOVERY - Bird Internet Routing Daemon on cephosd2003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:56:13] (CR) Zabe: [C:+2] Revert^2 "Enable categorylinks read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167637 (owner: Zabe) [14:57:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch-eqiad-psi@eqiad [14:57:02] (Merged) jenkins-bot: Revert^2 "Enable categorylinks read new on a few large wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167637 (owner: Zabe) [14:57:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:57:31] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1167637|Revert^2 "Enable categorylinks read new on a few large wikis"]] [14:58:33] (CR) Muehlenhoff: "The SPDX header is no longer visible in the config file compared to the old ERB, but given that SPDX annotates the source and doesn't actu" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [14:59:35] !log zabe@deploy1003 zabe: 
Backport for [[gerrit:1167637|Revert^2 "Enable categorylinks read new on a few large wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:00:20] !log zabe@deploy1003 zabe: Continuing with sync [15:00:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:01:07] (PS18) Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: Fomafix) [15:03:20] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: cirrussearch-eqiad-psi@eqiad [15:04:24] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch-eqiad-omega@eqiad [15:04:34] (PS19) Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: Fomafix) [15:04:55] !log installing abseil security updates [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:42] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167637|Revert^2 "Enable categorylinks read new on a few large wikis"]] (duration: 08m 11s) [15:06:34] (CR) Pppery: Catalog newsletter tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1167252 (https://phabricator.wikimedia.org/T398941) (owner: Pppery) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:10] SRE, 
Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10988652 (MoritzMuehlenhoff) [15:09:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:09:27] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for alias: cirrussearch-eqiad-omega@eqiad [15:09:36] ^^ FAIL expected [15:09:59] (CR) Hnowlan: [C:-1] "This is most of the way there but needs some fixes, noting them for my own follow-up. Proxy behaviour works as expected, header scrubbing " [puppet] - https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: Kamila Součková) [15:10:06] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: cirrussearch@eqiad [15:10:50] (PS1) Volans: CHANGELOG: add changelogs for release v11.3.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1167652 [15:11:26] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v11.3.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1167652 (owner: Volans) [15:12:01] (PS1) Cathal Mooney: Cephosd: add hiera keys for rack-specific BGP peers in codfw [puppet] - https://gerrit.wikimedia.org/r/1167653 (https://phabricator.wikimedia.org/T374923) [15:12:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [15:12:56] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10988676 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [15:13:36] (03CR) 10Btullis: [C:03+1] Cephosd: add hiera keys for rack-specific BGP peers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1167653 (https://phabricator.wikimedia.org/T374923) (owner: 10Cathal Mooney) [15:13:49] (03CR) 10Cathal Mooney: [C:03+2] Cephosd: add hiera keys for rack-specific BGP peers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1167653 (https://phabricator.wikimedia.org/T374923) (owner: 10Cathal Mooney) [15:14:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:14:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:57] 10ops-codfw, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097 (10phaultfinder) 03NEW [15:19:25] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:19:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED 
[15:20:02] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v11.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167652 (owner: 10Volans) [15:20:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:22:11] (03PS1) 10Ssingh: team-traffic: dnsbox: use metrics anycast_healthchecker_service_state [alerts] - 10https://gerrit.wikimedia.org/r/1167659 (https://phabricator.wikimedia.org/T374619) [15:22:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:22:40] (03PS1) 10Volans: Upstream release v11.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1167660 [15:22:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:23:01] (03CR) 10Volans: [C:03+2] Upstream release v11.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1167660 (owner: 10Volans) [15:23:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:23:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: cirrussearch@eqiad [15:25:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:26:35] (03CR) 10Volans: [C:03+2] git::clone: remove remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [15:26:37] (03PS1) 10Fabfur: Spawn separate backend to answer requests for HAProxy 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/1167662 [15:28:56] (03CR) 10JHathaway: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:29:25] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:29:46] (03PS4) 10Hnowlan: hcaptcha: initial commit for proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [15:31:12] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1164432 (https://phabricator.wikimedia.org/T397841) (owner: 10Kamila Součková) [15:31:26] (03Merged) 10jenkins-bot: Upstream release v11.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1167660 (owner: 10Volans) [15:31:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:31:45] (03CR) 10Ssingh: [C:03+1] "Verified the per-site hiera overrides and confirmed magru omission." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167630 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [15:32:14] jclark@cumin1002 provision (PID 372477) is awaiting input [15:32:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:33:02] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bookworm [15:33:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:34:25] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:34:58] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097#10988776 (10cmooney) p:05Triage→03Medium Nothing in the maintenance calendar. Status bouncing on the CRs: ` cmooney@re0.c... 
[15:37:08] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:37:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:38:00] (03PS1) 10Andrew Bogott: cloudceph: upgrade hiera settings from octopus to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1167667 [15:39:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167667 (owner: 10Andrew Bogott) [15:40:24] (03CR) 10CI reject: [V:04-1] cloudceph: upgrade hiera settings from octopus to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1167667 (owner: 10Andrew Bogott) [15:40:50] !log uploaded spicerack_11.3.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [15:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] (03CR) 10BCornwall: [C:03+1] team-traffic: dnsbox: use metrics anycast_healthchecker_service_state [alerts] - 10https://gerrit.wikimedia.org/r/1167659 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:41:40] (03CR) 10Ssingh: [C:03+2] team-traffic: dnsbox: use metrics anycast_healthchecker_service_state [alerts] - 10https://gerrit.wikimedia.org/r/1167659 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:42:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:43:08] (03Merged) 10jenkins-bot: team-traffic: dnsbox: use metrics anycast_healthchecker_service_state [alerts] - 10https://gerrit.wikimedia.org/r/1167659 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [15:43:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:43:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:30] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:50] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:45:16] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:45:48] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:45:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:02] jclark@cumin1002 provision (PID 399394) is awaiting input [15:48:40] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:49:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10988907 (10elukey) @VRiley-WMF Hi! Anything happening on the network on your side? 
We can't really understand how DHCP could fail in that way, unless some... [15:49:48] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:49:49] (03CR) 10JHathaway: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 (owner: 10JHathaway) [15:49:52] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:54] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167668 [15:51:10] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167668 (owner: 10PipelineBot) [15:51:47] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10988917 (10elukey) @Jhancock.wm me and Moritz tried to debug it, there seems to be an issue with UEFI and the new settings, we'll try to work on it asap. 
[15:52:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:52:57] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167668 (owner: 10PipelineBot) [15:53:40] RESOLVED: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:53:48] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:07] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [15:54:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:55:35] (03PS1) 10Hnowlan: wikimedia: add CNAMEs for hcaptcha domains [dns] - 10https://gerrit.wikimedia.org/r/1167669 (https://phabricator.wikimedia.org/T397841) [15:57:43] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [15:59:43] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch esams/eqsin/drmrs to Let's Encrypt certs [puppet] - 10https://gerrit.wikimedia.org/r/1167630 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [15:59:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:00:08] (03CR) 
10JHathaway: "overall looks good, proposed a few suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [16:02:17] !log switching esams, eqsin and drmrs to Let's Encrypt unified/upload certs - T398596 [16:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:21] T398596: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596 [16:04:14] (03PS1) 10Hnowlan: trafficserver, cache: add config for edge routing of hcaptcha [puppet] - 10https://gerrit.wikimedia.org/r/1167670 (https://phabricator.wikimedia.org/T397841) [16:04:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:05:44] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:05:58] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:06] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:07] vgutierrez: we should probably disable 
the monitoring as well I guess [16:06:08] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:23] oh yes [16:06:24] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:06:25] my fault [16:06:30] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:30] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:35] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:06:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:40] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:42] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:46] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:47] ^ expected, no cause for worry [16:06:49] fixing [16:06:50] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:06:58] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:03] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:07:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:10] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required 
stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:13] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:07:34] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:44] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:44] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:44] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:57] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:07:58] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:07:58] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:08] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:08:08] PROBLEM - HAProxy HTTPS 
measure-eqiad.wikimedia.org ECDSA on cp5032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:08] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:10] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:12] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:12] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:18] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:08:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:28] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:30] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:38] PROBLEM - mailman archives on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:08:44] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:44] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:44] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:44] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:44] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:44] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:49] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:08:50] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response 
https://wikitech.wikimedia.org/wiki/HTTPS [16:08:58] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:08:58] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:09:08] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:09:11] (03PS9) 10JHathaway: reimage: add support for using the host UUID for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1164317 [16:09:38] patch coming for the above [16:09:40] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3074 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:09:42] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:09:58] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3074 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:10:07] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6015 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:07] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5032 is OK: SSL OK - Certificate 
measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:07] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6002 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:07] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5027 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:07] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5030 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:08] sukhe: patch? 
[16:10:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6004 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6011 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6010 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6014 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5024 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:11] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5032 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:15] PROBLEM - mailman list info ssl expiry on lists1004 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5019 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:27] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5028 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:31] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3079 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:10:31] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6008 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:31] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6006 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:31] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5025 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:33] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5029 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required 
SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6002 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:39] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6001 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6009 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3074 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6003 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6012 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3079 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:10:41] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6003 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:42] vgutierrez: I was about to set ocsp: false [16:10:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:45] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5031 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5017 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate 
*.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:45] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5029 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:45] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5025 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5022 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5020 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:47] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5030 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:47] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5026 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:48] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6005 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required 
SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:51] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6007 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:51] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5026 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:57] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3074 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6001 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp6013 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:59] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp5027 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2025-09-15 07:57:34 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:59] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5031 is OK: SSL OK - Certificate 
measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:10:59] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp5028 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:11:05] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:31] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:11:33] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [16:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:13:31] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3081 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required 
SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:13:31] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3079 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:13:33] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3081 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:13:41] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp3079 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:15:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:17:10] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bookworm [16:17:17] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp6004 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2025-10-07 06:16:04 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:17:24] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:17:46] (03PS4) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 
(https://phabricator.wikimedia.org/T394484) [16:17:47] (03PS1) 10Andrew Bogott: get_images: use getReplicaServer.pyp instead of getSlaveServer.php [wikitech-static] - 10https://gerrit.wikimedia.org/r/1167671 [16:18:36] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] get_images: use getReplicaServer.pyp instead of getSlaveServer.php [wikitech-static] - 10https://gerrit.wikimedia.org/r/1167671 (owner: 10Andrew Bogott) [16:20:06] (03PS5) 10Vgutierrez: hiera: Switch to upload cert on upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) [16:20:07] (03PS2) 10Andrew Bogott: cloudceph: upgrade hiera settings from octopus to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1167667 [16:21:09] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:24:17] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (-246 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [16:25:11] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10989042 (10elukey) @Jhancock.wm is there another node that we can test provision on? For example 2044 or 2045, just to understand if it is a problem of 2043 or not. Lem... [16:26:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10989045 (10Eevans) >>! In T396970#10965457, @VRiley-WMF wrote: > Is there a time when we can plan for me to look and try to swap at least one of those drives? I'll need to power down the unit to see where thos... 
[16:31:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107 (10STran) 03NEW [16:32:09] (03PS1) 10Ssingh: P:cache::haproxy and C:haproxy: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167674 (https://phabricator.wikimedia.org/T370821) [16:33:12] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10989096 (10Volans) Are we sure that the network card is properly installed? I'm getting this from racadm: ` racadm>>get nic.nicconfig ERROR: SWC0244 : Invalid Fully Qu... [16:34:31] (03PS1) 10BCornwall: site: Set lvs1017 to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1167675 (https://phabricator.wikimedia.org/T387145) [16:37:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989102 (10elukey) Full stack trace: ` 2025-07-09 12:36:49,306 jclark 138654 [INFO] Completed command 'puppet lookup --render-as s... 
[16:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:50:09] !log bking@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for wdqs2023.codfw.wmnet: Renew puppet certificate - bking@cumin1002 [16:52:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T1700) [17:02:54] (03CR) 10Vgutierrez: [C:03+1] site: Set lvs1017 to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1167675 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [17:11:05] (03CR) 10Vgutierrez: "this will break deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/1167674 (https://phabricator.wikimedia.org/T370821) (owner: 10Ssingh) [17:21:23] (03CR) 10Ssingh: "Yeah thanks for the reminder about that too. I am just going to break this up into multiple patches and then fix that along the way." [puppet] - 10https://gerrit.wikimedia.org/r/1167674 (https://phabricator.wikimedia.org/T370821) (owner: 10Ssingh) [17:29:54] (03CR) 10BCornwall: [C:03+2] site: Set lvs1017 to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1167675 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [17:33:25] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10989307 (10KFrancis) Please provide an email address for @sowmya.guru. If it's preferred not to post it here, please send it to kfrancis@wikimedia.org. Thanks! 
[17:34:21] !log re-enabling Puppet on P{ganeti7002* or ganeti7003*}: it was left disabled there during rollout of CR 1166222 by sukhe [17:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:01] (03Abandoned) 10Ssingh: P:cache::haproxy and C:haproxy: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167674 (https://phabricator.wikimedia.org/T370821) (owner: 10Ssingh) [17:42:27] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10989328 (10KFrancis) Please disregard the last message. The NDA has been sent for signatures. I'll confirm when it's complete. Thanks! [17:49:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10989357 (10ssingh) On the DNS hosts as of today, we have an alert in place if we detect a mismatch between the service state as defined by confd/confct... [17:59:14] (03CR) 10Muehlenhoff: "Thanks for these! I've added all edits." [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [17:59:57] (03PS13) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [18:00:34] (03CR) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [18:01:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [18:05:04] (03CR) 10Ssingh: [C:03+1] "Verified sites including keeping -goog for magru." 
[puppet] - 10https://gerrit.wikimedia.org/r/1165842 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [18:23:05] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [18:23:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10989461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye [18:23:25] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade [18:23:43] (03PS1) 10Ssingh: P:cache::haproxy: remove obsolete do_ocsp [puppet] - 10https://gerrit.wikimedia.org/r/1167686 (https://phabricator.wikimedia.org/T399114) [18:25:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10989482 (10BCornwall) I've updated lvs1017's BIOS and Mellanox firmware to the latest versions (2.23.0 and 16.35.30.06) prior to reimaging [18:25:53] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6223/console" [puppet] - 10https://gerrit.wikimedia.org/r/1167686 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [18:33:22] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Upgrade [18:33:27] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10989491 (10Dzahn) That makes sense to me. Sounds good. Thank you! 
[18:33:31] (03CR) 10BCornwall: [C:03+1] P:cache::haproxy: remove obsolete do_ocsp [puppet] - 10https://gerrit.wikimedia.org/r/1167686 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [18:35:55] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Upgrade [18:36:42] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10989496 (10Dzahn) Gotcha! Thanks @SKivlehan-WMF ! I did not see a new ticket. We can just reuse this one for the new request. I am adding the group requ... [18:36:52] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10989497 (10Dzahn) [18:37:05] (03PS1) 10Ssingh: hiera: disable OCSP for GTS certs [puppet] - 10https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) [18:38:17] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167687 (https://phabricator.wikimedia.org/T399079) (owner: 10Ssingh) [18:38:44] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10989503 (10Dzahn) 05Stalled→03In progress [18:39:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10989505 (10VRiley-WMF) Understood, thanks for pointing that out! I'm working on this now [18:39:25] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10989504 (10Dzahn) Once access works for you feel free to just change the status to "resolved" here. 
[18:40:54] !log removing ocsp from deployment-prep: commit 3307286c7d18827b87231b61652efbaf0e3ba4c8: T399114 [18:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:57] T399114: Remove OCSP monitoring and related bits - https://phabricator.wikimedia.org/T399114 [18:42:31] !log re-adding ocsp from deployment-prep: commit 3307286c7d18827b87231b61652efbaf0e3ba4c8: T399114: will remove after Puppet removal [18:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:57] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Upgrade [18:47:58] (03PS1) 10Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) [18:49:16] (03PS2) 10Acamicamacaraca: shwiki: Add bs, hr and sr as import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) [18:50:04] (03CR) 10CI reject: [V:04-1] prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: 10Jcrespo) [18:50:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) (owner: 10Acamicamacaraca) [18:50:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) (owner: 10Acamicamacaraca) [18:52:02] (03PS2) 10Jcrespo: prometheus: Proof of concept of a 
nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) [18:56:24] (03CR) 10Jcrespo: "What would you think about going this way to speed up the migration? Or more specifically, being able to remove icinga while keeping the n" [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) (owner: 10Jcrespo) [18:57:46] (03PS1) 10TChin: Revert "services: mw-page-content-change-enrich: version bump image." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167694 [18:59:39] (03CR) 10Edgar Allan Poe: [C:03+1] "Seems fine. Support." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) (owner: 10Acamicamacaraca) [19:02:17] (03PS1) 10Ssingh: P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) [19:02:26] (03CR) 10Gmodena: [C:03+2] Revert "services: mw-page-content-change-enrich: version bump image." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167694 (owner: 10TChin) [19:04:08] (03Merged) 10jenkins-bot: Revert "services: mw-page-content-change-enrich: version bump image." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1167694 (owner: 10TChin) [19:04:17] (03CR) 10Gmodena: [C:03+1] [eventstreams] Bump version 0.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167638 (https://phabricator.wikimedia.org/T390140) (owner: 10TChin) [19:09:21] (03PS3) 10Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) [19:10:13] (03PS4) 10Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) [19:10:41] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:11:03] (03PS5) 10Jcrespo: prometheus: Proof of concept of a nrpe to prometheus translation wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1167691 (https://phabricator.wikimedia.org/T288622) [19:11:05] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: upgrade hiera settings from octopus to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1167667 (owner: 10Andrew Bogott) [19:11:28] (03PS2) 10Ssingh: P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) [19:12:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:16:01] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:16:02] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [19:16:04] !log tchin@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [19:16:27] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [19:16:33] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:16:46] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:16:49] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:17:25] (03CR) 10TChin: [C:03+2] [eventstreams] Bump version 0.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167638 (https://phabricator.wikimedia.org/T390140) (owner: 10TChin) [19:17:37] PROBLEM - Disk space on an-worker1082 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 186018 MB (4% inode=99%): /var/lib/hadoop/data/f 142765 MB (3% inode=99%): /var/lib/hadoop/data/e 122870 MB (3% inode=99%): /var/lib/hadoop/data/d 224507 MB (5% inode=99%): /var/lib/hadoop/data/b 165328 MB (4% inode=99%): /var/lib/hadoop/data/m 250251 MB (6% inode=99%): /var/lib/hadoop/data/c 152420 MB (4% inode=99%): /var/lib/hadoop/data [19:17:37] 5 MB (6% inode=99%): /var/lib/hadoop/data/l 203099 MB (5% inode=99%): /var/lib/hadoop/data/g 181673 MB (4% inode=99%): /var/lib/hadoop/data/h 197637 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops [19:18:17] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 9 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [19:19:05] (03Merged) 10jenkins-bot: [eventstreams] Bump version 0.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167638 (https://phabricator.wikimedia.org/T390140) (owner: 
10TChin) [19:21:14] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:24:23] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host es1047.eqiad.wmnet with OS bookworm [19:24:36] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10989659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host es1047.eqiad.wmnet with OS bookworm [19:25:32] (03PS1) 10Ssingh: nagios_common: remove check_ssl_cdn_ocsp* [puppet] - 10https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) [19:29:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:29:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:42:57] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage [19:43:26] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [19:43:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10989724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL**) - Removed f... [19:47:08] (03CR) 10Dzahn: "Thank you for this, Antoine. Appreciate it." 
[container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [19:47:17] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage [19:49:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:49:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:51:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [19:51:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [19:51:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [19:51:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989742 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... 
[19:51:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [19:51:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... [19:52:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [19:53:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T2000). [20:00:05] danisztls, James_F, and Aca: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] * James_F waves. [20:00:13] *waves* [20:00:17] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:21] Anyone planning to deploy? 
[20:01:07] (03CR) 10Dzahn: [C:03+2] rename build pipelines for sourcebot [container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:01:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:02:37] Fine, I'll do it. [20:03:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) (owner: 10Acamicamacaraca) [20:03:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: 10Jon Harald Søby) [20:04:19] (03Merged) 10jenkins-bot: shwiki: Add bs, hr and sr as import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167688 (https://phabricator.wikimedia.org/T399113) (owner: 10Acamicamacaraca) [20:04:21] (03Merged) 10jenkins-bot: Remove white outline from Wikifunctions favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153557 (https://phabricator.wikimedia.org/T326094) (owner: 10Jon Harald Søby) [20:04:44] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1167688|shwiki: Add bs, hr and sr as import sources (T399113)]], [[gerrit:1153557|Remove white outline from Wikifunctions favicon (T326094)]] [20:04:49] T399113: shwiki: Add bs, hr and sr as import sources - https://phabricator.wikimedia.org/T399113 [20:04:50] T326094: Wikifunctions favicon has a white outline - https://phabricator.wikimedia.org/T326094 [20:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:05:16] (03CR) 10Dzahn: "I dont want to block you just because of the host names. We can merge this and replace them afterwards with variables." [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [20:06:53] !log jforrester@deploy1003 jforrester, jhsoby, aleksandar: Backport for [[gerrit:1167688|shwiki: Add bs, hr and sr as import sources (T399113)]], [[gerrit:1153557|Remove white outline from Wikifunctions favicon (T326094)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:02] checkin [20:07:05] Aca: Thanks! [20:07:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [20:08:11] looks good to me, new import sources are now available in Special:Import [20:08:16] !log jforrester@deploy1003 jforrester, jhsoby, aleksandar: Continuing with sync [20:08:19] Excellent. [20:09:30] (03CR) 10Dzahn: "I am a bit conflicted about this." [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [20:09:30] danisztls isn't present to check their survey. Is anyone else around from the Readers teams to check deployment? [20:10:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [20:10:20] James_F: I'm present [20:10:33] (03CR) 10Dzahn: "@Paladox you created this ticket 10 years ago:)" [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [20:10:40] using different IRC client, sorry [20:10:44] dani: Aha, cool. 
[20:10:49] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:10:53] vriley@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [20:10:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:10:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1047.eqiad.wmnet with OS bookworm [20:10:56] Once this one is done I'll do yours, then. :-) [20:11:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10989809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host es1047.eqiad.wmnet with OS bookworm completed: - es1047 (**WARN**) - Down... [20:11:13] James_F: ok [20:11:41] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:36] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167688|shwiki: Add bs, hr and sr as import sources (T399113)]], [[gerrit:1153557|Remove white outline from Wikifunctions favicon (T326094)]] (duration: 08m 52s) [20:13:41] T399113: shwiki: Add bs, hr and sr as import sources - https://phabricator.wikimedia.org/T399113 [20:13:41] T326094: Wikifunctions favicon has a white outline - https://phabricator.wikimedia.org/T326094 [20:13:43] looking at the registry2004. anyone working on that? 
[20:13:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [20:13:57] !log jforrester@deploy1003:~$ echo 'https://en.wikipedia.org/static/favicon/wikifunctions.ico' | mwscript-k8s --attach purgeList.php -- --wiki enwiki # T326094 [20:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:05] thanks for the deploy :) [20:14:41] Aca: Of course. Happy hacking! [20:14:44] just a side quest, does anyone know how can I rename / change my "full name" of Gerrit account [20:16:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [20:16:56] Aca: I don't think you easily can, unfortunately. If you create a new account and transfer your associated e-mail on wikitech that might do it? I'd post on wikitech to ask. [20:17:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [20:17:16] (03Merged) 10jenkins-bot: Pre-deploy Readers Use Cases Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167622 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [20:17:38] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1167622|Pre-deploy Readers Use Cases Survey on enwiki (T398870)]] [20:17:42] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [20:19:18] James_F Okay, do let me know if there are new findings about that. [20:19:46] !log jforrester@deploy1003 jforrester, dani: Backport for [[gerrit:1167622|Pre-deploy Readers Use Cases Survey on enwiki (T398870)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:20:46] (03CR) 10Andriy.v: "I'm agree with this solution. Cases without `+` should be added right below default with a comment to make clear that without `+` is used " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [20:21:19] danisztls: Can you confirm in debug if the code works? [20:22:59] James_F: yes [20:23:25] !log jforrester@deploy1003 jforrester, dani: Continuing with sync [20:23:27] Cool. [20:23:35] James_F: actually [20:23:39] Oh. [20:23:46] Aca: it's very complicated because the "database" for gerrit users is in git itself. and then gerrit uses the "cn" field in LDAP. "Gerrit uses the ldap cn as the user's username and full name. " [20:23:49] not seeing it in k8s-mwdebug [20:24:11] danisztls: How can you test this? [20:24:37] Aca: if you dont have a lot of history it's by far easier to just create a new developer account [20:24:37] mhm mhm, interesting [20:24:56] James_F: it is working [20:25:02] danisztls: Good. [20:25:38] url param for survey name was case sensitive [20:25:53] Ha. Of course. [20:25:54] https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users [20:26:19] Yeah, understandable. Thanks for this insight. [20:26:34] yea.. https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users/Gerrit .. but it was causing more problems [20:27:11] okii [20:28:39] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167622|Pre-deploy Readers Use Cases Survey on enwiki (T398870)]] (duration: 11m 00s) [20:28:43] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [20:28:48] danisztls: OK, all done. [20:28:55] Backport window complete. [20:29:27] James_F: Thanks! 
[20:30:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:31:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:31:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bullseye [20:31:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmne... [20:31:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10989858 (10BCornwall) Hi, @VRiley-WMF, I'm unable to get lvs1017 to PXE boot - I'm getting `media test failure` errors that advise checking the cables. I'm able to ping the connected switch (lsw1-e... 
[20:33:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:34:11] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:35:25] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:36:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:36:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [20:36:13] (03CR) 10Dzahn: "Hello, we got alerts:" [puppet] - 10https://gerrit.wikimedia.org/r/1166213 (owner: 10Clément Goubert) [20:36:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmne... 
[20:36:34] (03CR) 10BCornwall: [C:03+1] nagios_common: remove check_ssl_cdn_ocsp* [puppet] - 10https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [20:36:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989891 (10Jclark-ctr) [20:37:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10989902 (10Jclark-ctr) 05Open→03Resolved [20:38:46] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6227/console" [puppet] - 10https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [20:43:38] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6228/console" [puppet] - 10https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [20:46:41] (03PS20) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [20:49:14] Anyone around? Could I add a change to the current backport window? 
[20:50:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host es1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:50:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [20:54:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#10989963 (10Jclark-ctr) a:05Jclark-ctr→03BTullis [20:55:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:59:01] Winston_Sung: Do you have a way to validate the change? [20:59:22] If so, I can run that backport for you [20:59:31] s/backport/config change deployment/ [20:59:32] vriley@cumin1002 reimage (PID 604844) is awaiting input [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T2100) [21:02:30] I have WikimediaDebug installed on browser. [21:02:48] ok. Ready then? [21:03:07] I assume we are talking about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/558052 by the way [21:03:29] Sure. [21:03:43] Alright. Pressing the button. 
[21:03:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [21:04:50] (03Merged) 10jenkins-bot: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [21:05:13] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:558052|Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap (T248352)]] [21:05:16] T248352: Move deprecated language codes from MediaWiki to WMF configuration - https://phabricator.wikimedia.org/T248352 [21:05:48] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host es1048.eqiad.wmnet with OS bookworm [21:06:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10989997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host es1048.eqiad.wmnet with OS bookworm [21:07:19] !log dancy@deploy1003 dancy, fomafix: Backport for [[gerrit:558052|Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap (T248352)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:07:33] Verifying... [21:08:56] 10ops-codfw, 06DC-Ops: Alert for device lsw1-b7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T399136 (10phaultfinder) 03NEW [21:09:53] Verified; working as expected with difference checked. 
[21:10:11] Excellent [21:10:15] !log dancy@deploy1003 dancy, fomafix: Continuing with sync [21:15:39] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:558052|Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap (T248352)]] (duration: 10m 25s) [21:15:42] T248352: Move deprecated language codes from MediaWiki to WMF configuration - https://phabricator.wikimedia.org/T248352 [21:16:19] Winston_Sung: Done! [21:16:25] Thanks! [21:24:31] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1048.eqiad.wmnet with reason: host reimage [21:28:14] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1048.eqiad.wmnet with reason: host reimage [21:45:58] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:46:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:46:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1048.eqiad.wmnet with OS bookworm [21:46:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10990071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host es1048.eqiad.wmnet with OS bookworm completed: - es1048 (**WARN**) - Down... 
[22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T2200) [22:10:14] (03PS2) 10Dreamy Jazz: ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [22:10:35] (03CR) 10Dreamy Jazz: "I've made the changes to the patch to do this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [22:11:58] (03PS3) 10Dreamy Jazz: ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [22:12:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:12:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:14:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:16:29] jouncebot: nowandnext [22:16:29] For the next 0 hour(s) and 43 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250709T2200) [22:16:30] In 7 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600) [22:16:30] In 7 hour(s) and 43 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250710T0600) [22:17:37] PROBLEM - Disk space on an-worker1082 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 179276 MB (4% inode=99%): /var/lib/hadoop/data/f 169900 MB (4% inode=99%): /var/lib/hadoop/data/e 220689 MB (5% inode=99%): /var/lib/hadoop/data/b 236430 MB (6% inode=99%): /var/lib/hadoop/data/m 252330 MB (6% inode=99%): /var/lib/hadoop/data/c 209982 MB (5% inode=99%): /var/lib/hadoop/data/l 165336 MB (4% inode=99%): /var/lib/hadoop/data [22:17:37] 3 MB (3% inode=99%): /var/lib/hadoop/data/h 209350 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1082&var-datasource=eqiad+prometheus/ops [22:17:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [22:18:28] (03Merged) 10jenkins-bot: ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [22:18:50] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1166499|ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T398738)]] [22:18:54] T398738: Update ukwiki setting for group to assign temporary-account-viewer group - https://phabricator.wikimedia.org/T398738 [22:21:00] !log dreamyjazz@deploy1003 dreamyjazz, dreamrimmer: Backport for [[gerrit:1166499|ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T398738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [22:23:43] !log dreamyjazz@deploy1003 dreamyjazz, dreamrimmer: Continuing with sync [22:25:44] (03PS1) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 [22:26:10] (03CR) 10CI reject: [V:04-1] Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (owner: 10Andrew Bogott) [22:29:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166499|ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group (T398738)]] (duration: 10m 18s) [22:29:12] T398738: Update ukwiki setting for group to assign temporary-account-viewer group - https://phabricator.wikimedia.org/T398738 [22:33:04] (03PS2) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 [22:33:30] (03CR) 10CI reject: [V:04-1] Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (owner: 10Andrew Bogott) [22:35:31] (03PS3) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 [22:38:28] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (owner: 10Andrew Bogott) [22:50:08] (03PS4) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 [22:50:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (owner: 10Andrew Bogott) [22:53:52] (03PS3) 10Cwhite: logstash: scale gitlab durations to expected unit (ns) [puppet] - 10https://gerrit.wikimedia.org/r/1164523 (https://phabricator.wikimedia.org/T234565) [22:54:10] (03PS5) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 [22:55:38] (03CR) 10Andrew Bogott: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167708 (owner: 10Andrew Bogott) [22:56:49] (03CR) 10Cwhite: [C:03+2] logstash: scale gitlab durations to expected unit (ns) [puppet] - 10https://gerrit.wikimedia.org/r/1164523 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:14:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:17:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:17:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:21:51] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm [23:23:22] (03PS6) 10Andrew Bogott: Cloudcephosd1048: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1167708 (https://phabricator.wikimedia.org/T395910) [23:30:06] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167309 (owner: 10TrainBranchBot) [23:38:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167711 [23:38:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167711 (owner: 
10TrainBranchBot) [23:50:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1167711 (owner: 10TrainBranchBot)