[00:00:20] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:00:22] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:00:45] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:02:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1056622 (owner: 10TrainBranchBot) [00:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:44] (03CR) 10RLazarus: [C:03+1] "🚱" [puppet] - 10https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [00:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:32:01] (03PS1) 10Zabe: Further configs for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056632 (https://phabricator.wikimedia.org/T370913) [00:32:42] (03CR) 10CI 
reject: [V:04-1] Further configs for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056632 (https://phabricator.wikimedia.org/T370913) (owner: 10Zabe) [00:32:58] (03PS2) 10Zabe: Further configs for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056632 (https://phabricator.wikimedia.org/T370913) [00:34:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056632 (https://phabricator.wikimedia.org/T370913) (owner: 10Zabe) [00:34:48] (03Merged) 10jenkins-bot: Further configs for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056632 (https://phabricator.wikimedia.org/T370913) (owner: 10Zabe) [00:35:26] !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1056632|Further configs for cswikivoyage (T370913)]] [00:35:34] T370913: Post-creation work for cswikivoyage - https://phabricator.wikimedia.org/T370913 [00:37:53] !log zabe@deploy1002 zabe: Backport for [[gerrit:1056632|Further configs for cswikivoyage (T370913)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:39:04] !log zabe@deploy1002 zabe: Continuing with sync [00:43:48] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1056632|Further configs for cswikivoyage (T370913)]] (duration: 08m 22s) [00:43:53] T370913: Post-creation work for cswikivoyage - https://phabricator.wikimedia.org/T370913 [00:56:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:18] 06SRE, 10DNS, 06Traffic: DKIM Key to Public DNS (Dayforce) - https://phabricator.wikimedia.org/T370961#10013516 (10ssingh) Reached out to @APaul-WMF for a clarification; will update the task with more information. [01:32:09] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#10013538 (10Dzahn) 05Stalled→03In progress Thanks. signature confirmed. I think it's -because- you have been employee for that long. You got the rights a little before that f... [01:33:12] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#10013540 (10Dzahn) a:05Milimetric→03None [01:34:02] (03PS1) 10Dzahn: admin: add milimetric to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1056635 (https://phabricator.wikimedia.org/T365074) [01:37:54] (03PS2) 10Dzahn: admin: add milimetric to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1056635 (https://phabricator.wikimedia.org/T365074) [01:38:51] (03PS3) 10Dzahn: admin: add milimetric to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1056635 (https://phabricator.wikimedia.org/T365074) [01:40:07] (03CR) 10Dzahn: [C:03+2] admin: add milimetric to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1056635 (https://phabricator.wikimedia.org/T365074) (owner: 10Dzahn) [01:47:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#10013554 (10Dzahn) 05In 
progress→03Resolved a:03Dzahn Hey @Milimetric code change deployed. Ran puppet on `cassandra-dev200[1-3].codfw.wmnet`.... [01:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:20] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:20] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window 
MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T0600) [06:00:05] marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T0600). Please do the needful. [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:41] (03PS3) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [06:39:41] (03PS1) 10Ayounsi: Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) [06:40:14] (03CR) 10CI reject: [V:04-1] Netbox 4: 
add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [06:40:48] (03Abandoned) 10Ayounsi: Netbox report timers: run as sre_bot user [puppet] - 10https://gerrit.wikimedia.org/r/1055957 (owner: 10Ayounsi) [06:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:45:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:48:45] (03PS2) 10Ayounsi: Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) [06:48:45] (03PS4) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [06:51:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66920 and previous config saved to /var/cache/conftool/dbconfig/20240725-065159-marostegui.json [06:52:04] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [06:52:10] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [06:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:23] (03PS1) 10Ayounsi: Re-add BGP sessions to KPN in esams [homer/public] - 10https://gerrit.wikimedia.org/r/1056789 (https://phabricator.wikimedia.org/T322630) [07:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:39] (03CR) 10Ayounsi: [C:03+2] Re-add BGP sessions to KPN in esams [homer/public] - 10https://gerrit.wikimedia.org/r/1056789 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [07:06:08] (03Merged) 10jenkins-bot: Re-add BGP sessions to KPN in esams [homer/public] - 10https://gerrit.wikimedia.org/r/1056789 (https://phabricator.wikimedia.org/T322630) (owner: 10Ayounsi) [07:06:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:07:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P66921 and previous config saved to /var/cache/conftool/dbconfig/20240725-070706-marostegui.json [07:12:35] (03PS3) 10Ayounsi: Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) [07:12:35] (03PS5) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 
10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [07:14:02] !log add transit BGP session to KPN in esams [07:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:32] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:22:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P66922 and previous config saved to /var/cache/conftool/dbconfig/20240725-072213-marostegui.json [07:26:38] (03CR) 10Brouberol: growthbook: replace ferretdb by mongo itself (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [07:26:45] (03CR) 10Brouberol: [C:03+2] growthbook: replace ferretdb by mongo itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [07:31:25] I'll deploy cxserver, no patches in backport window it seems. 
[07:35:00] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [07:35:32] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [07:35:59] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [07:36:30] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [07:36:46] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [07:37:03] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [07:37:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66923 and previous config saved to /var/cache/conftool/dbconfig/20240725-073720-marostegui.json [07:37:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [07:37:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:37:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [07:37:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T367856)', diff saved to https://phabricator.wikimedia.org/P66924 and previous config saved to /var/cache/conftool/dbconfig/20240725-073742-marostegui.json [07:38:30] !log Updated cxserver to 2024-07-22-050142-production (T363968) [07:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:35] T363968: Migrate cxserver in production to node20 - https://phabricator.wikimedia.org/T363968 [07:39:26] (03PS1) 10Filippo Giunchedi: site: add insetup for prometheus[12]00[78] [puppet] - 10https://gerrit.wikimedia.org/r/1056868 (https://phabricator.wikimedia.org/T370426) [07:49:20] FIRING: [2x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:03] (03CR) 10Fabfur: [C:03+1] "Ok for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [08:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:54] (03CR) 10Hashar: package_builder: don't install python-all on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1049180 (https://phabricator.wikimedia.org/T367544) (owner: 10Jelto) [08:42:48] (03PS1) 10Alexandros Kosiaris: Revert "deploy1003: Comment them out from scap_masters" [puppet] - 10https://gerrit.wikimedia.org/r/1056871 (https://phabricator.wikimedia.org/T364417) [08:43:13] (03CR) 10CI reject: [V:04-1] Revert "deploy1003: Comment them out from scap_masters" [puppet] - 10https://gerrit.wikimedia.org/r/1056871 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [08:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:44:38] (03PS2) 10Alexandros Kosiaris: Revert "deploy1003: Comment them out from 
scap_masters" [puppet] - 10https://gerrit.wikimedia.org/r/1056871 (https://phabricator.wikimedia.org/T364417) [08:45:52] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert "deploy1003: Comment them out from scap_masters" [puppet] - 10https://gerrit.wikimedia.org/r/1056871 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [09:01:00] (03CR) 10Elukey: Netbox 4: add version flag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:01:49] (03PS5) 10Giuseppe Lavagetto: puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 [09:01:49] (03PS6) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [09:01:49] (03PS1) 10Giuseppe Lavagetto: haproxy: add confd_file define [puppet] - 10https://gerrit.wikimedia.org/r/1056875 (https://phabricator.wikimedia.org/T370745) [09:01:51] (03PS1) 10Giuseppe Lavagetto: haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) [09:03:14] (03PS1) 10Alexandros Kosiaris: deployment: Switch master deployment host to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1056878 (https://phabricator.wikimedia.org/T364417) [09:03:25] (03PS1) 10Brouberol: growthbook: fix mongo connection string and update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056879 (https://phabricator.wikimedia.org/T365839) [09:04:18] (03PS3) 10Ayounsi: Cookbooks: fix Netbox 4 breaking changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) [09:10:12] (03CR) 10Elukey: [C:03+2] puppetmaster: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [09:13:17] (03CR) 10Elukey: [V:03+1 C:03+2] 
Move the dump_cloud_ip_ranges etcd upload to puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:17:27] (03PS1) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) [09:19:45] !log move dump_cloud_ip_ranges from puppetmaster1001 to puppetserver1001 - T368023 [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:50] T368023: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023 [09:21:11] (03PS4) 10Ayounsi: Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) [09:21:11] (03PS6) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [09:21:46] (03CR) 10Ayounsi: Netbox 4: add version flag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:23:47] (03CR) 10Brouberol: [C:03+2] growthbook: fix mongo connection string and update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056879 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) 
[09:25:36] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:26:11] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:26:43] (03CR) 10Btullis: [C:03+1] "This looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:27:02] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:58] (03CR) 10Btullis: [C:03+2] Add MPIC service port [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [09:32:09] (03CR) 10Elukey: admin: add dcops to the system adm POSIX group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:33:20] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:33:30] (03PS1) 10Btullis: Add the new presto servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1056886 (https://phabricator.wikimedia.org/T370543) [09:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:33] (03PS1) 10Brouberol: growthbook: deploy mongodb with auth enabled 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1056887 (https://phabricator.wikimedia.org/T365839) [09:39:13] (03PS1) 10Elukey: profile::docker::reporter::report: add min_debian_version arg [puppet] - 10https://gerrit.wikimedia.org/r/1056888 (https://phabricator.wikimedia.org/T367427) [09:39:47] (03PS2) 10Elukey: profile::docker::reporter::report: add min_debian_version arg [puppet] - 10https://gerrit.wikimedia.org/r/1056888 (https://phabricator.wikimedia.org/T367427) [09:41:52] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3423/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056888 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [09:43:01] (03CR) 10Elukey: [C:03+1] Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:44:26] (03CR) 10Filippo Giunchedi: [C:03+1] Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [09:45:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:06] (03CR) 10Alexandros Kosiaris: [C:03+1] profile::docker::reporter::report: add min_debian_version arg [puppet] - 10https://gerrit.wikimedia.org/r/1056888 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [09:48:58] akosiaris: <3 [09:49:15] (03CR) 10Elukey: [V:03+1 C:03+2] profile::docker::reporter::report: add min_debian_version arg [puppet] - 10https://gerrit.wikimedia.org/r/1056888 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [09:49:20] FIRING: [2x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:58] (03PS2) 10Alexandros Kosiaris: mesh: Bump from configuration 1.8.0 to 1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053009 [09:54:58] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:56:17] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh: Bump from configuration 1.8.0 to 1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053009 (owner: 10Alexandros Kosiaris) [09:57:03] (03CR) 10Ayounsi: [C:03+2] Netbox 4: add version flag [puppet] - 10https://gerrit.wikimedia.org/r/1056785 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:57:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:57:39] (03PS1) 10Hnowlan: mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) [09:58:19] (03Merged) 10jenkins-bot: mesh: Bump from configuration 1.8.0 to 1.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053009 (owner: 10Alexandros Kosiaris) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1000) [10:00:18] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Uncordoning kubernetes1051 - T369011] [10:00:28] T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011 [10:01:13] (03PS1) 10Cathal Mooney: Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) 
[10:02:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10014019 (10Clement_Goubert) 05Open→03Resolved Host BGP re-enabled, back in `Active` status and uncordo... [10:07:33] (03CR) 10Ayounsi: [C:03+1] Add an-redacteddb to list of hosts that do not get IPv6 records [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056892 (https://phabricator.wikimedia.org/T365453) (owner: 10Cathal Mooney) [10:07:47] (03PS4) 10Clément Goubert: mwdebug: Add hosts to testserver pool [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) [10:13:20] (03CR) 10Clément Goubert: [C:03+2] sre.mediawiki.restart-appservers: Remove legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1056470 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:17:13] (03Merged) 10jenkins-bot: sre.mediawiki.restart-appservers: Remove legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1056470 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:17:40] (03CR) 10Clément Goubert: [C:03+2] sre.mediawiki.route-traffic: Use switchdc defined services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056471 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:17:40] (03PS3) 10Hnowlan: mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [10:18:54] (03CR) 10CI reject: [V:04-1] mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:19:17] (03PS1) 10Elukey: Release 0.0.15 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1056896 [10:19:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:24] (03CR) 10Elukey: "Related change: https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/1054845" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1056896 (owner: 10Elukey) [10:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:36] (03Merged) 10jenkins-bot: sre.mediawiki.route-traffic: Use switchdc defined services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056471 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:21:58] (03CR) 10Clément Goubert: [C:03+2] sre.switchdc.mediawiki: No-op formatting change [cookbooks] - 10https://gerrit.wikimedia.org/r/1056472 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:22:45] (03PS1) 10Filippo Giunchedi: ignore, test [alerts] - 10https://gerrit.wikimedia.org/r/1056897 [10:23:50] (03CR) 10Hnowlan: [C:03+1] "lgtm, one minor nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055872 (owner: 10Effie Mouzeli) [10:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:56] (03CR) 10Clément Goubert: [C:03+1] Release 0.0.15 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1056896 (owner: 10Elukey) [10:25:21] (03CR) 10Clément Goubert: [C:03+1] mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:25:57] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: No-op formatting change [cookbooks] - 10https://gerrit.wikimedia.org/r/1056472 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:26:15] (03CR) 10Clément Goubert: [C:03+2] sre.switchdc.mediawiki: Remove legacy services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056473 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:27:25] (03PS1) 10Ayounsi: Prometheus SSH probe: ignore network devices [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) [10:27:50] (03CR) 10CI reject: [V:04-1] Prometheus SSH probe: ignore network devices [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [10:29:10] (03CR) 10Elukey: [C:03+2] Release 0.0.15 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1056896 (owner: 10Elukey) [10:30:03] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Remove legacy services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056473 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:31:18] (03CR) 10Stevemunene: [C:03+1] Add the new presto servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1056886 (https://phabricator.wikimedia.org/T370543) (owner: 10Btullis) [10:32:45] (03PS1) 10Ayounsi: Move the /srv/netbox/ directory creation behind netbox4 flag [puppet] - 10https://gerrit.wikimedia.org/r/1056901 (https://phabricator.wikimedia.org/T336275) [10:34:31] (03PS4) 10Hnowlan: mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [10:35:25] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Port or delete "git repo needs merge" icinga check - 
https://phabricator.wikimedia.org/T370530#10014130 (10Clement_Goubert) I think the `mediawiki-config` check is still needed since we're building the image with that re... [10:35:45] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [10:42:30] !log upload docker-report 0.0.15 to bullseye-wimedia and upgrade build2001 [10:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:25] (03PS2) 10Giuseppe Lavagetto: haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) [10:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:08] (03PS2) 10Ayounsi: Prometheus SSH probe: ignore network devices [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) [11:06:25] FIRING: [3x] SystemdUnitFailed: prometheus-ipmi-exporter.service on kubernetes1051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:43] (03PS1) 10Btullis: Use a 10 GB persistent volume for mongodb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056910 (https://phabricator.wikimedia.org/T365839) [11:10:38] FIRING: [2x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:03] (03PS2) 10Btullis: Use a 10 GB persistent volume for mongodb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056910 (https://phabricator.wikimedia.org/T365839) [11:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:41] (03CR) 10Btullis: [C:03+2] Use a 10 GB persistent volume for mongodb (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056910 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [11:18:36] (03Merged) 10jenkins-bot: Use a 10 GB persistent volume for mongodb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056910 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [11:18:37] (03CR) 10Btullis: [C:03+2] Add the new presto servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1056886 (https://phabricator.wikimedia.org/T370543) (owner: 10Btullis) [11:21:30] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10014201 (10BTullis) a:05BTullis→03None >>! In T370543#10011851, @RobH wrote: > @btullis, > > Please note that while this racking task is filed, we... 
[11:22:03] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [11:22:14] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [11:22:52] (03PS1) 10Ayounsi: Netbox 4: add transition flag for Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) [11:23:17] (03CR) 10CI reject: [V:04-1] Netbox 4: add transition flag for Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:25:39] (03PS2) 10Ayounsi: Netbox 4: add transition flag for Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) [11:26:05] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:31:49] (03PS3) 10Ayounsi: Netbox 4: add transition flag for Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) [11:32:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:32:15] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:39:18] (03PS2) 10Btullis: growthbook: deploy mongodb with auth enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056887 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [11:40:08] (03CR) 10Btullis: [C:03+2] growthbook: deploy mongodb with auth enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056887 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [11:41:37] (03Merged) 10jenkins-bot: growthbook: deploy mongodb with auth enabled [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1056887 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [11:42:16] (03CR) 10Btullis: [C:03+1] Add MPIC service listener proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [11:42:31] (03CR) 10Btullis: [C:03+2] Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [11:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:07] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [11:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:29] !log cgoubert@deploy1002 Started scap sync-world: Deploying mpic envoy listener - 1056163 - T366234 [11:51:36] (03CR) 10Filippo Giunchedi: [C:03+2] site: add insetup for prometheus[12]00[78] [puppet] - 10https://gerrit.wikimedia.org/r/1056868 (https://phabricator.wikimedia.org/T370426) (owner: 10Filippo Giunchedi) [11:51:38] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234 [11:53:31] 10ops-codfw, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10014276 (10fgiunchedi) puppet.git part is done [11:53:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q1:rack/setup/install prometheus100[78] - 
https://phabricator.wikimedia.org/T370426#10014273 (10fgiunchedi) a:05fgiunchedi→03None >>! In T370426#10011859, @RobH wrote: > @fgiunchedi, > > Please note that while this racking task is fi... [11:53:47] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [11:56:49] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010 (10Michael) 03NEW [11:57:21] (03CR) 10Filippo Giunchedi: "Change logic LGTM, see inline though" [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [11:59:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [11:59:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014314 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1200) [12:00:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:49] (03CR) 10Filippo Giunchedi: "See inline for a simpler solution" [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [12:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[12:07:26] (03CR) 10Jobo: [C:03+1] admin: add tappof to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/1056452 (owner: 10Filippo Giunchedi) [12:08:44] !log cgoubert@deploy1002 sync-world aborted: Deploying mpic envoy listener - 1056163 - T366234 (duration: 17m 59s) [12:08:48] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234 [12:12:10] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:12:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:12:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:13:06] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:13:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [12:14:14] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:14:50] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Port or delete "git repo needs merge" icinga check - https://phabricator.wikimedia.org/T370530#10014344 (10fgiunchedi) Thank you @Clement_Goubert ! 
Makes sense to me, I'll adjust the task accordingly [12:15:27] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:15:28] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:16:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [12:16:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:16:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:17:46] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:17:47] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:18:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:18:50] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [12:20:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [12:20:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [12:20:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [12:20:53] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [12:21:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056927 [12:21:25] 07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q1): Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#10014357 (10fgiunchedi) [12:22:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [12:22:05] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [12:22:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw 
troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10014356 (10cmooney) >>! In T369011#10014019, @Clement_Goubert wrote: > Host BGP re-enabled, back in `Activ... [12:23:05] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [12:23:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:23:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10014360 (10Clement_Goubert) >>! In T369011#10014356, @cmooney wrote: >>>! In T369011#10014019, @Clement_Go... [12:23:18] (03PS3) 10Filippo Giunchedi: admin: add tappof to sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/1056452 [12:24:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:24:15] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:25:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:25:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:26:40] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:26:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:27:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:27:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [12:28:30] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [12:28:31] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [12:29:02] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [12:29:04] (03CR) 10Filippo Giunchedi: [C:03+2] admin: add tappof to sre-admins [puppet] - 
10https://gerrit.wikimedia.org/r/1056452 (owner: 10Filippo Giunchedi) [12:33:14] (03PS1) 10Btullis: Configure growthbook/mongodb deployment with a recreate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056936 (https://phabricator.wikimedia.org/T365839) [12:33:25] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:34:00] (03CR) 10CI reject: [V:04-1] Configure growthbook/mongodb deployment with a recreate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056936 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [12:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:11] (03PS2) 10Btullis: Configure growthbook/mongodb deployment with a recreate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056936 (https://phabricator.wikimedia.org/T365839) [12:38:32] (03CR) 10Btullis: [C:03+2] Configure growthbook/mongodb deployment with a recreate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056936 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [12:38:48] (03PS1) 10Giuseppe Lavagetto: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - 10https://gerrit.wikimedia.org/r/1056937 [12:39:20] (03Merged) 10jenkins-bot: Configure growthbook/mongodb deployment with a recreate strategy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1056936 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [12:42:32] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [12:42:44] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [12:50:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:08] (03CR) 10Giuseppe Lavagetto: [C:03+2] puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 (owner: 10Giuseppe Lavagetto) [12:56:24] (03CR) 10Elukey: mediawiki: fetch active deployment host (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [12:56:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:56:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host pc1017.eqiad.wmnet with OS bookworm [12:56:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014487 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed: - pc1017 (**FAIL**)... 
[12:57:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm executed with errors: - pc1017 (*... [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1300) [13:00:05] physikerwelt and joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:34] hi [13:00:35] Hi, I'm here :) [13:00:50] I can probably deploy at some point in the window, not sure if in a few minutes or in ~20 minutes ^^ [13:01:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [13:01:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [13:01:11] * Lucas_WMDE peeks at the patches [13:01:52] (03CR) 10Bking: [C:03+2] elasticsearch: remove obsolete alerts [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: 10Bking) [13:02:43] (03PS17) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [13:03:01] (03CR) 10Bking: [C:03+2] elasticsearch: remove obsolete alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054647 (https://phabricator.wikimedia.org/T359033) (owner: 10Bking) [13:03:32] okay, I can deploy! 
[13:03:53] nice [13:04:40] (03CR) 10CDanis: [C:03+1] haproxy: add confd_file define [puppet] - 10https://gerrit.wikimedia.org/r/1056875 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [13:05:03] I have the debug extension installed, and am ready to test [13:05:22] I will not need to test anything [13:05:31] I’m still looking at the change, slightly confused [13:05:31] (03CR) 10CI reject: [V:04-1] maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [13:05:34] (03PS3) 10CDanis: haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [13:05:35] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [13:05:47] (03PS2) 10Physikerwelt: Enable optional MathJax rendering in everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055397 (https://phabricator.wikimedia.org/T370507) [13:06:26] ok, I was wondering if the beta-specific config was obsolete now [13:06:31] but I guess reverting https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1055395/2/wmf-config/CommonSettings-labs.php would be incorrect [13:06:47] idk why CS-labs needs to override the math config but I don’t mind keeping it like that for this deployment [13:07:00] so, good to go I think [13:07:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055397 (https://phabricator.wikimedia.org/T370507) (owner: 10Physikerwelt) [13:08:03] after this change the labs specific config *could* be removed. 
However, it would not have an effect [13:08:19] (03Merged) 10jenkins-bot: Enable optional MathJax rendering in everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055397 (https://phabricator.wikimedia.org/T370507) (owner: 10Physikerwelt) [13:08:58] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1055397|Enable optional MathJax rendering in everywhere (T370507)]] [13:09:03] T370507: Enable MathJax rendering as opt-in - https://phabricator.wikimedia.org/T370507 [13:12:06] ^^works [13:12:28] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, physikerwelt: Backport for [[gerrit:1055397|Enable optional MathJax rendering in everywhere (T370507)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:12:57] that was a bit too soon, really :P [13:13:06] I guess you got lucky [13:13:44] I mean it works when the debug toolbar is enabled [13:14:00] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, physikerwelt: Continuing with sync [13:14:02] if it is disabled it falls back to the default [13:14:11] yeah, but you tested it before scap told you it was on the debug servers ^^ [13:15:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [13:15:32] anyway, it’s deploying now [13:15:40] yes, I was lucky [13:17:24] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.9.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056939 [13:17:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [13:18:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1055397|Enable optional MathJax rendering in everywhere (T370507)]] (duration: 09m 57s) [13:19:00] T370507: Enable MathJax rendering as opt-in - https://phabricator.wikimedia.org/T370507 [13:19:50] (03PS2) 10Joely Rooke WMDE: Add wikibase client interaction 
stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) [13:20:16] Lucas_WMDE thank you again. That was very smooth. [13:20:34] np :) [13:20:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:21:33] (03Merged) 10jenkins-bot: Add wikibase client interaction stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:22:03] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1055228|Add wikibase client interaction stream (T370045)]] [13:22:09] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [13:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:23] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, joelyrookewmde: Backport for [[gerrit:1055228|Add wikibase client interaction stream (T370045)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:45] joelyrookewmde: I guess we can deploy directly? [13:25:01] yes please! 
[13:25:09] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, joelyrookewmde: Continuing with sync [13:25:16] No testing since the rest of the tracking is not merged yet anyway [13:25:22] (03PS2) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [13:25:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:59] (03CR) 10Volans: [C:03+1] "Thanks for the release" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056939 (owner: 10Elukey) [13:27:51] (03PS2) 10Elukey: CHANGELOG: add changelogs for release v8.9.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056939 [13:28:45] (03CR) 10Elukey: CHANGELOG: add changelogs for release v8.9.0 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056939 (owner: 10Elukey) [13:28:52] (03PS1) 10EoghanGaffney: apt-staging: Remove log directive from staging distributions file [puppet] - 10https://gerrit.wikimedia.org/r/1056941 [13:29:01] (03CR) 10Peter Fischer: [C:03+1] "Looks reasonable to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: 10DCausse) [13:29:59] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1055228|Add wikibase client interaction stream (T370045)]] (duration: 07m 56s) [13:30:14] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [13:30:30] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Cordoning kubernetes1051 for missed upgrades - T369011] [13:30:39] !log UTC afternoon backport+config window done [13:30:47] T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011 [13:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:56] Thank you Lucas! [13:31:04] yw :) [13:31:16] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10014665 (10DMburugu) Approved [13:32:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubernetes1051.eqiad.wmnet [13:32:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10014672 (10ops-monitoring-bot) Host rebooted by cgoubert@cumin1002 with reason: Missed kernel upgrade [13:34:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host pc1017.eqiad.wmnet with OS bookworm [13:34:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed: - pc1017 (**FAIL**)... 
[13:34:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10014676 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm executed with errors: - pc1017 (*... [13:38:37] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: cookbook failed after the fist "go" host cloudcephmod1004 - https://phabricator.wikimedia.org/T371024 (10Papaul) 03NEW [13:40:21] (03CR) 10Ebernhardson: [C:03+1] "Seems like a reasonable starting place for the settings, estimating from the dashboards this should allow the current traffic to flow most" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: 10DCausse) [13:40:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1051.eqiad.wmnet [13:40:37] (03PS1) 10Peter Fischer: EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056944 (https://phabricator.wikimedia.org/T366253) [13:40:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:40] and with T371023 I’ve apparently fulfilled my quota of filing one task from logspam-watch per backport window ;) [13:40:41] T371023: Error: Call to a member function canExist() on null - https://phabricator.wikimedia.org/T371023 [13:40:44] * Lucas_WMDE done deploying [13:41:22] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes1051.eqiad.wmnet,cluster=kubernetes,service=kubesvc [reason: Uncordoning kubernetes1051 for missed upgrades - T369011] [13:41:25] FIRING: [4x] SystemdUnitFailed: 
docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:44] T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011 [13:42:38] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [13:42:48] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [13:42:50] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [13:43:06] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [13:43:07] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [13:43:30] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [13:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:25] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:45:34] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:45:35] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [13:45:50] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [13:45:51] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [13:47:06] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [13:47:43] (03PS1) 10Ssingh: 
wikimedia.org: add DKIM selector for Dayforce [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) [13:48:20] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [13:48:24] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [13:48:25] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply [13:48:29] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [13:48:30] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply [13:48:38] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [13:50:25] (03CR) 10Volans: mediawiki: fetch active deployment host (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [13:51:03] (03CR) 10Ssingh: "I am quite unhappy to see Dayforce use a generic word such as "corporate" for the DKIM selector. There is no public documentation but that" [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) (owner: 10Ssingh) [13:52:04] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [13:52:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [13:52:08] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: sync [13:52:14] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [13:52:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10014806 (10cmooney) >>! In T370852#10010089, @Volans wrote: > Without too much previous experience from past migrations, I think we could tackle it per... 
[13:53:34] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: sync [13:54:16] (03CR) 10Effie Mouzeli: "we have it in the wizard, so it ends up being shipped in values.yaml anyway :/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055872 (owner: 10Effie Mouzeli) [13:57:32] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync [13:57:33] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: sync [14:03:15] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [14:03:46] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [14:03:47] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [14:04:13] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [14:04:14] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: sync [14:04:21] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [14:04:38] (03PS2) 10Ssingh: wikimedia.org: add DKIM selector for Dayforce [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) [14:04:51] (03CR) 10Kamila Součková: [C:03+1] shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:06:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:01] (03CR) 10Ottomata: [C:03+1] EventStreamConfig for mediawiki.cirrussearch.page_weighted_tags_change.rc0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056944 
(https://phabricator.wikimedia.org/T366253) (owner: 10Peter Fischer) [14:12:10] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: sync [14:12:11] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: sync [14:13:15] (03CR) 10AOkoth: [C:03+2] install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [14:13:24] (03PS4) 10AOkoth: install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) [14:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:35] (03PS3) 10Ssingh: wikimedia.org: add DKIM selector for Dayforce [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) [14:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:51] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: sync [14:21:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T367856)', diff saved to https://phabricator.wikimedia.org/P66925 and previous config saved to /var/cache/conftool/dbconfig/20240725-142155-marostegui.json [14:22:01] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:23:23] (03CR) 10Kamila Součková: [C:03+1] mesh.configuration: set idle_timeout to timeout [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:28:47] (03CR) 10Ayounsi: Prometheus SSH probe: ignore network devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [14:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:39] (03PS1) 10Alexandros Kosiaris: mesh: Patch faultinjection config stanza mistake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056953 [14:30:18] (03CR) 10AOkoth: [V:03+2 C:03+2] install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [14:30:38] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.9.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056939 (owner: 10Elukey) [14:33:03] (03PS1) 10Elukey: Upstream release v8.9.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1056954 [14:33:49] (03CR) 10Elukey: [C:03+1] Move the /srv/netbox/ directory creation behind netbox4 flag [puppet] - 10https://gerrit.wikimedia.org/r/1056901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:52] (03CR) 10Elukey: [C:03+1] "I'd have preferred to have something more configurable for Redis db instances to target, but it is already hardcoded so we can go ahead wi" [puppet] - 10https://gerrit.wikimedia.org/r/1056911 
(https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:36:34] !log dcausse@deploy1002 Started deploy [airflow-dags/search@87b91b6]: search: drop hourly weighted_tags support [14:36:42] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v8.9.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1056954 (owner: 10Elukey) [14:36:54] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@87b91b6]: search: drop hourly weighted_tags support (duration: 00m 20s) [14:37:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P66926 and previous config saved to /var/cache/conftool/dbconfig/20240725-143703-marostegui.json [14:37:58] (03CR) 10Ayounsi: [C:03+2] Netbox 4: add transition flag for Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1056911 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:38:01] (03CR) 10Ayounsi: [C:03+2] Move the /srv/netbox/ directory creation behind netbox4 flag [puppet] - 10https://gerrit.wikimedia.org/r/1056901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:27] (03PS2) 10Alexandros Kosiaris: mesh: Patch faultinjection config stanza mistake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056953 [14:40:55] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1056466/3424/" [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [14:41:54] (03CR) 10Ssingh: [C:03+1] profile::haproxy: move tls_terminator.pp to profile module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [14:44:57] !log sukhe@puppetmaster1001 
conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org [reason: upgrading anycast-hc: T370068] [14:45:11] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:46:41] !log [dns4003] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068 [14:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:03] (03CR) 10Alexandros Kosiaris: [C:03+1] mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:48:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org [reason: finished upgrading anycast-hc: T370068] [14:51:54] !log running authdns-update after dns4003 depool [14:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P66928 and previous config saved to /var/cache/conftool/dbconfig/20240725-145210-marostegui.json [14:53:32] !log uploaded spicerack_8.9.0 to apt.wikimedia.org bullseye-wikimedia [14:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:30] (03CR) 10Elukey: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:00:05] dduvall and dancy: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1500). 
[15:00:38] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:04] (03CR) 10Fabfur: [C:03+1] "Looks good to me, even if a bit fishy :)" [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) (owner: 10Ssingh) [15:02:23] (03CR) 10Ssingh: "🐟" [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) (owner: 10Ssingh) [15:02:27] (03CR) 10Ssingh: [C:03+2] wikimedia.org: add DKIM selector for Dayforce [dns] - 10https://gerrit.wikimedia.org/r/1056946 (https://phabricator.wikimedia.org/T370961) (owner: 10Ssingh) [15:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:19] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [15:07:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T367856)', diff saved to https://phabricator.wikimedia.org/P66929 and previous config saved to /var/cache/conftool/dbconfig/20240725-150717-marostegui.json [15:07:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:07:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:07:36] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:07:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T367856)', diff saved to https://phabricator.wikimedia.org/P66930 and previous config saved to /var/cache/conftool/dbconfig/20240725-150739-marostegui.json [15:08:53] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: DKIM Key to Public DNS (Dayforce) - https://phabricator.wikimedia.org/T370961#10015168 (10ssingh) 05Open→03Resolved a:03ssingh @APaul-WMF: This has been merged and is complete from our end; you can let Dayforce know about the same. If there are any con... [15:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:32] (03CR) 10Cathal Mooney: [C:03+1] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [15:10:47] can we/should we silence the uncomitted DNS changes for netbox1003? 
(which I guess is still WIP) [15:15:41] !log upgrade spicerack to 8.9.0 on cumin nodes [15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:31] (03PS7) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [15:16:31] (03PS1) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) [15:17:36] (03CR) 10Ebernhardson: "I would have never realized, and probably had odd errors because of it. Thanks! Current intent is to deploy this monday or tuesday of next" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [15:26:19] (03PS2) 10Amire80: Add muddyb255 to Planet [puppet] - 10https://gerrit.wikimedia.org/r/1054688 [15:26:25] (03PS8) 10Amire80: planet: add various feeds, reorganize [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [15:26:53] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: add workaround for PXE boot issue on some NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:28:11] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/php-1.43.0-wmf.11/includes/libs/http/MultiHttpClient.p... 
- https://phabricator.wikimedia.org/T369186#10015295 [15:30:56] (03Merged) 10jenkins-bot: sre.hosts.reimage: add workaround for PXE boot issue on some NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:34:58] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10015346 (10fgiunchedi) Please excuse the drive-by comment, as a ONFIRE alumni I seem to remember we talked about hosting corto on alert hosts. As a o11y member, the offer is still valid in terms of hosti... [15:48:50] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10015377 (10Jhancock.wm) a:03Jhancock.wm [15:50:48] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10015395 (10Jhancock.wm) [15:53:34] (03PS1) 10Ayounsi: Netbox 4 breaking change (choices is now in netbox) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056972 [15:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: cookbook failed after the fist "go" host cloudcephmod1004 - https://phabricator.wikimedia.org/T371024#10015410 (10elukey) This seems to be the issue: ` INFO:cumin.transports.clustershell.SyncEventHandler:Completed command 'puppet lookup --render-... 
[15:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:03] (03CR) 10Elukey: [C:03+1] Netbox 4 breaking change (choices is now in netbox) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056972 (owner: 10Ayounsi) [16:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:07:10] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [16:08:17] (03PS1) 10Cwhite: site: add insetup configs for logging-sd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056973 (https://phabricator.wikimedia.org/T370546) [16:09:26] (03PS1) 10Scott French: Rename / reimage one appserver to k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1056974 (https://phabricator.wikimedia.org/T351074) [16:09:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:51] (03CR) 10Dzahn: [C:03+2] Add muddyb255 to Planet [puppet] - 10https://gerrit.wikimedia.org/r/1054688 (owner: 10Amire80) [16:10:34] (03PS19) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [16:10:46] (03CR) 10Clément Goubert: [C:03+1] Rename / reimage one appserver to k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1056974 (https://phabricator.wikimedia.org/T351074) (owner: 10Scott French) [16:10:52] (03CR) 10Elukey: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [16:11:34] (03PS1) 10Btullis: 
Update the hostname for mongodb access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056975 (https://phabricator.wikimedia.org/T365839) [16:12:28] (03CR) 10Scott French: [C:03+2] Rename / reimage one appserver to k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1056974 (https://phabricator.wikimedia.org/T351074) (owner: 10Scott French) [16:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:44] (03PS1) 10JHathaway: pcc: improve error checking [puppet] - 10https://gerrit.wikimedia.org/r/1056976 (https://phabricator.wikimedia.org/T367547) [16:14:47] (03CR) 10Btullis: [C:03+2] Update the hostname for mongodb access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056975 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:15:02] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10015493 (10sbassett) @aude service-template-node is indeed quite dated and fairly unmaintained. And it would be difficult to recommend it for new pr... [16:15:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: cookbook failed after the fist "go" host cloudcephmod1004 - https://phabricator.wikimedia.org/T371024#10015498 (10Papaul) i get this when i run puppet on the node from cumin ` Error: Could not retrieve catalog from remote server: Error 500 on SER... 
[16:15:08] (03CR) 10CI reject: [V:04-1] pcc: improve error checking [puppet] - 10https://gerrit.wikimedia.org/r/1056976 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [16:15:44] (03Merged) 10jenkins-bot: Update the hostname for mongodb access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056975 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:17:08] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:18:19] !log swfrench@cumin1002 START - Cookbook sre.hosts.rename from mw1364 to wikikube-worker1032 [16:18:24] !log swfrench@cumin1002 START - Cookbook sre.dns.netbox [16:18:58] (03PS2) 10JHathaway: pcc: improve error checking [puppet] - 10https://gerrit.wikimedia.org/r/1056976 (https://phabricator.wikimedia.org/T367547) [16:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: cookbook failed after the fist "go" host cloudcephmod1004 - https://phabricator.wikimedia.org/T371024#10015525 (10Dzahn) I would suggest to apply the "insetup" role so that the reimage can move ahead. Then after reimage is done the production rol... 
[16:21:32] !log swfrench@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1364 to wikikube-worker1032 - swfrench@cumin1002" [16:22:42] (03PS1) 10Btullis: Growthbook: disable aythmechanism PLAIN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056977 (https://phabricator.wikimedia.org/T365839) [16:23:31] !log swfrench@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1364 to wikikube-worker1032 - swfrench@cumin1002" [16:23:31] !log swfrench@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:31] !log swfrench@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1032 [16:23:53] (03CR) 10Btullis: [C:03+2] Growthbook: disable aythmechanism PLAIN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056977 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:24:41] (03Merged) 10jenkins-bot: Growthbook: disable aythmechanism PLAIN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056977 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:24:59] !log swfrench@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1032 [16:25:07] !log swfrench@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1364 to wikikube-worker1032 [16:27:34] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:27:51] (03CR) 10CDanis: [C:03+1] pcc: improve error checking [puppet] - 10https://gerrit.wikimedia.org/r/1056976 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [16:27:54] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:29:35] !log swfrench@cumin1002 START - Cookbook sre.dns.wipe-cache 
wikikube-worker1032.eqiad.wmnet on all recursors [16:29:38] !log swfrench@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1032.eqiad.wmnet on all recursors [16:30:18] (03PS1) 10Papaul: Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) [16:30:54] !log swfrench@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1032.eqiad.wmnet with OS bullseye [16:31:11] (03PS1) 10Hashar: cumin: clone homer public repo with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) [16:31:37] (03PS3) 10Dzahn: ci: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) [16:35:27] (03PS2) 10Papaul: Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) [16:35:51] (03CR) 10CI reject: [V:04-1] Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) (owner: 10Papaul) [16:36:28] (03CR) 10JHathaway: [C:03+2] pcc: improve error checking [puppet] - 10https://gerrit.wikimedia.org/r/1056976 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [16:38:02] (03PS3) 10Papaul: Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) [16:38:25] (03CR) 10CI reject: [V:04-1] Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) (owner: 10Papaul) [16:38:54] (03CR) 10Papaul: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) (owner: 10Papaul) [16:39:39] (03CR) 10Hashar: cumin: clone homer public repo 
with default parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:39:51] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:40:31] (03PS1) 10Dzahn: wikistats: adjust min and max size hints for cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1056983 [16:41:19] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10015615 (10BCornwall) That's right! Thanks for reminding. Anyone have any qualms with going that route? [16:41:29] (03PS4) 10Papaul: Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) [16:42:43] (03CR) 10Papaul: [C:03+2] Update new cloudcephmon node to use insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1056980 (https://phabricator.wikimedia.org/T371024) (owner: 10Papaul) [16:45:22] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:45:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10015630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [16:45:40] !log swfrench@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1032.eqiad.wmnet with reason: host reimage [16:46:34] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@8c8f4c2]: Add new fields to search_satisfaction metrics [16:46:53] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@8c8f4c2]: Add new fields to search_satisfaction metrics (duration: 00m 19s) [16:48:10] (03CR) 10Hashar: "The Puppet 5 compilation fails due to an 
unknown clause `group_by`, then I guess it has long migrated to Puppet 7 which passes. The compi" [puppet] - 10https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:48:41] !log swfrench@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1032.eqiad.wmnet with reason: host reimage [16:49:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: cookbook failed after the fist "go" host cloudcephmod1004 - https://phabricator.wikimedia.org/T371024#10015637 (10Papaul) 05Open→03Resolved a:03Papaul This was failing because the node didn't have the insteup role .... [16:53:29] (03PS1) 10Hashar: cumin: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) [16:58:48] (03CR) 10Hashar: "This probably need the parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056981/ to be merged before triggering a PCC to" [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [16:59:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [17:00:04] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1700) [17:02:07] .me looks to see if there is anything to do today... 
[17:03:27] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@b1a04fc]: bump discolytics to 0.25 [17:03:53] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@b1a04fc]: bump discolytics to 0.25 (duration: 00m 25s) [17:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:52] !log running homer 'cr*eqiad*' commit 'T351074' for k8s worker reimage [17:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:10] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [17:08:40] !log swfrench@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1032.eqiad.wmnet with OS bullseye [17:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:18] (03PS5) 10Andrea Denisse: burrow: Create the /var/run/burrow dir with systemd-tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) [17:11:26] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T371045 (10Scott_French) 03NEW [17:14:36] (03PS4) 10Ayounsi: Cookbooks: fix Netbox 4 breaking changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) [17:14:36] (03PS2) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) 
[17:14:36] (03PS1) 10Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) [17:17:00] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1056579/3426/kafkamon1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [17:18:04] (03CR) 10Dzahn: [C:03+2] wikistats: adjust min and max size hints for cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1056983 (owner: 10Dzahn) [17:20:19] !log swfrench@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1032.eqiad.wmnet),cluster=kubernetes,service=kubesvc [reason: T351074 - pooling after reimage] [17:20:39] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [17:22:01] (03PS6) 10Andrea Denisse: burrow: Create a runtime directory in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) [17:25:47] (03CR) 10David Caro: [C:03+2] hieradata: Update Striker to 2024-07-20-113830-production [puppet] - 10https://gerrit.wikimedia.org/r/1056263 (https://phabricator.wikimedia.org/T369395) (owner: 10BryanDavis) [17:26:15] (03PS1) 10Aklapper: Include tags and subscibers in quarterly Phabricator data for WMF QLS [puppet] - 10https://gerrit.wikimedia.org/r/1056992 (https://phabricator.wikimedia.org/T370947) [17:32:50] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:33:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10015750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye 
ex... [17:34:26] (03PS1) 10Dzahn: Revert "wikistats: adjust min and max size hints for cinder volume" [puppet] - 10https://gerrit.wikimedia.org/r/1056993 [17:35:05] (03CR) 10Dzahn: [C:03+2] Revert "wikistats: adjust min and max size hints for cinder volume" [puppet] - 10https://gerrit.wikimedia.org/r/1056993 (owner: 10Dzahn) [17:37:29] (03PS1) 10Dzahn: wikistats: drop min_gb size hint for cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1056994 [17:38:33] (03CR) 10Dzahn: [C:03+2] wikistats: drop min_gb size hint for cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1056994 (owner: 10Dzahn) [17:43:19] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055491/3427/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:48:35] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10015823 (10Arnoldokoth) @Jclark-ctr I've updated Puppet. The desired RAID level is 10. [17:51:52] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Apart from a bunch of whitespace errors this applies cleanly." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1056058 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [17:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:55:56] (03CR) 10Dzahn: "Adding the "wmflib::role::hosts('gerrit').filter " part here was a nice idea but it leads to a chicken-egg problem. 
When you have a new pu" [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [17:56:07] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [17:56:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [18:00:04] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T1800). [18:00:24] (03PS1) 10Dzahn: gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas [puppet] - 10https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) [18:01:38] (03CR) 10Dzahn: [C:03+1] "also the current code means it doesn't use the list from Hiera, but I want to change that for reasons mentioned at the bottom of https://g" [puppet] - 10https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [18:02:13] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056997 (https://phabricator.wikimedia.org/T366960) [18:02:14] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056997 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:02:53] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056997 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:04:17] (03PS1) 10Dzahn: gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 [18:04:49] (03CR) 10CI reject: [V:04-1] gerrit: use list 
of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [18:05:50] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [18:06:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10015870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [18:06:40] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:03] (03CR) 10Dzahn: [C:03+2] "turns out it's not possible to let puppet mount a cinder volume of 1 GB size:" [puppet] - 10https://gerrit.wikimedia.org/r/1056994 (owner: 10Dzahn) [18:07:51] !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [18:10:26] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1004.eqiad.wmnet with reason: host reimage [18:12:26] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.15 refs T366960 [18:12:48] T366960: 1.43.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T366960 [18:26:46] !log pt1979@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1002" [18:27:02] (03PS1) 10Dzahn: cinderutils: allow floating point numbers for min_gb and max_gb [puppet] - 10https://gerrit.wikimedia.org/r/1057000 [18:33:49] (03PS1) 10Michael Große: Ignore help-links with no title configured [extensions/GrowthExperiments] 
(wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) [18:34:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: 10Michael Große) [18:37:19] (03CR) 10Ottomata: "K, lemme know when this one is out, and I can restart eventgate-main. Or, you are welcome to as well :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [18:39:22] (03PS1) 10Dzahn: wikistats: add systemd timer to copy backups to external cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1057002 [18:59:30] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [19:12:08] !log pt1979@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1002" [19:12:09] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [19:12:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10016000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye co... 
[19:24:08] (03PS9) 10Amire80: planet: add various feeds, reorganize [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [19:24:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10016019 (10Papaul) a:05Papaul→03Jclark-ctr @Jclark-ctr looking at 1004 i realized that com2 was not set ` System BIOS System BIOS Settings > Se... [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T2000) [20:00:05] MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] o/ [20:08:39] (03CR) 10Scott French: "Thanks for finding this!" [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [20:15:53] (03PS1) 10Scott French: P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 [20:16:17] (03CR) 10CI reject: [V:04-1] P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 (owner: 10Scott French) [20:21:23] (03PS2) 10Scott French: P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 [20:25:39] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057010 (owner: 10Scott French) [20:26:52] (03PS1) 10Umherirrender: Revert "Use expression builder to avoid IDatabase::makeList" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057012 (https://phabricator.wikimedia.org/T371052) [20:33:04] (03CR) 10Scott French: "I think something like https://gerrit.wikimedia.org/r/1057010 will do it, if the desired state is to switch 
the script over to just using " [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [20:38:54] (03CR) 10CI reject: [V:04-1] Revert "Use expression builder to avoid IDatabase::makeList" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057012 (https://phabricator.wikimedia.org/T371052) (owner: 10Umherirrender) [20:40:09] (03CR) 10Umherirrender: "recheck" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057012 (https://phabricator.wikimedia.org/T371052) (owner: 10Umherirrender) [20:51:51] (03CR) 10Dzahn: [C:03+2] "thanks! lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [20:52:12] (03CR) 10Dzahn: [C:03+2] "thanks !:)" [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [20:52:58] (03PS1) 10Catrope: Revert "beta: Work around T370517 by remapping the affected i18n message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057015 [20:53:05] (03PS2) 10Catrope: Revert "beta: Work around T370517 by remapping the affected i18n message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057015 [20:53:26] (03Abandoned) 10Catrope: Revert "beta: Work around T370517 by remapping the affected i18n message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057015 (owner: 10Catrope) [20:53:56] (03CR) 10Dzahn: "the -1 from CI is due to the very thing this is supposed to fix!" [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [20:55:54] (03CR) 10Dzahn: "before this can be applied need a fix for the current puppet run like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057000" [puppet] - 10https://gerrit.wikimedia.org/r/1057002 (owner: 10Dzahn) [20:59:27] ... 
[21:14:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:14:47] (03PS1) 10Ladsgroup: Add CSS class to watchlist pending notice [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057016 (https://phabricator.wikimedia.org/T191156) [21:17:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2221 to codfw - jhancock@cumin2002" [21:18:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2221 to codfw - jhancock@cumin2002" [21:18:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:18:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2221.mgmt.codfw.wmnet with reboot policy FORCED [21:18:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2222.mgmt.codfw.wmnet with reboot policy FORCED [21:18:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2223.mgmt.codfw.wmnet with reboot policy FORCED [21:18:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2224.mgmt.codfw.wmnet with reboot policy FORCED [21:18:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2225.mgmt.codfw.wmnet with reboot policy FORCED [21:19:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2222.mgmt.codfw.wmnet with reboot policy FORCED [21:19:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2225.mgmt.codfw.wmnet with reboot policy FORCED [21:21:31] (03PS1) 10GergesShamon: [eswiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057018 (https://phabricator.wikimedia.org/T370158) [21:21:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
db2222.mgmt.codfw.wmnet with reboot policy FORCED [21:22:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2225.mgmt.codfw.wmnet with reboot policy FORCED [21:22:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057018 (https://phabricator.wikimedia.org/T370158) (owner: 10GergesShamon) [21:23:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon) [21:24:01] (03PS1) 10Dzahn: wikistats: temp comment out mounting of the cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1057019 [21:31:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2224.mgmt.codfw.wmnet with reboot policy FORCED [21:32:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2221.mgmt.codfw.wmnet with reboot policy FORCED [21:32:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2223.mgmt.codfw.wmnet with reboot policy FORCED [21:32:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2222.mgmt.codfw.wmnet with reboot policy FORCED [21:33:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2225.mgmt.codfw.wmnet with reboot policy FORCED [21:34:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2221'] [21:34:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2222'] [21:34:56] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2223'] [21:35:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2224'] [21:35:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2225'] [21:35:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2221'] [21:35:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2222'] [21:35:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2223'] [21:35:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2224'] [21:35:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2225'] [21:35:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2221'] [21:36:02] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2222'] [21:36:10] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2223'] [21:36:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2224'] [21:36:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2225'] [21:37:44] (03PS1) 10JHathaway: remove mx{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1057020 (https://phabricator.wikimedia.org/T325409) [21:41:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2221'] [21:41:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=0) upgrade firmware for hosts ['db2222'] [21:41:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2223'] [21:42:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2225'] [21:42:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2224'] [21:46:59] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10016447 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/32; - member ge-1/0/32; [ed... [21:48:22] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#10016450 (10Papaul) [21:49:03] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#10016451 (10Papaul) [21:49:29] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10016457 (10Papaul) [21:49:30] (03CR) 10Dzahn: [C:03+2] wikistats: temp comment out mounting of the cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1057019 (owner: 10Dzahn) [21:52:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:53:21] (03CR) 10Dzahn: [C:03+2] "unfortunately there is a problem with the pt feed that ended in https://phabricator.wikimedia.org/T371064 being created" [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [21:54:13] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade for T370973 [21:54:20] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2240 to codfw - jhancock@cumin2002" [21:55:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2240 to codfw - jhancock@cumin2002" [21:55:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:59:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2240.mgmt.codfw.wmnet with reboot policy FORCED [22:00:37] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade for T370973 [22:01:00] (03CR) 10Dzahn: [C:03+2] wikistats: add systemd timer to copy backups to external cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/1057002 (owner: 10Dzahn) [22:02:00] (03CR) 10Dzahn: [C:03+2] "but then when I repeat the update command I don't get the same error.. intermittent? was there any maintenance on wikimedia.pt maybe?" 
[puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [22:03:57] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade for T370973 [22:03:57] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade for T370973 [22:04:32] !log eoghan@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade for T370973 [22:06:40] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:46] !log eoghan@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade for T370973 [22:11:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2240.mgmt.codfw.wmnet with reboot policy FORCED [22:13:29] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10016559 (10Papaul) ` [edit interfaces interface-range disabled] - member ge-0/0/29; - member ge-0/0/30; - member ge-0/0/31; - member ge-1/0/2... 
[22:17:03] (03CR) 10Dzahn: [C:03+2] "works fine" [puppet] - 10https://gerrit.wikimedia.org/r/1057002 (owner: 10Dzahn) [22:33:46] (03CR) 10Ladsgroup: [C:03+2] Revert "Use expression builder to avoid IDatabase::makeList" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057012 (https://phabricator.wikimedia.org/T371052) (owner: 10Umherirrender) [22:38:57] (03PS1) 10Papaul: Add new frack nodes to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1057023 [22:44:16] (03Merged) 10jenkins-bot: Revert "Use expression builder to avoid IDatabase::makeList" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057012 (https://phabricator.wikimedia.org/T371052) (owner: 10Umherirrender) [22:46:07] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1057012|Revert "Use expression builder to avoid IDatabase::makeList" (T371052)]] [22:46:11] T371052: FlaggedRevs: RecentChanges filters no longer work - https://phabricator.wikimedia.org/T371052 [22:48:30] !log ladsgroup@deploy1002 ladsgroup, umherirrender: Backport for [[gerrit:1057012|Revert "Use expression builder to avoid IDatabase::makeList" (T371052)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:50:37] !log ladsgroup@deploy1002 ladsgroup, umherirrender: Continuing with sync [22:51:53] (03CR) 10Ladsgroup: [C:03+2] Add CSS class to watchlist pending notice [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057016 (https://phabricator.wikimedia.org/T191156) (owner: 10Ladsgroup) [22:56:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1057012|Revert "Use expression builder to avoid IDatabase::makeList" (T371052)]] (duration: 10m 08s) [22:56:20] T371052: FlaggedRevs: RecentChanges filters no longer work - https://phabricator.wikimedia.org/T371052 [22:57:36] (03CR) 10Papaul: [C:03+2] Add new frack nodes to DNS files [dns] - 10https://gerrit.wikimedia.org/r/1057023 (owner: 
10Papaul) [23:01:20] (03Merged) 10jenkins-bot: Add CSS class to watchlist pending notice [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057016 (https://phabricator.wikimedia.org/T191156) (owner: 10Ladsgroup) [23:02:07] 06SRE, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10016660 (10Papaul) [23:02:46] 06SRE, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10016663 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt all your's [23:03:08] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1057016|Add CSS class to watchlist pending notice (T191156)]] [23:03:13] T191156: Convert FlaggedRevisions to Codex - https://phabricator.wikimedia.org/T191156 [23:04:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10016668 (10Papaul) [23:05:26] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1057016|Add CSS class to watchlist pending notice (T191156)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:05:54] 06SRE, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10016675 (10Papaul) [23:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:07:16] 06SRE, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - 
https://phabricator.wikimedia.org/T367800#10016672 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt all your's [23:08:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:09:33] (03PS1) 10Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) [23:09:42] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [23:13:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:14:21] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:14:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger 
endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mobileapps.svc.codfw.wmnet:4102 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:17:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:18:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:19:21] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:42] it's not just the api [23:19:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:19:57] FIRING: [6x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:21:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [23:21:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:21:51] FIRING: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:22:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:22:23] you got to be kidding me [23:23:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:24:21] FIRING: [19x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:57] RESOLVED: [18x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:25:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [23:25:38] FIRING: [19x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:25:50] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:26:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [23:26:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:26:51] RESOLVED: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:27:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.82% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:28:15] RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:29:21] FIRING: [17x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:30:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [23:30:42] RESOLVED: [13x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:30:50] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:31:03] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:31:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:32:15] RESOLVED: [4x] PHPFPMTooBusy: 
Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:38:42] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057029 [23:38:42] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057029 (owner: TrainBranchBot) [23:43:47] FIRING: HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus