[00:04:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063080 (owner: TrainBranchBot)
[00:32:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:55:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[02:59:25] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[03:02:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:25] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10068420 (Andrew) (meanwhile I am draining and rebuilding cloudcephosd1035 because it was built with improper drive assignments.)
[04:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:18:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:49:15] (PS1) Marostegui: installserver: Do not format db2224 [puppet] - https://gerrit.wikimedia.org/r/1063085
[05:51:20] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1063085 (owner: Marostegui)
[05:52:22] (CR) Marostegui: [C:+2] installserver: Do not format db2224 [puppet] - https://gerrit.wikimedia.org/r/1063085 (owner: Marostegui)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240816T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:28:11] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10068492 (ABran-WMF) Open→Resolved host is fully repooled with no issue.
[06:38:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[06:42:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[06:42:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[06:43:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[06:49:17] (PS1) JMeybohm: preseed: Switch reimaged kafka nodes to reuse receipts [puppet] - https://gerrit.wikimedia.org/r/1063086 (https://phabricator.wikimedia.org/T371423)
[06:52:36] (CR) JMeybohm: [C:+2] preseed: Switch reimaged kafka nodes to reuse receipts [puppet] - https://gerrit.wikimedia.org/r/1063086 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[06:56:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db2136 - running 10.11', diff saved to https://phabricator.wikimedia.org/P67345 and previous config saved to /var/cache/conftool/dbconfig/20240816-065606-marostegui.json
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240816T0700)
[07:01:11] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye
[07:01:26] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068522 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bu...
[07:20:22] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage
[07:23:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage
[07:34:15] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068548 (JMeybohm) >>! In T371423#10064707, @JMeybohm wrote: > They all failed because the installer tried to bring up some old mdadm arrays and failed doing so, May...
[07:40:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2007.codfw.wmnet with OS bullseye
[07:41:10] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068556 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye completed: - kafka-...
[07:43:46] !log deploy pfw policy update 1723675086 - T372520
[07:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:31] (CR) EoghanGaffney: "Substantively ok, but two comments below" [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[07:48:54] SRE, SRE-Access-Requests: Requesting access to for  - https://phabricator.wikimedia.org/T372445#10068576 (JMeybohm) >>! In T372445#10064393, @colewhite wrote: > I can't recommend querying OpenSearch directly. Logs-api was made for services needing log...
[07:54:39] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10068585 (Ifeatu_Nnaobi_WMDE) Thanks Katie <3
[07:55:00] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068582 (JMeybohm) Open→Resolved All of these have been reimaged with raid10-6dev now, thanks!
[07:59:52] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068587 (JMeybohm) From T371423#10068548 I did: - Partition the new disks in current state: sgdisk -R /dev/sde /dev/sda; sgdisk -R /dev/sdf /dev/sda; sgdisk -G...
[08:00:21] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye
[08:00:30] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068589 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye
[08:01:33] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye
[08:01:50] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068591 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye
[08:02:38] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1008.eqiad.wmnet with OS bullseye
[08:02:45] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1008.eqiad.wmnet with OS bullseye
[08:03:50] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye
[08:04:03] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068593 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[08:05:01] (CR) Jelto: [C:+1] "looks mostly good but I'm not able to test this locally. get_project_ids_from_trusted_list() is broken since https://gitlab.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/1063015 (owner: EoghanGaffney)
[08:05:08] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[08:05:24] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068594 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[08:20:32] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:20:38] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:28:16] (CR) Jelto: "looks mostly good, to questions in-line" [cookbooks] - https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:29:12] (PS1) Marostegui: db2136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1063151
[08:29:48] (PS6) Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1062721
[08:29:53] (CR) Marostegui: [C:+2] db2136: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1063151 (owner: Marostegui)
[08:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:39:40] (PS1) Marostegui: control-mariadb-10.11-bookworm: New version 10.11.9 [software] - https://gerrit.wikimedia.org/r/1063153 (https://phabricator.wikimedia.org/T372551)
[08:40:46] SRE, SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10068643 (eoghan)
[08:45:04] (PS7) Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882)
[08:47:28] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1006.eqiad.wmnet with OS bullseye
[08:47:41] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068647 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with e...
[08:48:41] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1007.eqiad.wmnet with OS bullseye
[08:48:51] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068648 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed with e...
[08:49:43] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1008.eqiad.wmnet with OS bullseye
[08:49:54] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068649 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1008.eqiad.wmnet with OS bullseye executed with e...
[08:50:30] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:50:46] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1009.eqiad.wmnet with OS bullseye
[08:50:56] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068650 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye executed with e...
[08:52:18] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1010.eqiad.wmnet with OS bullseye
[08:52:24] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10068651 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye executed with e...
[08:52:25] (CR) Jelto: [V:+1] profile::firewall::nftables_throttling: add option for burst packets (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:59:47] (CR) Klausman: [C:+1] admin_ng/LiftWing: add article-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1062709 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[09:00:14] (CR) Marostegui: [C:+2] control-mariadb-10.11-bookworm: New version 10.11.9 [software] - https://gerrit.wikimedia.org/r/1063153 (https://phabricator.wikimedia.org/T372551) (owner: Marostegui)
[09:00:43] (Merged) jenkins-bot: control-mariadb-10.11-bookworm: New version 10.11.9 [software] - https://gerrit.wikimedia.org/r/1063153 (https://phabricator.wikimedia.org/T372551) (owner: Marostegui)
[09:16:23] (CR) JMeybohm: [C:+1] mediawiki: consistently apply stats-global values via symlink [deployment-charts] - https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265) (owner: Scott French)
[09:16:45] (CR) JMeybohm: Prometheus: Add recording rules for istio ingress metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) (owner: JMeybohm)
[09:18:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:57] (PS2) JMeybohm: Prometheus: Add recording rules computing commonly used envoy histograms [puppet] - https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607)
[09:21:05] (CR) JMeybohm: Prometheus: Add recording rules computing commonly used envoy histograms (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: JMeybohm)
[09:22:31] (CR) Ilias Sarantopoulos: [C:+2] admin_ng/LiftWing: add article-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1062709 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[09:23:44] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[09:23:53] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:25:20] (PS1) Klausman: hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - https://gerrit.wikimedia.org/r/1063162
[09:26:07] (Merged) jenkins-bot: admin_ng/LiftWing: add article-models namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1062709 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[09:26:19] (PS1) JMeybohm: preseed: Move all new kafka-main nodes to reuse receipts [puppet] - https://gerrit.wikimedia.org/r/1063163 (https://phabricator.wikimedia.org/T371423)
[09:27:34] (CR) Klausman: "check experimental" [labs/private] - https://gerrit.wikimedia.org/r/1063162 (owner: Klausman)
[09:29:09] (CR) JMeybohm: [C:+2] preseed: Move all new kafka-main nodes to reuse receipts [puppet] - https://gerrit.wikimedia.org/r/1063163 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[09:29:25] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[09:30:01] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[09:33:41] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye
[09:33:46] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye
[09:34:03] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye
[09:34:34] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068779 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye
[09:34:36] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1008.eqiad.wmnet with OS bullseye
[09:34:41] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1008.eqiad.wmnet with OS bullseye
[09:35:03] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye
[09:35:12] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068783 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[09:35:34] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[09:35:43] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068784 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[09:41:34] (PS1) Ayounsi: Netbox: set RQ_DEFAULT_TIMEOUT back to default of 300 [puppet] - https://gerrit.wikimedia.org/r/1063168 (https://phabricator.wikimedia.org/T341843)
[09:43:36] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[09:44:27] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[09:46:45] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync
[09:46:53] (CR) EoghanGaffney: [C:+1] "Thanks for fixing those, lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[09:49:27] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10068918 (Gehel)
[09:50:09] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[09:50:41] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage
[09:51:07] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage
[09:51:16] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10068943 (eoghan) a:eoghan→odimitrijevic Hi @odimitrijevic, could you please look at this as an approver for the `analytics-privatedata-users` group? Tha...
[09:51:26] (PS6) JMeybohm: Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928)
[09:51:26] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage
[09:51:30] (PS6) JMeybohm: Update simple-cfssl to use wmf packages [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928)
[09:51:58] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[09:52:13] (CR) JMeybohm: "The Makefile is kind of a mess indeed - but I'm using it for tests, building as well as running the end-to-end tests, so it seems generall" [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[09:53:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage
[09:53:29] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[09:55:20] SRE, SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10068962 (eoghan) a:eoghan
[09:56:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1008.eqiad.wmnet with reason: host reimage
[09:57:34] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:58:17] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:58:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[10:02:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[10:05:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage
[10:09:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1006.eqiad.wmnet with OS bullseye
[10:10:37] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10068995 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye completed: - kafka-...
[10:14:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1008.eqiad.wmnet with OS bullseye
[10:14:13] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10069004 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1008.eqiad.wmnet with OS bullseye completed: - kafka-...
[10:16:08] (PS5) JMeybohm: Initial commit validating-admission-policies chart [deployment-charts] - https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251)
[10:16:08] (PS3) JMeybohm: Add policy to allow only SYS_PTRACE [deployment-charts] - https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251)
[10:16:08] (PS3) JMeybohm: Add policy to allow GeoIP hostPath volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251)
[10:16:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1010.eqiad.wmnet with OS bullseye
[10:16:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:16:27] (CR) JMeybohm: "> Question: What is the plan if Kyverno changes in the future in a way that doesn't remain translatable into "plain k8s" policies? Stick w" [deployment-charts] - https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: JMeybohm)
[10:16:47] (CR) JMeybohm: "Agreed and renamed." [deployment-charts] - https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: JMeybohm)
[10:16:53] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10069019 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye completed: - kafka-...
[10:19:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye
[10:20:01] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10069027 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed: - kafka-...
[10:21:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1007.eqiad.wmnet with OS bullseye
[10:22:00] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10069031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye completed: - kafka-...
[10:23:55] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10069033 (JMeybohm) Open→Resolved a:VRiley-WMF All of these have been reimaged with raid10-6dev now, thanks!
[10:34:48] (CR) TChin: [C:+1] EventStreamConfig: remove webrequest_frontend. [mediawiki-config] - https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: Gmodena)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240816T0700)
[11:00:05] eoghan, jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240816T1100). Please do the needful.
[11:07:17] (PS1) Ayounsi: Netbox: remove prefer_ipv4 flag [puppet] - https://gerrit.wikimedia.org/r/1063178
[11:07:17] (PS1) Ayounsi: Netbox: disable rq-netbox on secodary node [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843)
[11:07:49] (PS2) Ayounsi: Netbox: disable rq-netbox on secondary node [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843)
[11:07:52] (CR) CI reject: [V:-1] Netbox: disable rq-netbox on secondary node [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[11:08:17] (CR) CI reject: [V:-1] Netbox: disable rq-netbox on secondary node [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[11:09:32] (PS3) Ayounsi: Netbox: disable rq-netbox on secondary node [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843)
[11:11:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:14:18] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[11:21:46] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:32:03] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:40:24] S4 again =(
[11:43:47] yep
[11:44:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 14.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:49:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:49:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 17.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:54:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:15:54] (PS1) Ilias Sarantopoulos: ml-services: deploy articlequality to prod in new ns [deployment-charts] - https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455)
[12:17:46] (PS2) Ilias Sarantopoulos: ml-services: deploy articlequality to prod in new ns [deployment-charts] - https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455)
[12:27:31] (PS1) Ilias Sarantopoulos: hiera/deployment-server: create article-models config/roles [puppet] - https://gerrit.wikimedia.org/r/1063183 (https://phabricator.wikimedia.org/T360455)
[12:32:48] (PS4) Stevemunene: idp-test: Register airflow-test-k8s IDP services [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209)
[12:39:13] (CR) Elukey: "LGTM, but what is the default value? Maybe we could leave this option commented with a reference to this issue, could be valuable when sea" [puppet] - https://gerrit.wikimedia.org/r/1063168 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[12:40:57] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3670/co" [puppet] - https://gerrit.wikimedia.org/r/1063178 (owner: Ayounsi)
[12:42:52] (CR) Elukey: [V:+1 C:+1] Netbox: remove prefer_ipv4 flag [puppet] - https://gerrit.wikimedia.org/r/1063178 (owner: Ayounsi)
[12:46:30] (CR) Ilias Sarantopoulos: [C:+2] ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1062721 (owner: Ilias Sarantopoulos)
[12:47:27] (Merged) jenkins-bot: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1062721 (owner: Ilias Sarantopoulos)
[12:49:02] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:54:11] (03CR) 10Elukey: Netbox: disable rq-netbox on secondary node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [12:58:58] (03PS8) 10Arnaudb: mariadb: cookbook to safely upgrade and reboot santarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) [13:05:46] (03CR) 10Ayounsi: Netbox: disable rq-netbox on secondary node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [13:08:04] (03CR) 10Ayounsi: "Default is 300 and documented in https://netboxlabs.com/docs/netbox/en/stable/configuration/miscellaneous/#rq_default_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1063168 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [13:08:49] (03CR) 10Kevin Bazira: [C:03+1] ml-services: deploy articlequality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [13:10:16] (03CR) 10Klausman: [C:03+1] hiera/deployment-server: create article-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1063183 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [13:11:12] (03CR) 10Klausman: [C:03+1] ml-services: deploy articlequality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [13:12:09] (03CR) 10Klausman: [C:03+2] hiera/deployment-server: create article-models config/roles [puppet] - 10https://gerrit.wikimedia.org/r/1063183 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [13:15:48] (03CR) 10Elukey: [C:03+1] Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:18:18] (03CR) 
10Elukey: Netbox: disable rq-netbox on secondary node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [13:18:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:20] (03CR) 10Ayounsi: [C:03+2] Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [13:26:39] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [13:26:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10069443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1035.eq... 
[13:32:10] (03PS14) 10Btullis: Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [13:32:34] (03CR) 10CI reject: [V:04-1] Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [13:41:12] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [13:41:16] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [13:43:27] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424 [13:43:40] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424 [13:43:56] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:45:09] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [13:48:03] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [13:51:44] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1017.eqiad.wmnet with OS bookworm [13:52:11] (03PS15) 10Btullis: Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [13:52:53] (03PS1) 10Bking: WIP: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) [13:53:48] (03CR) 10CI reject: [V:04-1] WIP: airflow: implement SSO auth [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063195 (https://phabricator.wikimedia.org/T368760) (owner: 
10Bking) [13:53:48] (03CR) 10Bking: [C:03+1] site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [13:53:54] (03PS4) 10Ayounsi: Netbox: disable rq-netbox on secondary node [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) [13:53:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3672/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [13:54:13] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [13:54:22] (03CR) 10Klausman: [C:03+2] site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [13:54:37] (03CR) 10Ayounsi: Netbox: disable rq-netbox on secondary node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [13:55:37] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:56:09] (03PS1) 10MVernon: cephadm: separate templates for zonegroup setup and rgw placement [puppet] - 10https://gerrit.wikimedia.org/r/1063196 (https://phabricator.wikimedia.org/T279621) [13:58:43] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10069519 (10VPuffetMichel) As David's manager, I approve this request. 
[13:58:44] (03PS1) 10Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063198 [13:58:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2036 to codfw - jhancock@cumin2002" [13:58:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2036 to codfw - jhancock@cumin2002" [13:58:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:58:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2035 [13:59:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2035 [13:59:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2036 [13:59:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2036 [14:02:21] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1063196 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:03:25] (03CR) 10Elukey: [C:04-1] "I had a chat with Ben on IRC, I'd temporarily set a -1 due to https://phabricator.wikimedia.org/T370203#10069535." 
[puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [14:03:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye [14:04:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10069542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1035.eqiad.... [14:04:29] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1017.eqiad.wmnet with reason: host reimage [14:07:03] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1017.eqiad.wmnet with reason: host reimage [14:09:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:44] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:14:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_rq-netbox.service on netbox2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [14:16:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2037 [14:16:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti2037 [14:17:02] 06SRE, 
10conftool, 06Traffic: confd causes soft lockup when you are tailing a file with -F and the state is updated - https://phabricator.wikimedia.org/T372646 (10ssingh) 03NEW [14:18:03] (03CR) 10Klausman: [C:03+2] ml-services: payload logging in revscoring-mp-articlequality in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063198 (owner: 10Ilias Sarantopoulos) [14:18:12] (03CR) 10Klausman: [C:03+1] ml-services: payload logging in revscoring-mp-articlequality in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063198 (owner: 10Ilias Sarantopoulos) [14:21:03] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:21:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [14:21:39] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [14:21:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [14:23:36] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10069607 (10Jhancock.wm) [14:24:06] (03PS3) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057177 (https://phabricator.wikimedia.org/T366521) [14:24:21] (03PS4) 10Klausman: hiera/manifest/partman: Add configuration for new ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057177 (https://phabricator.wikimedia.org/T366521) [14:24:23] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy articlequality to prod in new ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [14:25:24] (03Merged) 10jenkins-bot: ml-services: deploy articlequality to prod in new ns [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1063182 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [14:25:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2037 to codfw - jhancock@cumin2002" [14:25:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2037 to codfw - jhancock@cumin2002" [14:25:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:54] (03PS1) 10Hnowlan: changeprop-jobqueue: reduce refreshLinks concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) [14:25:59] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: payload logging in revscoring-mp-articlequality in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063198 (owner: 10Ilias Sarantopoulos) [14:26:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2037 [14:26:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2037 [14:27:08] 06SRE, 10conftool, 06Traffic: confd causes soft lockup when you are tailing a file with -F and the state is updated - https://phabricator.wikimedia.org/T372646#10069610 (10ssingh) ` [Thu Aug 15 14:36:19 2024] RIP: 0010:__fsnotify_update_child_dentry_flags+0xc6/0x110 ` From the same host, `slabtop` output:... [14:27:32] (03Merged) 10jenkins-bot: ml-services: payload logging in revscoring-mp-articlequality in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063198 (owner: 10Ilias Sarantopoulos) [14:27:46] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[14:28:22] (03CR) 10Elukey: [C:03+1] Remove custom_script_proxy.py and getstats.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060079 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [14:30:55] (03CR) 10Kamila Součková: changeprop-jobqueue: reduce refreshLinks concurrency (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [14:31:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:31:58] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: sre.hosts.reimage failing due to mkfs.ext4 taking to long - https://phabricator.wikimedia.org/T372648 (10JMeybohm) 03NEW [14:32:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [14:34:24] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [14:34:24] (03PS2) 10Hnowlan: changeprop-jobqueue: reduce refreshLinks concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) [14:34:41] (03CR) 10Hnowlan: changeprop-jobqueue: reduce refreshLinks concurrency (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [14:34:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2038 to codfw - jhancock@cumin2002" [14:34:42] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[14:34:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2038 to codfw - jhancock@cumin2002" [14:34:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:52] (03CR) 10Elukey: [C:03+1] check_netbox_report.py: reports -> scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [14:34:55] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: remove webrequest_frontend. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena) [14:35:14] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2038 [14:35:25] (03CR) 10Elukey: [C:03+1] Update wheels to pickup new pynetbox version [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1062356 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [14:36:18] (03CR) 10Elukey: [C:03+1] Netbox: set RQ_DEFAULT_TIMEOUT back to default of 300 [puppet] - 10https://gerrit.wikimedia.org/r/1063168 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [14:36:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2038 [14:36:59] (03CR) 10Kamila Součková: [C:03+1] changeprop-jobqueue: reduce refreshLinks concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [14:37:32] (03CR) 10Elukey: [C:03+1] Netbox: disable rq-netbox on secondary node [puppet] - 10https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: 10Ayounsi) [14:39:26] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:11] (03CR) 10Elukey: [C:03+1] "It looks good from me on the Python side, but I have to say that my understanding of why this is needed is limited (due to my ignorance). " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [14:42:16] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:44:55] (03CR) 10Elukey: [C:03+1] Network report, remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062665 (owner: 10Ayounsi) [14:45:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2039 to codfw - jhancock@cumin2002" [14:45:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2039 to codfw - jhancock@cumin2002" [14:45:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2039 [14:45:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2039 [14:47:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:48:38] (03CR) 10Elukey: [C:03+1] hiera/manifest/partman: Add configuration for new ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057177 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [14:48:43] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1017.eqiad.wmnet with OS bookworm [14:48:46] 06SRE, 10SRE-Access-Requests: Requesting access to ldap/wmf for divec - https://phabricator.wikimedia.org/T372369#10069679 (10eoghan) 05Open→03Resolved I've added the user to the wmf group. 
@dchan, I'm going to close this now, let me know if anything seems missing! [14:49:35] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [14:51:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2040 to codfw - jhancock@cumin2002" [14:51:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2040 to codfw - jhancock@cumin2002" [14:51:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:53] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [14:52:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2040 [14:52:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2040 [14:53:58] (03PS1) 10Ilias Sarantopoulos: httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) [14:54:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:55:20] (03PS1) 10Scott French: Update confd template error state globs [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) [14:58:23] (03CR) 10Klausman: [C:03+2] hiera/manifest/partman: Add configuration for new ML hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057177 (https://phabricator.wikimedia.org/T366521) (owner: 10Klausman) [14:58:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2041 to codfw - jhancock@cumin2002" [14:58:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2041 to codfw - jhancock@cumin2002" [14:58:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2041 [14:59:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2041 [14:59:24] (03CR) 10Kamila Součková: [C:03+1] Add policy to allow only SYS_PTRACE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [14:59:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:01] (03CR) 10Hnowlan: [C:03+2] changeprop-jobqueue: reduce refreshLinks concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [15:03:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2042 to codfw - jhancock@cumin2002" [15:03:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2042 to codfw - jhancock@cumin2002" [15:03:42] !log jhancock@cumin2002 END (PASS) 
- Cookbook sre.dns.netbox (exit_code=0) [15:03:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2042 [15:03:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2042 [15:04:15] (03Merged) 10jenkins-bot: changeprop-jobqueue: reduce refreshLinks concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063205 (https://phabricator.wikimedia.org/T370304) (owner: 10Hnowlan) [15:04:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:05:39] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:06:17] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[15:07:17] (03PS1) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T372470) [15:07:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2043 to codfw - jhancock@cumin2002" [15:07:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2043 to codfw - jhancock@cumin2002" [15:07:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2043 [15:08:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2043 [15:10:02] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:10:44] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:11:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:11:59] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:12:25] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:13:29] (03PS2) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T372470) [15:24:03] (03CR) 10CI reject: [V:04-1] poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 
(https://phabricator.wikimedia.org/T372470) (owner: 10Hnowlan) [15:26:39] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:30:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-sd2004 to codfw - jhancock@cumin2002" [15:30:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-sd2004 to codfw - jhancock@cumin2002" [15:30:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-sd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:30:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-sd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:30:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-sd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:33:30] (03CR) 10Elukey: mariadb: cookbook to safely upgrade and reboot santarium hosts (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [15:36:58] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10069835 (10Jhancock.wm) [15:41:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:42:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:45:43] (03CR) 10Eevans: [C:03+1] "Eyeballed 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1063196 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:48:18] (03PS1) 10Ottomata: mediawiki.org/beacon/event/index.php 
- use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [15:50:55] (03CR) 10Klausman: [C:03+1] httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [15:51:12] (03PS1) 10Scott French: confd: fix error state file name in check [puppet] - 10https://gerrit.wikimedia.org/r/1063223 (https://phabricator.wikimedia.org/T363924) [15:51:41] (03CR) 10Klausman: [C:03+1] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - 10https://gerrit.wikimedia.org/r/1063162 (owner: 10Klausman) [15:51:45] (03CR) 10Klausman: [C:03+2] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - 10https://gerrit.wikimedia.org/r/1063162 (owner: 10Klausman) [15:51:47] (03CR) 10Klausman: [V:03+2 C:03+2] hiera/k8s: Update ML Swift secrets sections for consistency [labs/private] - 10https://gerrit.wikimedia.org/r/1063162 (owner: 10Klausman) [15:54:11] (03PS1) 10Ottomata: Rewrite mediawiki.org/beacon/event to /beacon/event/index.php [puppet] - 10https://gerrit.wikimedia.org/r/1063224 (https://phabricator.wikimedia.org/T353817) [15:54:41] (03PS2) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [15:55:08] (03Abandoned) 10Ottomata: Remove docroot/mediawiki.org/beacon/event/index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055443 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:55:15] (03PS3) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T372470) [15:55:59] (03CR) 10CI reject: [V:04-1] poolcounter: introduce allowlist to skip rate 
limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T372470) (owner: 10Hnowlan) [15:56:22] (03PS1) 10Ilias Sarantopoulos: APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) [15:58:21] (03PS4) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T372470) [16:08:01] (03CR) 10Klausman: [C:03+1] APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [16:17:15] (03PS1) 10Máté Szabó: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063227 (https://phabricator.wikimedia.org/T370502) [16:27:57] (03PS1) 10Hnowlan: thumbor: add allowlist to thumbor to address internal rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063228 (https://phabricator.wikimedia.org/T370304) [16:30:32] (03PS2) 10Hnowlan: thumbor: add allowlist to thumbor to address internal rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063228 (https://phabricator.wikimedia.org/T370304) [16:30:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10070003 (10elukey) Currently blocked by T372485 [16:40:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [16:46:43] (03PS1) 10SBassett: Add new image tag for miscweb:security.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063230 (https://phabricator.wikimedia.org/T372570) [17:06:29] (03CR) 10Mmartorana: [C:03+2] Add new image tag for 
miscweb:security.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063230 (https://phabricator.wikimedia.org/T372570) (owner: 10SBassett) [17:06:50] (03CR) 10Mmartorana: [V:03+2 C:03+2] Add new image tag for miscweb:security.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063230 (https://phabricator.wikimedia.org/T372570) (owner: 10SBassett) [17:18:11] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:18:27] !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:18:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:08] Anybody doing miscweb work right now? I’m seeing tag changes for 9 more sites in various config files than I was expecting… [17:21:27] I was just trying to deploy a very simple static change to security.wikimedia.org [17:28:24] (03PS1) 10Andrea Denisse: alert: Ensure the alert[12]001 hosts use the spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/1063231 (https://phabricator.wikimedia.org/T372607) [17:33:59] (03PS1) 10Andrea Denisse: alert: Remove the alert[12]001 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1063233 (https://phabricator.wikimedia.org/T372607) [17:57:23] (03PS2) 10Andrea Denisse: alert: Ensure alert1002 is the active alert host [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) [18:06:01] (03PS1) 10Andrea Denisse: alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) [18:09:15] (03PS2) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert1002 [dns] - 10https://gerrit.wikimedia.org/r/1063078
(https://phabricator.wikimedia.org/T372418) [18:14:50] (03PS1) 10Andrea Denisse: alert: Update alertmanager tests hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) [18:15:44] (03CR) 10Scott French: "Thanks in advance for the review, Sukhbir! My main question for you is whether you have a preference for using the *full* name of the erro" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:26:39] (03CR) 10Ssingh: "Thanks Scott." [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:31:52] (03PS1) 10Bking: stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) [18:32:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [18:37:03] (03PS2) 10Bking: stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) [18:37:51] (03CR) 10CI reject: [V:04-1] stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [18:39:24] (03PS3) 10Bking: stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) [18:41:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [18:43:38] 06SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for jhathaway - https://phabricator.wikimedia.org/T372663 (10jhathaway) 03NEW [18:46:29] 06SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for jhathaway - 
https://phabricator.wikimedia.org/T372663#10070348 (10thcipriani) Approved from the keeper of jenkins/`ciadmin` group side. Useful for managing the puppet compiler instances. [18:49:14] (03PS4) 10Bking: stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) [18:50:06] (03PS2) 10Scott French: Update confd template error state globs [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) [18:51:01] (03CR) 10Scott French: "Thanks, Sukhbir! Great, in that case, let's go for the full path prefix." [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:52:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [18:54:14] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#10070356 (10jhathaway) [18:55:09] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664 (10jhathaway) 03NEW [18:55:40] (03CR) 10Ssingh: [C:03+1] Update confd template error state globs [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:57:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666 (10jhathaway) 03NEW [18:58:09] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667 (10jhathaway) 03NEW [18:58:46] (03PS1) 10JHathaway: mtail: remove undefined var [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) [19:04:25] FIRING: SystemdUnitFailed: generate_os_reports.service on 
puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:26:23] (03CR) 10Bking: "Puppet 5 PCC is the one that failed, but Puppet 7 PCC (the only one we care about) is fine." [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [19:29:27] (03CR) 10Ryan Kemper: [C:03+1] stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [19:30:11] (03CR) 10Bking: [C:03+2] stat (dse) hosts: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1063237 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [19:33:30] (03PS1) 10JHathaway: pcc: news bullseye hosts puppet 5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1063244 (https://phabricator.wikimedia.org/T367547) [19:34:13] (03CR) 10JHathaway: [C:03+2] pcc: news bullseye hosts puppet 5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1063244 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [19:42:05] (03CR) 10Scott French: [C:03+2] Update confd template error state globs [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [19:43:06] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:55:48] (03Merged) 10jenkins-bot: Update confd template error state globs [cookbooks] - 10https://gerrit.wikimedia.org/r/1063215 
(https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [19:59:21] !log eevans@deploy1003 Started deploy [restbase/deploy@f696b76] (beta): (no justification provided) [19:59:54] !log eevans@deploy1003 Finished deploy [restbase/deploy@f696b76] (beta): (no justification provided) (duration: 00m 33s) [20:01:24] !log eevans@deploy1003 Started deploy [restbase/deploy@f696b76] (beta): (no justification provided) [20:01:45] !log eevans@deploy1003 deploy aborted: (no justification provided) (duration: 00m 20s) [20:04:08] !log eevans@deploy1003 Started deploy [restbase/deploy@f696b76] (beta): (no justification provided) [20:04:19] !log eevans@deploy1003 deploy aborted: (no justification provided) (duration: 00m 11s) [20:10:48] !log eevans@deploy1003 Started deploy [restbase/deploy@f696b76] (beta): deploy to beta [20:11:16] !log eevans@deploy1003 deploy aborted: deploy to beta (duration: 00m 28s) [20:11:21] !log eevans@deploy1003 Started deploy [restbase/deploy@f696b76] (beta): deploy to beta [20:12:26] !log eevans@deploy1003 Finished deploy [restbase/deploy@f696b76] (beta): deploy to beta (duration: 01m 05s) [20:14:48] !log eevans@deploy1003 Started deploy [cassandra/logstash-logback-encoder@42653e6] (beta): Beta deploy [20:15:20] !log eevans@deploy1003 Finished deploy [cassandra/logstash-logback-encoder@42653e6] (beta): Beta deploy (duration: 00m 31s) [20:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:21:09] (03PS1) 10Dwisehaupt: icinga: Add payments2004 and payments2005 to service [puppet] - 10https://gerrit.wikimedia.org/r/1063247 (https://phabricator.wikimedia.org/T369942) [20:22:09] (03CR) 10Dwisehaupt: "Hosts are up and fully functional. 
This is ready to roll out as soon as it's approved." [puppet] - 10https://gerrit.wikimedia.org/r/1063247 (https://phabricator.wikimedia.org/T369942) (owner: 10Dwisehaupt) [20:23:49] (03PS3) 10Dwisehaupt: Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) [20:25:05] (03CR) 10Dwisehaupt: [C:03+2] Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) (owner: 10Dwisehaupt) [20:26:18] !log eevans@deploy1003 Started deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): Test [20:26:51] !log eevans@deploy1003 Finished deploy [cassandra/logstash-logback-encoder@42653e6] (aqs): Test (duration: 00m 32s) [20:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:44:28] (03PS1) 10JHathaway: pcc: new bullseye hosts puppet 5 hosts (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/1063248 (https://phabricator.wikimedia.org/T367547) [20:45:24] (03CR) 10JHathaway: [C:03+2] pcc: new bullseye hosts puppet 5 hosts (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/1063248 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [20:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web 
at eqiad: 22.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:18:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:30:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [22:49:28] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [23:04:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057967 (owner: 10JHathaway) [23:04:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:41] FIRING: [2x] 
RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:27:21] !log eevans@deploy1003 Started deploy [cassandra/logstash-logback-encoder@42653e6] (beta): Test deploy [23:27:46] !log eevans@deploy1003 deploy aborted: Test deploy (duration: 00m 25s) [23:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1063258 [23:38:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1063258 (owner: 10TrainBranchBot)