[00:00:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:00:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2003.codfw.wmnet with OS bookworm [00:00:18] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2003.codfw.wmnet with OS bookworm completed: - maps-test2... [00:03:22] !log removing 1 file for legal compliance [00:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2005.codfw.wmnet with reason: host reimage [00:10:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:14:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:14:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2004.codfw.wmnet with OS bookworm [00:14:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2005.codfw.wmnet with reason: host reimage [00:14:28] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334219 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2004.codfw.wmnet with OS bookworm completed: - maps-test2... 
[00:14:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2006.codfw.wmnet with OS bookworm [00:15:06] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334220 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2006.codfw.wmnet with OS bookworm [00:17:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:18:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [00:18:37] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334248 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye executed with errors:... 
[00:18:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [00:18:59] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334249 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [00:34:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092358 [00:38:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092358 (owner: 10TrainBranchBot) [00:38:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2006.codfw.wmnet with reason: host reimage [00:39:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [00:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10334273 (10phaultfinder) [00:39:42] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334274 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1016.eqiad.wmnet with OS bullseye executed with errors:... 
[00:41:43] !log removing 1 file for legal compliance [00:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2006.codfw.wmnet with reason: host reimage [00:44:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [00:44:19] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334275 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [00:51:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:51:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2005.codfw.wmnet with OS bookworm [00:51:16] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2005.codfw.wmnet with OS bookworm completed: - maps-test2... 
[00:51:47] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334291 (10Papaul) p:05Triage→03Medium a:03Papaul [00:53:52] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bookworm [00:54:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm ex... [00:54:46] !log removing 1 file for legal compliance [00:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:23] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [00:58:27] (03PS1) 10BCornwall: ncmonitor: Add "main" WMF domains to ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) [01:00:33] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4549/co" [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [01:02:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [01:03:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:05:22] (03PS1) 10BCornwall: ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1092362 [01:06:24] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4550/co" [puppet] - 10https://gerrit.wikimedia.org/r/1092362 (owner: 10BCornwall) [01:06:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [01:06:31] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye executed with errors:... [01:06:32] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye executed with errors:... [01:07:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [01:07:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [01:07:26] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334305 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [01:07:28] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334306 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [01:07:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [01:08:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092363 [01:08:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092363 (owner: 10TrainBranchBot) [01:11:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1092358 (owner: 10TrainBranchBot) [01:11:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:12:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:12:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2006.codfw.wmnet with OS bookworm [01:12:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [01:12:50] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2006.codfw.wmnet with OS bookworm completed: - maps-test2... 
[01:13:29] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334313 (10Papaul) 05Open→03Resolved @MoritzKlenk this is done. I am closing this task when you done with the testing you can just open a decommission task and ref... [01:15:59] (03PS1) 10Ammarpad: Fix Letter of intent field label [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) [01:17:28] (03PS2) 10Ammarpad: affcom contactapges: Fix Letter of intent and logo field labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) [01:17:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [01:18:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) (owner: 10Ammarpad) [01:21:09] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [01:24:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [01:24:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:37:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:38:35] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1092363 (owner: 10TrainBranchBot) [01:40:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:47:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:50:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:50:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [01:50:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:50:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [01:51:02] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye completed: - kafka-jum... [01:51:03] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334397 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1016.eqiad.wmnet with OS bullseye completed: - kafka-jum... 
[01:54:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [01:54:17] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye executed with errors:... [01:54:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [01:54:47] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [01:57:11] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/567b4e54a15841556e4b4826ae097807e67c81d1578d6e9f7bdfff723744cd26/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:05:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:08:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.4 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092389 (https://phabricator.wikimedia.org/T375663) [02:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.4 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092389 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [02:08:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
kafka-jumbo1018.eqiad.wmnet with reason: host reimage [02:12:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:12:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage [02:17:12] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:30:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:30:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:30:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [02:31:08] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye completed: - kafka-jum... 
[02:33:09] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334474 (10Jclark-ctr) [02:33:25] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334475 (10Jclark-ctr) 05Open→03Resolved [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:01] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.4 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092389 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0300) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:30:55] (03PS1) 10Andrew Bogott: cloudvirt1062 -> ovs [puppet] - 10https://gerrit.wikimedia.org/r/1092412 (https://phabricator.wikimedia.org/T364457) [03:31:32] (03CR) 10CI reject: [V:04-1] cloudvirt1062 -> ovs [puppet] - 10https://gerrit.wikimedia.org/r/1092412 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [03:33:20] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bookworm [03:33:56] (03PS2) 10Andrew Bogott: cloudvirt1062 -> ovs [puppet] - 10https://gerrit.wikimedia.org/r/1092412 (https://phabricator.wikimedia.org/T364457) [03:34:42] (03CR) 10Andrew Bogott: 
[C:03+2] cloudvirt1062 -> ovs [puppet] - 10https://gerrit.wikimedia.org/r/1092412 (https://phabricator.wikimedia.org/T364457) (owner: 10Andrew Bogott) [03:48:14] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [03:51:31] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0400) [04:01:16] (03PS1) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [04:01:56] (03CR) 10CI reject: [V:04-1] Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:02:05] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092417 (https://phabricator.wikimedia.org/T375663) [04:02:06] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092417 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [04:02:52] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092417 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [04:03:16] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10334553 (10Krd) I'd say this or any such problem should not occur again, as we definitely lost tickets, and the actual imp... 
[04:03:19] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.4 refs T375663 [04:03:22] T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663 [04:03:37] (03PS2) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [04:04:14] (03CR) 10CI reject: [V:04-1] Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:05:21] (03PS3) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [04:05:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:13:54] (03PS4) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [04:14:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:16:03] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS bookworm [04:17:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [04:18:40] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:23:49] (03PS5) 10Andrew Bogott: Remove support for neutron linuxbridge 
driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [04:23:49] (03PS1) 10Andrew Bogott: Neutron: remove linuxbridge from mechanism_drivers [puppet] - 10https://gerrit.wikimedia.org/r/1092425 (https://phabricator.wikimedia.org/T326373) [04:25:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [04:26:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:28:40] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:33:18] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092425 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [04:42:28] (03Abandoned) 10Andrew Bogott: trivial/test patch [puppet] - 10https://gerrit.wikimedia.org/r/1056233 (owner: 10Andrew Bogott) [04:52:20] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.4 refs T375663 (duration: 49m 01s) [04:52:23] T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0500) [05:01:20] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.1 (duration: 01m 18s) [05:11:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:17:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [05:52:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) (owner: 10KartikMistry) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0700) [07:00:05] marostegui, Amir1, and arnaudb: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0700). 
[07:00:23] (03CR) 10Brouberol: [C:03+1] dse-k8s: add ingress config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [07:01:22] (03CR) 10Brouberol: [C:03+1] "10GB of memory seems excessive for a Flask app, but we have memory in spades, so let's not be tightfisted :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092311 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:16:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:21:45] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete package::builder role [puppet] - 10https://gerrit.wikimedia.org/r/1091729 (owner: 10Muehlenhoff) [07:22:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10334705 
(10MoritzMuehlenhoff) [07:23:15] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:24:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet [07:24:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10334706 (10ops-monitoring-bot) Draining ganeti1016.eqiad.wmnet of running VMs [07:31:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet [07:32:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet [07:32:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10334720 (10ops-monitoring-bot) Draining ganeti1016.eqiad.wmnet of running VMs [07:40:14] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for remaining ServiceOps roles [puppet] - 10https://gerrit.wikimedia.org/r/1091599 (owner: 10Muehlenhoff) [07:40:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: T374215 - hw maintenance [07:41:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: T374215 - hw maintenance [07:45:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: sad [07:45:32] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: sad [07:49:06] (03PS3) 10Ammarpad: affcom contactpages: Fix Letter of intent and logo field labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) [07:56:17] (03PS1) 10Michael Große: fix tours by finishing partial variable rename [extensions/GuidedTour] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092741 (https://phabricator.wikimedia.org/T380071) [07:57:18] (03PS1) 10Abijeet Patro: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092740 (https://phabricator.wikimedia.org/T372386) [07:57:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GuidedTour] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092741 (https://phabricator.wikimedia.org/T380071) (owner: 10Michael Große) [07:57:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092740 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [07:58:09] (03PS2) 10Abijeet Patro: Enable message group subscription feature for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092740 (https://phabricator.wikimedia.org/T372386) [07:59:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:00:05] Amir1, Urbanecm, and awight: 
It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0800). [08:00:05] wangombe_g, pfischer, urbanecm, Ammar, and MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:29] i can deploy today [08:00:52] wangombe_g: pfischer: Ammar: MichaelG_WMF: around? [08:01:35] (03CR) 10Urbanecm: [C:03+2] [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [08:01:44] urbanecm: yes [08:01:48] wangombe_g: around? [08:01:57] yes [08:02:15] morning all [08:02:18] yes [08:02:24] (03Merged) 10jenkins-bot: [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [08:02:29] (03PS2) 10Wangombe: Translate Event Logging: Enable using $wgTranslateEnableEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) [08:02:35] (03PS3) 10Urbanecm: CirrusSearch: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:02:37] (03CR) 10Urbanecm: [C:03+2] Translate Event Logging: Enable using $wgTranslateEnableEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [08:02:41] (03CR) 10Urbanecm: [C:03+2] CirrusSearch: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:03:16] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [08:03:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:03:29] (03Merged) 10jenkins-bot: Translate Event Logging: Enable using $wgTranslateEnableEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [08:03:32] (03Merged) 10jenkins-bot: CirrusSearch: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [08:04:16] hi MichaelG_WMF! [08:04:20] * MichaelG_WMF is here too :) [08:04:28] (03CR) 10Urbanecm: [C:03+2] fix tours by finishing partial variable rename [extensions/GuidedTour] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092741 (https://phabricator.wikimedia.org/T380071) (owner: 10Michael Große) [08:04:34] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1082726|Translate Event Logging: Enable using $wgTranslateEnableEventLogging (T364460)]], [[gerrit:1092258|CirrusSearch: enable offloading weighted tags via EventBus (T378983 T377150)]], [[gerrit:1091197|[GrowthExperiments] Add virtual domain config (T354939)]] [08:04:41] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [08:04:42] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:04:42] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [08:04:43] T354939: Migrate GrowthExperiments to virtual domains - 
https://phabricator.wikimedia.org/T354939 [08:06:21] (03CR) 10Muehlenhoff: [C:03+2] Add ferm macro/nftables set for aux pods like for other k8s installations [puppet] - 10https://gerrit.wikimedia.org/r/1092283 (owner: 10Muehlenhoff) [08:10:51] (03PS1) 10Jon Harald Søby: Add nowiki to commonsuploads dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092743 (https://phabricator.wikimedia.org/T380252) [08:12:13] !log urbanecm@deploy2002 urbanecm, wangombe, pfischer: Backport for [[gerrit:1082726|Translate Event Logging: Enable using $wgTranslateEnableEventLogging (T364460)]], [[gerrit:1092258|CirrusSearch: enable offloading weighted tags via EventBus (T378983 T377150)]], [[gerrit:1091197|[GrowthExperiments] Add virtual domain config (T354939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:12:20] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [08:12:21] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:12:21] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [08:12:21] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [08:12:44] wangombe_g: can you test? [08:12:47] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add mc-gp200[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092282 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [08:13:02] pfischer: I assume based on past attempts we can't really do much at mwdebug [08:13:13] +we are fairly confident from testwiki [08:13:17] (03PS2) 10Effie Mouzeli: memcached: add mc-gp200[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092282 (https://phabricator.wikimedia.org/T377033) [08:13:21] Hi folks! Am I too late to add a very small patch to the current window? 
This one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1092743 [08:13:29] urbanecm: right [08:13:41] Jhs: hey! Was wondering if you want to deploy it too. Can you add it to the calendar? [08:14:04] urbanecm, sure [08:14:54] (03CR) 10Effie Mouzeli: [C:03+1] memcached: add mc-gp200[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092282 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [08:16:12] Currently testing [08:19:00] wangombe_g: how is the testing going? [08:20:27] (03PS1) 10Slyngshede: Upgraded Bitu LDAP library. [dns] - 10https://gerrit.wikimedia.org/r/1092745 [08:21:34] Testing on Special:Translate is complete which was the primary place my change was affecting [08:22:05] wangombe_g: and how is it looking? [08:22:16] We're good. [08:22:22] !log urbanecm@deploy2002 urbanecm, wangombe, pfischer: Continuing with sync [08:22:33] thanks, proceeding [08:23:06] (03PS2) 10Jon Harald Søby: Add nowiki to commonsuploads dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092743 (https://phabricator.wikimedia.org/T380252) [08:23:08] (03CR) 10Urbanecm: [C:03+2] Add nowiki to commonsuploads dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092743 (https://phabricator.wikimedia.org/T380252) (owner: 10Jon Harald Søby) [08:23:52] (03Merged) 10jenkins-bot: Add nowiki to commonsuploads dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092743 (https://phabricator.wikimedia.org/T380252) (owner: 10Jon Harald Søby) [08:23:56] (03Merged) 10jenkins-bot: fix tours by finishing partial variable rename [extensions/GuidedTour] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092741 (https://phabricator.wikimedia.org/T380071) (owner: 10Michael Große) [08:24:17] (03PS4) 10Ammarpad: affcom contactpages: Fix Letter of intent and logo field labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) [08:24:20] (03CR) 10Urbanecm: [C:03+2] affcom 
contactpages: Fix Letter of intent and logo field labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) (owner: 10Ammarpad) [08:25:14] (03Merged) 10jenkins-bot: affcom contactpages: Fix Letter of intent and logo field labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092364 (https://phabricator.wikimedia.org/T375392) (owner: 10Ammarpad) [08:25:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [08:29:16] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082726|Translate Event Logging: Enable using $wgTranslateEnableEventLogging (T364460)]], [[gerrit:1092258|CirrusSearch: enable offloading weighted tags via EventBus (T378983 T377150)]], [[gerrit:1091197|[GrowthExperiments] Add virtual domain config (T354939)]] (duration: 24m 42s) [08:29:23] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [08:29:23] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [08:29:24] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [08:29:24] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [08:30:20] pfischer: wangombe_g: should be live [08:31:11] urbanecm: thanks! [08:31:28] @pfischer: also, there is now a CirrusSearch alert few lines above? 
That feels...concerning [08:34:04] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092741|fix tours by finishing partial variable rename (T380071)]], [[gerrit:1092364|affcom contactpages: Fix Letter of intent and logo field labels (T375392)]], [[gerrit:1092743|Add nowiki to commonsuploads dblist (T380252)]] [08:34:11] T380071: [wmf.3] GuidedTours are broken (originally: Homepage: No intro tours for new accounts) - https://phabricator.wikimedia.org/T380071 [08:34:11] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [08:34:17] T380252: Add nowiki to commonsuploads dblist - https://phabricator.wikimedia.org/T380252 [08:35:52] urbanecm thanks [08:39:41] !log urbanecm@deploy2002 ammarpad, migr, jhsoby, urbanecm: Backport for [[gerrit:1092741|fix tours by finishing partial variable rename (T380071)]], [[gerrit:1092364|affcom contactpages: Fix Letter of intent and logo field labels (T375392)]], [[gerrit:1092743|Add nowiki to commonsuploads dblist (T380252)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:39:46] T380071: [wmf.3] GuidedTours are broken (originally: Homepage: No intro tours for new accounts) - https://phabricator.wikimedia.org/T380071 [08:39:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1092745 (owner: 10Slyngshede) [08:39:47] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [08:39:47] T380252: Add nowiki to commonsuploads dblist - https://phabricator.wikimedia.org/T380252 [08:39:48] MichaelG_WMF: Ammar: Jhs: can you test? 
[08:39:56] * MichaelG_WMF is testing now [08:40:43] urbanecm, it's working half way… the "upload file" links now point to Commons (with safemode=1), but I still have the option to upload files in Special:Upload (from an unprivileged account) [08:40:54] Maybe I misunderstood the effects of the commonsuploads dblist [08:41:01] Jhs: that...shouldn't really happen [08:41:42] urbanecm: mine works as expected and fixes the tours on enwiki (and presumably everywhere else, too) 👍 [08:41:49] MichaelG_WMF: great news! [08:41:53] urbanecm I think it's not possible to test mine without syncing with someone who has access to the private (to receive the test email). But RamzyM already confirmed on phab task that they did receive the field value, only the label was incorrect. I think this will fix that [08:42:08] Jhs: aha, nowiki is messing with the config more than it should [08:42:17] * to the private list (correction) [08:42:22] (03PS1) 10Muehlenhoff: os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) [08:42:27] (03CR) 10Slyngshede: [C:03+2] Upgraded Bitu LDAP library. [dns] - 10https://gerrit.wikimedia.org/r/1092745 (owner: 10Slyngshede) [08:42:35] Jhs: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/core-Permissions.php#L2010 is the problem [08:42:59] (03CR) 10CI reject: [V:04-1] os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) (owner: 10Muehlenhoff) [08:43:00] at the top of the file (line 18), the commonsuploads is used to remove upload permissions for wikis that are in it [08:43:25] but, nowiki is configured as `nowiki` (rather `+nowiki`), and the missing `+` means "drop config from dblists and replace with what is here" [08:43:27] should it have a +? 
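Note on the `nowiki` vs `+nowiki` exchange: a minimal sketch, in Python, of the override rule urbanecm describes, where a plain wiki key replaces whatever the dblists supplied and a `+`-prefixed key is merged with it. This is a simplified model and not the actual mediawiki-config or MediaWiki code: the function `resolve`, the flat dict layout, and the `patrol` right are hypothetical names chosen for illustration; only `nowiki`, `+nowiki`, and `commonsuploads` come from the conversation.

```python
def resolve(setting, wiki, dblists):
    """Resolve one dict-valued setting for `wiki` (simplified, hypothetical model).

    setting: maps a key ("default", a dblist name, a wiki name, or
             "+<wiki name>") to a dict of values.
    dblists: names of the dblists that `wiki` is a member of.
    """
    # Start from the default and layer every matching dblist on top.
    inherited = dict(setting.get("default", {}))
    for name in dblists:
        inherited.update(setting.get(name, {}))

    if "+" + wiki in setting:
        # "+wiki": keep the inherited values and merge the wiki-specific ones in.
        merged = dict(inherited)
        merged.update(setting["+" + wiki])
        return merged
    if wiki in setting:
        # Plain "wiki": the wiki-specific value replaces everything inherited.
        return dict(setting[wiki])
    return inherited


# Without the "+", the commonsuploads override is dropped for nowiki.
broken = {"commonsuploads": {"upload": False}, "nowiki": {"patrol": True}}
# With the "+", the dblist override and the nowiki-specific value are combined.
fixed = {"commonsuploads": {"upload": False}, "+nowiki": {"patrol": True}}

print(resolve(broken, "nowiki", ["commonsuploads"]))
print(resolve(fixed, "nowiki", ["commonsuploads"]))
```

In this model the first call loses the `upload` override, which matches the symptom Jhs reported (Special:Upload still offered on nowiki), and the second call keeps it.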
[08:43:37] ^_^ [08:43:40] while with a +, it would merge the commonsuploads and the nowiki-specific config [08:43:44] Jhs: yes :). can you upload a patch? [08:43:48] sure [08:43:49] !log urbanecm@deploy2002 ammarpad, migr, jhsoby, urbanecm: Continuing with sync [08:43:50] a new one? [08:43:56] yep, i already merged the previous [08:44:14] i'll sync in the meantime, as it is not breaking anything, and we'll do the second half in a minute [08:45:08] Ammar: well, label can be tested via Special:Contact. but anyway, doesn't hurt to sync this particular one. [08:45:25] (03PS1) 10Jon Harald Søby: Add + to nowiki in core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092752 [08:45:38] urbanecm, ^ [08:45:48] (03PS2) 10Urbanecm: Add + to nowiki in core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092752 (https://phabricator.wikimedia.org/T380252) (owner: 10Jon Harald Søby) [08:45:50] thanks! [08:45:57] (03CR) 10Urbanecm: [C:03+2] Add + to nowiki in core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092752 (https://phabricator.wikimedia.org/T380252) (owner: 10Jon Harald Søby) [08:45:59] thanks the same! [08:46:04] i just attached it to T380252 [08:46:04] T380252: Add nowiki to commonsuploads dblist - https://phabricator.wikimedia.org/T380252 [08:46:43] (03Merged) 10jenkins-bot: Add + to nowiki in core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092752 (https://phabricator.wikimedia.org/T380252) (owner: 10Jon Harald Søby) [08:46:45] (03PS2) 10Muehlenhoff: os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) [08:46:56] I have to run to buy some supplies for all of the chapter CEOs who are in our office today, so I can't test that until in around ~10 minutes, sorry [08:47:05] urbanecm: That's not the label being shown on Special:Contact/affcomusergroup. 
If you're seeing it, then something is wrong. [08:47:22] Ammar: ah, email label. okay, then ignore me :) [08:47:47] Jhs: say hi to Klára! ;) thanks for the info, no worries. [08:48:34] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092741|fix tours by finishing partial variable rename (T380071)]], [[gerrit:1092364|affcom contactpages: Fix Letter of intent and logo field labels (T375392)]], [[gerrit:1092743|Add nowiki to commonsuploads dblist (T380252)]] (duration: 14m 29s) [08:48:40] T380071: [wmf.3] GuidedTours are broken (originally: Homepage: No intro tours for new accounts) - https://phabricator.wikimedia.org/T380071 [08:48:40] T375392: Modify the user group application form on Meta - https://phabricator.wikimedia.org/T375392 [08:48:46] MichaelG_WMF: Ammar: synced [08:49:01] Thank you! [08:49:14] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092752|Add + to nowiki in core-Permissions.php (T380252)]] [08:49:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) (owner: 10Muehlenhoff) [08:49:54] confirmed that tours work on enwiki without mwdebug again ✅ [08:50:45] great! [08:54:47] !log urbanecm@deploy2002 urbanecm, jhsoby: Backport for [[gerrit:1092752|Add + to nowiki in core-Permissions.php (T380252)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:54:51] T380252: Add nowiki to commonsuploads dblist - https://phabricator.wikimedia.org/T380252 [08:54:52] !log urbanecm@deploy2002 urbanecm, jhsoby: Continuing with sync [08:54:58] tested on my end, works like a charm [08:56:51] 06SRE, 06collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10334895 (10Jelto) Thanks for opening this task! Generally, the Runners—both the standard and the Cloud Runners used by the linked job—should have access to the Wikimedia APT reposi... 
[08:59:08] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136#10334896 (10Volans) Full list of hosts without AAAA records for `A:owner-infrastructure-foundations` ` ganeti[2017-2024].codfw.wmnet,ganeti[1009,1011-... [08:59:31] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092752|Add + to nowiki in core-Permissions.php (T380252)]] (duration: 10m 17s) [08:59:47] (03PS1) 10Brouberol: spark3.5/build: define a maven settings file to make it use webroxy to connect to central [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092754 (https://phabricator.wikimedia.org/T380035) [08:59:48] and here we go [08:59:54] Jhs: ^^, should be now fully done (in production) [09:00:05] andre and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T0900) [09:00:17] right on time :) [09:01:57] hehe, morning. Backporting all done, I assume? [09:03:13] andre: yep yep! all yours :) [09:03:18] happy training (?) [09:03:24] Thanks (also for backporting)! [09:03:24] (train-ing) [09:03:26] let's see [09:06:30] I will now start promoting group0 wikis to 1.44.0-wmf.4 [09:06:41] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092760 (https://phabricator.wikimedia.org/T375663) [09:06:43] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092760 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [09:07:31] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092760 (https://phabricator.wikimedia.org/T375663) (owner: 10TrainBranchBot) [09:13:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1091602 (owner: 10Slyngshede) [09:16:51] (03CR) 10Arturo Borrero Gonzalez: "both the diff and PCC looks good to me expect for a doubt I have, in an inlined comment." [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [09:17:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:18:19] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [09:18:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [09:19:37] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.4 refs T375663 [09:19:40] T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663 [09:19:57] (03PS1) 10Muehlenhoff: testreduce: Enable profile::auto_restarts::service for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1092764 (https://phabricator.wikimedia.org/T135991) [09:22:30] (03PS1) 10Muehlenhoff: Correctly mark restbase* hosts as handled by Data Persistence [puppet] - 10https://gerrit.wikimedia.org/r/1092765 [09:26:24] (03PS1) 10DCausse: cirrus-streaming-updater: bump producer taskmanager to 3g [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092767 [09:29:56] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump producer taskmanager to 3g [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092767 (owner: 10DCausse) [09:31:04] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump producer taskmanager to 3g [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092767 (owner: 
10DCausse) [09:32:39] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:32:43] (03CR) 10Btullis: [C:03+1] "Thanks for this. It will probably break builds on a workstation, where webproxy is not available, but I guess that's OK." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092754 (https://phabricator.wikimedia.org/T380035) (owner: 10Brouberol) [09:32:47] 10SRE-tools, 06cloud-services-team, 06Infrastructure-Foundations, 07IPv6: Some WMCS clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271139#10334985 (10Volans) I think it can be resolved, list updated as of today: ` an-redacteddb1001.eqiad.wmnet,clouddb2002-dev.codfw.wmnet,cloud... [09:33:30] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:35:21] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [09:35:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:36:20] (03CR) 10Brouberol: [C:03+2] spark3.5/build: define a maven settings file to make it use webroxy to connect to central [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092754 (https://phabricator.wikimedia.org/T380035) (owner: 10Brouberol) [09:36:23] (03CR) 10Brouberol: [V:03+2 C:03+2] spark3.5/build: define a maven settings file to make it use webroxy to connect to central [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092754 (https://phabricator.wikimedia.org/T380035) (owner: 10Brouberol) [09:37:45] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:38:48] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:38:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [09:39:14] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:39:15] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [09:39:19] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:39:38] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:39:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [09:41:53] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin [09:41:54] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin [09:42:09] !log upgrade haproxy on cp-text|upload_eqsin (T379891) [09:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:13] T379891: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891 [09:42:15] (03PS1) 10DCausse: cirrus-streaming-updater: fix producer-staging mem usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092773 [09:43:15] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:46:09] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: fix producer-staging mem usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092773 (owner: 10DCausse) [09:47:25] (03Merged) 10jenkins-bot: cirrus-streaming-updater: fix producer-staging mem usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092773 (owner: 10DCausse) [09:49:08] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [09:49:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [09:51:40] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:51:51] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:52:14] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [09:55:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [09:56:53] (03CR) 10Jbond: "lgtm but agree with taavi that we should make this a fail" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [09:57:08] (03CR) 10Slyngshede: [C:03+2] Permissions: automatically attempt request validation on creation [software/bitu] - 10https://gerrit.wikimedia.org/r/1091602 (owner: 10Slyngshede) [09:58:10] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [09:58:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [09:59:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090798 (owner: 10Muehlenhoff) [09:59:28] (03Merged) 10jenkins-bot: Permissions: 
automatically attempt request validation on creation [software/bitu] - 10https://gerrit.wikimedia.org/r/1091602 (owner: 10Slyngshede) [09:59:28] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [09:59:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [10:00:00] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [10:00:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [10:02:00] !log installing openssl security updates [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:19] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add mc-gp200[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092282 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [10:09:37] (03PS1) 10Muehlenhoff: Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 [10:13:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 5%: repool', diff saved to https://phabricator.wikimedia.org/P71089 and previous config saved to /var/cache/conftool/dbconfig/20241119-101350-arnaudb.json [10:14:46] (03PS1) 10Slyngshede: Netbox alerting: Add remaining netbox alert. [alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) [10:15:47] (03PS1) 10Jelto: rake_modules: also lint charts against 1.31.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092780 (https://phabricator.wikimedia.org/T379919) [10:16:06] !log restart spamd on vrts to pick up openssl updates [10:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:20] (03CR) 10CI reject: [V:04-1] Netbox alerting: Add remaining netbox alert. 
[alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:17:56] 06SRE, 06collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10335127 (10MatthewVernon) Hm, yes, I took the path from a production host, where the key is installed into `/etc/apt/keyrings` by puppet (`apt::package_from_component`); you're righ... [10:24:10] (03PS2) 10Slyngshede: Netbox alerting: Add remaining netbox alert. [alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) [10:25:44] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [10:27:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [10:28:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 10%: repool', diff saved to https://phabricator.wikimedia.org/P71090 and previous config saved to /var/cache/conftool/dbconfig/20241119-102855-arnaudb.json [10:30:35] (03CR) 10JMeybohm: [C:03+1] Add replacement kafka nodes to kafka_brokers_main on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1089822 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [10:31:05] 07Puppet, 06cloud-services-team, 10Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10335164 (10dcaro) Currently not supported by the pyyaml https://github.com/yaml/pyyaml/issues/90 [10:32:39] (03PS1) 10Muehlenhoff: airflow_platform_eng: Add missing stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1092785 (https://phabricator.wikimedia.org/T378443) [10:36:38] (03CR) 10JMeybohm: [C:03+1] "kubeconform being happy with 1.31.2 is great news!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1092780 (https://phabricator.wikimedia.org/T379919) (owner: 10Jelto) [10:37:13] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin [10:41:52] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin [10:42:23] (03CR) 10Slyngshede: [C:03+1] "Lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/1092785 (https://phabricator.wikimedia.org/T378443) (owner: 10Muehlenhoff) [10:42:41] (03CR) 10Muehlenhoff: [C:03+2] Revise Envoy firewall options [puppet] - 10https://gerrit.wikimedia.org/r/1090798 (owner: 10Muehlenhoff) [10:43:08] (03PS4) 10Volans: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 [10:43:08] (03PS3) 10Volans: Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 [10:44:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 15%: repool', diff saved to https://phabricator.wikimedia.org/P71091 and previous config saved to /var/cache/conftool/dbconfig/20241119-104401-arnaudb.json [10:47:16] (03CR) 10Slyngshede: [C:03+2] Netfilter: Route alerts for cloud hosts to WMCS. [alerts] - 10https://gerrit.wikimedia.org/r/1087434 (owner: 10Slyngshede) [10:48:28] (03Merged) 10jenkins-bot: Netfilter: Route alerts for cloud hosts to WMCS. 
[alerts] - 10https://gerrit.wikimedia.org/r/1087434 (owner: 10Slyngshede) [10:49:28] (03CR) 10CI reject: [V:04-1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [10:49:42] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [10:52:41] (03PS1) 10Volans: cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 [10:57:45] (03CR) 10Effie Mouzeli: [C:03+2] Add replacement kafka nodes to kafka_brokers_main on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1089822 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [10:58:40] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [10:59:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: repool', diff saved to https://phabricator.wikimedia.org/P71092 and previous config saved to /var/cache/conftool/dbconfig/20241119-105906-arnaudb.json [10:59:13] (03CR) 10CI reject: [V:04-1] cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1100) [11:03:20] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 207947 [11:03:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 207947 [11:03:53] (03CR) 10Vgutierrez: [C:03+1] docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [11:05:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [11:14:12] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: repool', diff saved to https://phabricator.wikimedia.org/P71093 and previous config saved to /var/cache/conftool/dbconfig/20241119-111411-arnaudb.json [11:20:49] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] airflow_platform_eng: Add missing stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1092785 (https://phabricator.wikimedia.org/T378443) (owner: 10Muehlenhoff) [11:24:19] (03CR) 10Klausman: [C:03+2] ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091289 (owner: 10Klausman) [11:27:40] (03Merged) 10jenkins-bot: ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091289 (owner: 10Klausman) [11:29:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: repool', diff saved to https://phabricator.wikimedia.org/P71094 and previous config saved to /var/cache/conftool/dbconfig/20241119-112917-arnaudb.json [11:33:46] (03PS1) 10Muehlenhoff: Re-add Envoy firewall config for phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1092796 [11:36:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092796 (owner: 10Muehlenhoff) [11:39:08] (03PS2) 10Clément Goubert: testreduce: Enable profile::auto_restarts::service for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1092764 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:39:09] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092764 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:40:12] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [11:40:13] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [11:40:58] (03CR) 10Muehlenhoff: [C:04-1] "The 
aphlict hosts are still on Bullseye, I'd avoid to use systemd::sysuser there since you might run into T256098." [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [11:44:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: repool', diff saved to https://phabricator.wikimedia.org/P71095 and previous config saved to /var/cache/conftool/dbconfig/20241119-114422-arnaudb.json [11:44:37] (03CR) 10Clément Goubert: [C:03+1] testreduce: Enable profile::auto_restarts::service for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1092764 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:46:16] (03PS8) 10Muehlenhoff: peopleweb: limit envoy srange to CACHES and DEPLOYMENT_SERVERS [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [11:46:43] (03PS1) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [11:47:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [11:47:20] (03PS2) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [11:47:25] (03CR) 10Clément Goubert: [C:04-1] "The aliases need to be 'or'd" [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff) [11:48:38] (03CR) 10Muehlenhoff: Add umbrella Cumin alias for wikikube staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff) [11:49:31] (03PS1) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) [11:50:01] (03PS3) 10Elukey: 
sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) [11:51:16] (03CR) 10Muehlenhoff: "With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090798 merged, this should now be working correctly." [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [11:52:38] (03CR) 10Elukey: "Need to test-cookbook it, but let me know if the idea is ok" [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [11:54:37] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [11:57:36] (03PS2) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) [11:58:22] (03PS2) 10Btullis: Canary cephosd1001 to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) [11:59:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet [11:59:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:59:50] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [12:02:05] (03CR) 10CDanis: [C:03+1] "overall lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [12:04:26] (03PS5) 10Muehlenhoff: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [12:05:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10335368 (10Clement_Goubert) Thanks @Jhancock.wm :) [12:05:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [12:07:40] (03CR) 10Muehlenhoff: kubernetes::master: Don't override sa certificates on reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [12:12:18] (03PS1) 10Arnaudb: bashrc: adds a function to capture tmux pane output [puppet] - 10https://gerrit.wikimedia.org/r/1092811 [12:12:26] (03CR) 10Arnaudb: [C:03+2] bashrc: adds a function to capture tmux pane output [puppet] - 10https://gerrit.wikimedia.org/r/1092811 (owner: 10Arnaudb) [12:13:58] (03PS1) 10Clément Goubert: wikikube: Add wikikube-worker21[36-55] [puppet] - 10https://gerrit.wikimedia.org/r/1092814 (https://phabricator.wikimedia.org/T377028) [12:15:20] (03CR) 10Muehlenhoff: [C:03+2] testreduce: Enable profile::auto_restarts::service for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1092764 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:18:58] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [12:19:52] (03CR) 10Jelto: [C:03+2] rake_modules: also lint charts against 1.31.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092780 (https://phabricator.wikimedia.org/T379919) (owner: 10Jelto) [12:20:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare 
(exit_code=0) for the switch from codfw to eqiad [12:22:59] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [12:23:04] (03PS1) 10Clément Goubert: wikikube: Add wikikube-worker21[56-70] [puppet] - 10https://gerrit.wikimedia.org/r/1092816 (https://phabricator.wikimedia.org/T376966) [12:23:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [12:23:55] Hi, my name is Vaibhav. I am new here and I am very excited to contribute to open source. Please explain how to commit code and which GitHub repositories contain issues (preferably good first issues). Thanks [12:24:08] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Mailing list Delivery Mode set to None - https://phabricator.wikimedia.org/T368134#10335435 (10Aklapper) 05Open→03Declined Boldly closing per last comment. [12:25:45] (03PS3) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) [12:25:59] (03CR) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [12:26:07] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading: Unable to obtain exclusive write permission. Someone else is doing something with this file. - https://phabricator.wikimedia.org/T379234#10335446 (10MBH) 05Open→03Resolved a:03MBH I have uploaded this file successfully. If I got this error agai... 
[12:27:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw [12:27:31] (03PS4) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) [12:30:50] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw [12:31:59] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, 10GitLab (Infrastructure): Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10335459 (10Jelto) [12:32:16] (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-worker21[36-55] [puppet] - 10https://gerrit.wikimedia.org/r/1092814 (https://phabricator.wikimedia.org/T377028) (owner: 10Clément Goubert) [12:33:27] (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-worker21[56-70] [puppet] - 10https://gerrit.wikimedia.org/r/1092816 (https://phabricator.wikimedia.org/T376966) (owner: 10Clément Goubert) [12:33:55] (03CR) 10JMeybohm: [C:03+2] k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:35:02] !log removing ganeti1016 from active Ganeti nodes T378921 [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:06] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [12:36:17] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [12:36:28] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading: Unable to obtain exclusive write permission. Someone else is doing something with this file. 
- https://phabricator.wikimedia.org/T379234#10335475 (10MatthewVernon) Glad to hear it uploaded right now :) There's no need to ping me specifically, I get a... [12:36:33] (03PS1) 10Muehlenhoff: Update site.pp for ganeti1016 [puppet] - 10https://gerrit.wikimedia.org/r/1092819 (https://phabricator.wikimedia.org/T378921) [12:36:51] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [12:36:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10335480 (10MoritzMuehlenhoff) [12:37:19] (03Merged) 10jenkins-bot: rake_modules: also lint charts against 1.31.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092780 (https://phabricator.wikimedia.org/T379919) (owner: 10Jelto) [12:38:04] PROBLEM - ganeti-noded running on ganeti1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:38:04] PROBLEM - ganeti-confd running on ganeti1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:38:27] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from eqiad to codfw [12:39:10] FIRING: ProbeDown: Service ganeti1016:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:44] (03Merged) 10jenkins-bot: k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) 
(owner: 10JMeybohm) [12:40:03] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [12:40:52] (03CR) 10Muehlenhoff: [C:03+2] Update site.pp for ganeti1016 [puppet] - 10https://gerrit.wikimedia.org/r/1092819 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [12:41:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [12:41:47] jouncebot: nowandnext [12:41:47] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [12:41:47] In 0 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1300) [12:42:56] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [12:43:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [12:48:15] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [12:52:04] RESOLVED: ProbeDown: Service ganeti1016:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:36] (03PS1) 10Muehlenhoff: rt: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092821 [12:53:55] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [12:54:56] 10ops-codfw, 06DC-Ops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265 
(10Clement_Goubert) 03NEW [12:55:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:55:28] 10ops-codfw, 06DC-Ops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10335530 (10Clement_Goubert) [12:55:35] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10335531 (10SLyngshede-WMF) After trying, and failing, to register a passkey, I've been digging through CAS and the java-webauthn-server source code. If we want passkeys we'll nee... [12:56:26] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker21[36-55] [puppet] - 10https://gerrit.wikimedia.org/r/1092814 (https://phabricator.wikimedia.org/T377028) (owner: 10Clément Goubert) [12:57:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092333 (https://phabricator.wikimedia.org/T379811) (owner: 10Gergő Tisza) [12:57:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:28] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10335539 (10Clement_Goubert) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1300) [13:01:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092821 (owner: 10Muehlenhoff) [13:05:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 21574 [13:05:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21574 [13:05:26] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 53180 [13:05:53] 06SRE, 
06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10335562 (10Ruthven) Hi, I've got the information that someone wrote to `permissions-it@wikimedia.org` on 15/11/2024-11-15... [13:05:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 53180 [13:06:04] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 266631 [13:06:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266631 [13:06:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262979 [13:06:45] (03PS1) 10Muehlenhoff: planet: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092823 [13:06:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262979 [13:06:54] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 201838 [13:07:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092823 (owner: 10Muehlenhoff) [13:07:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 201838 [13:07:36] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 267521 [13:08:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 267521 [13:08:19] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 266098 [13:08:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266098 [13:12:24] (03PS1) 10Muehlenhoff: doc: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092825 [13:14:31] (03PS1) 10Muehlenhoff: miscweb: 
Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092827 [13:17:11] (03PS1) 10Muehlenhoff: add-ldap-group: Allow passing a description [puppet] - 10https://gerrit.wikimedia.org/r/1092828 [13:19:37] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10335590 (10revi) >>! In T380009#10335562, @Ruthven wrote: > Hi, > I've got the information that someone wrote to `permissi... [13:21:54] jouncebot: nowandnext [13:21:54] For the next 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1300) [13:21:54] In 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1400) [13:23:29] (03PS2) 10DCausse: rdf-streaming-updater: produce rdf_change v2 events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) [13:26:57] (03PS1) 10Muehlenhoff: Remove profile::ldap::bitu from Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/1092829 [13:27:00] (03CR) 10DCausse: "the version 0.3.150 (which contains Ie43716dfa789815f5d7021ecc20f113513292e08) has been deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [13:27:20] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs [13:27:22] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs [13:27:27] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: produce rdf_change v2 events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [13:28:30] (03CR) 10DCausse: [C:04-2] rdf-streaming-updater: produce rdf_change v2 events 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [13:28:56] (03CR) 10DCausse: [C:04-1] "some wcqs nodes are still running an old version of the updater" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [13:30:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092827 (owner: 10Muehlenhoff) [13:30:39] (03CR) 10Alexandros Kosiaris: [C:03+1] os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) (owner: 10Muehlenhoff) [13:33:48] (03CR) 10AOkoth: [C:03+1] os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) (owner: 10Muehlenhoff) [13:34:29] (03CR) 10CDanis: [C:03+1] kubernetes::master: Don't override sa certificates on reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [13:34:37] (03CR) 10Muehlenhoff: [C:03+2] os-reports: Also open up rsync to the aux k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092749 (https://phabricator.wikimedia.org/T350794) (owner: 10Muehlenhoff) [13:34:38] (03CR) 10DCausse: [C:04-1] "actually some nodes (I think mainly wcqs nodes) are running an old version of the artifacts (0.3.147 for the main blazegraph service and 0" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [13:35:18] 07Puppet, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10335639 (10jcrespo) With BBU: ` root@backup1012:~$ ./storcli64 show all J { "Controllers":[ {... 
[13:35:21] (03CR) 10CDanis: [C:03+1] kubernetes::master: Don't override sa certificates on reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [13:37:00] (03PS1) 10Arnaudb: mariadb: basic script to analyse general-log-file [software] - 10https://gerrit.wikimedia.org/r/1092832 (https://phabricator.wikimedia.org/T377451) [13:37:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10335646 (10jcrespo) @VRiley-WMF I am a bit confused with this task, did you install** a battery module to the existing RAID card**? That's what the OS... [13:38:29] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [13:55:52] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10335712 (10jcrespo) Can someone tell me what I just tested on backup1012 before I share my results? [13:56:40] (03CR) 10CDanis: [C:03+2] chromium-render: Add cli flag to avoid flooding with crashpad processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) (owner: 10Jgiannelos) [13:57:57] (03Merged) 10jenkins-bot: chromium-render: Add cli flag to avoid flooding with crashpad processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) (owner: 10Jgiannelos) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1400). 
[14:00:05] kart_, abijeet, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:31] * kart_ is here and will deploy first and second patch.. [14:00:40] kart_, hello [14:01:04] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [14:01:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) (owner: 10KartikMistry) [14:01:48] !log ihurbain@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:02:13] (03Merged) 10jenkins-bot: Enable the Contribute menu in 3rd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) (owner: 10KartikMistry) [14:02:31] !log ihurbain@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [14:02:46] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1092257|Enable the Contribute menu in 3rd group of Wikis (T375301)]] [14:02:49] T375301: Enable the Contribute menu in 3rd group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375301 [14:03:06] !log ihurbain@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:04:01] !log ihurbain@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:04:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092825 (owner: 10Muehlenhoff) [14:05:50] !log ihurbain@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:06:04] !log ihurbain@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [14:06:36] !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4]: Regular analytics weekly train 
[analytics/refinery@295d5a44] [14:07:16] !log ihurbain@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:08:27] (03PS2) 10Arnaudb: mariadb: basic script to analyse general-log-file [software] - 10https://gerrit.wikimedia.org/r/1092832 (https://phabricator.wikimedia.org/T377451) [14:09:22] o/ [14:10:10] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1290.eqiad.wmnet [14:10:11] !log kartik@deploy2002 kartik: Backport for [[gerrit:1092257|Enable the Contribute menu in 3rd group of Wikis (T375301)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:22] T375301: Enable the Contribute menu in 3rd group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375301 [14:10:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1290.eqiad.wmnet [14:10:50] (03CR) 10Muehlenhoff: [C:03+2] rt: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092821 (owner: 10Muehlenhoff) [14:11:05] !log kartik@deploy2002 kartik: Continuing with sync [14:12:30] tgr|away: I'll ping you once my changes are deployed. [14:13:29] 06SRE, 06collaboration-services, 13Patch-For-Review: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10335790 (10MatthewVernon) 05Open→03Resolved a:05Jelto→03MatthewVernon That MR has fixed it ([[ https://gitlab.wikimedia.org/repos/data_persistence/... 
[14:15:32] !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4]: Regular analytics weekly train [analytics/refinery@295d5a44] (duration: 08m 56s) [14:16:30] (03PS1) 10Muehlenhoff: vrts: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092838 [14:17:53] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092257|Enable the Contribute menu in 3rd group of Wikis (T375301)]] (duration: 15m 07s) [14:17:57] T375301: Enable the Contribute menu in 3rd group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375301 [14:18:12] (03PS1) 10Muehlenhoff: etherpad: Update Envoy firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1092839 [14:18:39] (03PS1) 10Alexandros Kosiaris: wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) [14:18:43] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs [14:19:10] abijeet: your patch is next! 
[14:19:18] (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [14:20:08] kart_, okie [14:20:46] (03PS1) 10Jaime Nuche: phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) [14:20:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092740 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:21:12] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs [14:21:30] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [14:21:38] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10335878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7007.magru.wmnet with OS bullseye [14:21:57] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [14:21:58] (03Merged) 10jenkins-bot: Enable message group subscription feature for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092740 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:22:16] (03PS1) 10Muehlenhoff: idm: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092842 [14:22:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [14:22:30] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1092740|Enable message group subscription feature for MediaWiki.org 
(T372386)]] [14:22:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10335885 (10Papaul) @jcrespo correct we didn't replace the raid controller we just added the battery to the existing raid controller [14:22:43] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:22:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:23:09] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [14:23:09] (03CR) 10CI reject: [V:04-1] phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [14:23:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:23:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [14:24:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092838 (owner: 10Muehlenhoff) [14:24:55] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [14:25:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [14:26:15] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [14:26:23] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [14:28:05] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1092740|Enable message group 
subscription feature for MediaWiki.org (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:09] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:28:12] kart_, testing [14:28:45] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [14:28:45] cool! [14:29:28] (03CR) 10Elukey: [C:03+2] docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [14:29:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [14:30:22] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [14:30:57] kart_, looks good. we can proceed with deployment [14:31:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [14:31:51] Nice! 
[14:31:57] !log kartik@deploy2002 kartik, abi: Continuing with sync [14:33:45] (03CR) 10Eevans: [C:03+2] Correctly mark restbase* hosts as handled by Data Persistence [puppet] - 10https://gerrit.wikimedia.org/r/1092765 (owner: 10Muehlenhoff) [14:33:51] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [14:34:00] (03PS2) 10Jaime Nuche: phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) [14:34:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [14:34:28] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10335946 (10jcrespo) Papaul answered me here: https://phabricator.wikimedia.org/T371416#10335885 [14:34:52] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [14:35:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092842 (owner: 10Muehlenhoff) [14:35:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [14:36:20] (03CR) 10CI reject: [V:04-1] phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:51] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092740|Enable message group 
subscription feature for MediaWiki.org (T372386)]] (duration: 16m 21s) [14:38:55] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:39:30] !log limit /v2/_catalog to internal IPs only for all Docker Registry nodes - T378618 [14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092839 (owner: 10Muehlenhoff) [14:39:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10335969 (10jcrespo) Thanks, that works for me, I just was confused. I also checked and there is some integrated ram on chip. Will soon share a summary... [14:39:56] (03PS3) 10Jaime Nuche: phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) [14:40:18] (03PS2) 10Eevans: restbase: commission restbase203[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) [14:40:19] (03CR) 10MVernon: [C:03+1] "LGTM, but this is all rather not-DRY if you see what I mean? A lot of this stuff feels like things that could/should be more templated..." [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) (owner: 10Eevans) [14:40:56] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2136.codfw.wmnet with OS bookworm [14:41:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [14:41:20] tgr|away: you can go ahead.. 
[14:41:26] (03CR) 10MVernon: [C:03+1] restbase: commission restbase203[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) (owner: 10Eevans) [14:41:41] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) (owner: 10Eevans) [14:41:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2137.codfw.wmnet with OS bookworm [14:42:09] (03CR) 10CI reject: [V:04-1] phabricator: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [14:42:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2138.codfw.wmnet with OS bookworm [14:42:33] thx kart_ [14:43:17] (03PS1) 10Ssingh: wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) [14:43:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2139.codfw.wmnet with OS bookworm [14:43:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092333 (https://phabricator.wikimedia.org/T379811) (owner: 10Gergő Tisza) [14:44:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2141.codfw.wmnet with OS bookworm [14:44:06] (03CR) 10CI reject: [V:04-1] wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [14:44:20] (03Merged) 10jenkins-bot: Use 'auth' rather than 'sso' as cookie prefix on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092333 (https://phabricator.wikimedia.org/T379811) (owner: 10Gergő Tisza) [14:44:47] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.reimage for host wikikube-worker2142.codfw.wmnet with OS bookworm [14:44:49] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1092333|Use 'auth' rather than 'sso' as cookie prefix on the auth domain (T379811)]] [14:44:53] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [14:45:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [14:46:58] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [14:48:48] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw [14:49:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw [14:49:55] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [14:50:05] (03PS2) 10Ssingh: wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) [14:50:21] !log tgr@deploy2002 tgr: Backport for [[gerrit:1092333|Use 'auth' rather than 'sso' as cookie prefix on the auth domain (T379811)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:25] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [14:50:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [14:50:27] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [14:50:55] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1092828 (owner: 10Muehlenhoff) [14:51:28] (03CR) 10Slyngshede: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1092829 (owner: 10Muehlenhoff) 
[14:51:34] (03CR) 10Slyngshede: [C:03+1] Remove profile::ldap::bitu from Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/1092829 (owner: 10Muehlenhoff) [14:52:25] !log tgr@deploy2002 tgr: Continuing with sync [14:52:33] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1092842 (owner: 10Muehlenhoff) [14:52:44] (03CR) 10Vgutierrez: [C:03+1] docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [14:53:17] (03PS5) 10Volans: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 [14:53:17] (03PS4) 10Volans: Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 [14:53:17] (03PS2) 10Volans: cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 [14:59:05] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092333|Use 'auth' rather than 'sso' as cookie prefix on the auth domain (T379811)]] (duration: 14m 16s) [14:59:08] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [14:59:11] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [14:59:32] (03CR) 10Volans: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [14:59:49] (03CR) 10CI reject: [V:04-1] cookbooks.sre.switchdc.databases: improve desc [cookbooks] - 10https://gerrit.wikimedia.org/r/1092787 (owner: 10Volans) [15:00:22] (03CR) 10CI reject: [V:04-1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [15:00:43] (03CR) 10Muehlenhoff: [C:03+2] Remove 
profile::ldap::bitu from Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/1092829 (owner: 10Muehlenhoff) [15:03:07] (03PS1) 10Muehlenhoff: profile::ldap::bitu: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1092845 [15:03:30] !log UTC afternoon deploys done [15:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10336049 (10Jclark-ctr) [15:05:11] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278 (10bking) 03NEW [15:05:52] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad [15:06:00] (03CR) 10Ssingh: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:06:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:47] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from codfw to eqiad [15:07:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from codfw to eqiad [15:09:21] (03PS1) 10MVernon: regex: apply disks_by_path to the new thanos nodes [puppet] - 10https://gerrit.wikimedia.org/r/1092847 (https://phabricator.wikimedia.org/T368445) [15:11:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) 
rolling upgrade of HAProxy on A:cp-upload_eqiad [15:13:06] (03CR) 10Eevans: [C:03+1] regex: apply disks_by_path to the new thanos nodes [puppet] - 10https://gerrit.wikimedia.org/r/1092847 (https://phabricator.wikimedia.org/T368445) (owner: 10MVernon) [15:13:11] (03CR) 10Volans: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:13:35] (03CR) 10MVernon: [C:03+2] regex: apply disks_by_path to the new thanos nodes [puppet] - 10https://gerrit.wikimedia.org/r/1092847 (https://phabricator.wikimedia.org/T368445) (owner: 10MVernon) [15:13:45] (03CR) 10Ssingh: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:14:01] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [15:15:29] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7007.magru.wmnet with OS bullseye [15:15:38] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10336157 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7007.magru.wmnet with OS bullseye completed: -... 
[15:17:31] (03PS5) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [15:17:50] (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:17:59] (03PS6) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [15:18:58] (03PS1) 10Sergio Gimeno: [DNM] PoC: newcomer task stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092848 (https://phabricator.wikimedia.org/T377097) [15:19:31] (03CR) 10David Caro: "From a chat on irc, I think the original goal of the check was to test the given ips from all the sites, not only the site it's defined in" [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [15:19:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:21:04] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2136.codfw.wmnet with OS bookworm [15:21:08] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2137.codfw.wmnet with OS bookworm [15:21:20] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2141.codfw.wmnet with OS bookworm [15:21:25] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
wikikube-worker2142.codfw.wmnet with OS bookworm [15:22:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2136.codfw.wmnet with OS bookworm [15:23:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [15:23:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:24:18] (03PS1) 10DLynch: Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092850 (https://phabricator.wikimedia.org/T333710) [15:24:32] (03PS1) 10DLynch: Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092851 (https://phabricator.wikimedia.org/T333710) [15:25:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/VisualEditor] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092850 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [15:25:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/VisualEditor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092851 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [15:25:38] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2138.codfw.wmnet with OS bookworm [15:25:53] !log cgoubert@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host wikikube-worker2139.codfw.wmnet with OS bookworm [15:26:13] (03CR) 10Krinkle: "Past examples where we varied frontend cache for MobileFrontend opt-in features:" [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [15:27:37] (03PS6) 10Muehlenhoff: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [15:27:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [15:28:05] (03CR) 10Muehlenhoff: [C:03+2] add-ldap-group: Allow passing a description [puppet] - 10https://gerrit.wikimedia.org/r/1092828 (owner: 10Muehlenhoff) [15:28:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2137.codfw.wmnet with OS bookworm [15:28:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2138.codfw.wmnet with OS bookworm [15:29:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2139.codfw.wmnet with OS bookworm [15:29:12] jouncebot: nowandnext [15:29:12] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [15:29:12] In 0 hour(s) and 30 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1600) [15:29:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2141.codfw.wmnet with OS bookworm [15:29:31] (03PS1) 10Dreamy Jazz: ExperimentUserDefaultsManager: Decrease log severity to debug [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092856 (https://phabricator.wikimedia.org/T380271) [15:29:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2142.codfw.wmnet with OS bookworm [15:29:55] (03CR) 
10Dreamy Jazz: [C:03+2] ExperimentUserDefaultsManager: Decrease log severity to debug [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092856 (https://phabricator.wikimedia.org/T380271) (owner: 10Dreamy Jazz) [15:29:58] (03CR) 10Muehlenhoff: [C:03+1] "With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090798 merged, this should not be good to deploy. Remember to reboot after enab" [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:31:05] (03CR) 10Dreamy Jazz: [C:03+2] "Deploying this on the request of Martin." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092856 (https://phabricator.wikimedia.org/T380271) (owner: 10Dreamy Jazz) [15:31:19] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10336255 (10RobH) These shipped with (2) 400-AKJG : 200GB Solid State Drive SATA Write Intensive 6Gbps 2.5in Hot-plug Drive, S3710 per host. Any of the lo... [15:32:06] (03CR) 10Muehlenhoff: [C:03+1] "With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090798 merged, this is now good to go." [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [15:32:20] (03CR) 10JHathaway: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:33:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[15:33:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:35:26] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10336291 (10isarantopoulos) a:03klausman [15:36:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092856 (https://phabricator.wikimedia.org/T380271) (owner: 10Dreamy Jazz) [15:37:11] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10336313 (10isarantopoulos) p:05Triage→03Medium [15:39:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10336325 (10MoritzMuehlenhoff) [15:40:17] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T380024#10336333 (10klausman) This is two things: - service updates in the std namespace - a broken change for limitRanges. I will push the former... 
[15:40:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [15:41:02] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380228#10336340 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated cable on pdu side [15:41:16] (03CR) 10Muehlenhoff: [C:03+2] "python3-bitu-ldap have been removed from the Cumin hosts after rollout" [puppet] - 10https://gerrit.wikimedia.org/r/1092829 (owner: 10Muehlenhoff) [15:42:50] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:43:17] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 10317MiB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [15:44:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2136.codfw.wmnet with reason: host reimage [15:45:05] !log installing libheif security updates [15:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2137.codfw.wmnet with reason: host reimage [15:46:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2138.codfw.wmnet with reason: host reimage [15:47:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage [15:47:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2141.codfw.wmnet with reason: host reimage [15:48:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2142.codfw.wmnet with reason: host reimage 
[15:49:29] (03PS1) 10Muehlenhoff: snapshot: Update Cumin alias with dumper_fillin_wd role [puppet] - 10https://gerrit.wikimedia.org/r/1092861 [15:49:50] (03PS3) 10Ssingh: wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) [15:50:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2137.codfw.wmnet with reason: host reimage [15:51:45] (03CR) 10Andrew Bogott: Remove support for neutron linuxbridge driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:51:45] (03CR) 10Ssingh: wmflib: add facter data for lshw -class memory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [15:53:11] (03Merged) 10jenkins-bot: ExperimentUserDefaultsManager: Decrease log severity to debug [extensions/GrowthExperiments] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092856 (https://phabricator.wikimedia.org/T380271) (owner: 10Dreamy Jazz) [15:53:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2138.codfw.wmnet with reason: host reimage [15:53:43] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1092856|ExperimentUserDefaultsManager: Decrease log severity to debug (T380271)]] [15:53:46] T380271: GrowthExperiments\ExperimentUserDefaultsManager::shouldAssignBucket failed to get a central user ID - https://phabricator.wikimedia.org/T380271 [15:54:28] (03PS6) 10Andrew Bogott: Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) [15:54:28] (03PS2) 10Andrew Bogott: Neutron: remove linuxbridge from mechanism_drivers [puppet] - 10https://gerrit.wikimedia.org/r/1092425 
(https://phabricator.wikimedia.org/T326373) [15:54:42] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:54:51] (03CR) 10Andrew Bogott: Remove support for neutron linuxbridge driver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:54:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:54:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10336410 (10Jhancock.wm) @Papaul you might need to check the switch. I looked in the idrac and the link shows as up. physically up as well. [15:55:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2140.codfw.wmnet with OS bookworm [15:56:37] (03PS1) 10Muehlenhoff: ci: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092863 [15:57:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2141.codfw.wmnet with reason: host reimage [15:57:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092863 (owner: 10Muehlenhoff) [15:57:30] (03CR) 10CI reject: [V:04-1] Remove support for neutron linuxbridge driver [puppet] - 10https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: 10Andrew Bogott) [15:59:34] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1092856|ExperimentUserDefaultsManager: Decrease log severity to debug (T380271)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:59:38] T380271: GrowthExperiments\ExperimentUserDefaultsManager::shouldAssignBucket failed to get a 
central user ID - https://phabricator.wikimedia.org/T380271 [15:59:51] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [16:00:04] eoghan, jelto, arnoldokoth, and mutante: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1600). [16:00:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage [16:03:16] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [16:03:16] (03PS1) 10AikoChou: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) [16:03:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2136.codfw.wmnet with OS bookworm [16:04:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2142.codfw.wmnet with reason: host reimage [16:06:51] (03CR) 10Ssingh: wmflib: add facter data for lshw -class memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:06:59] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092856|ExperimentUserDefaultsManager: Decrease log severity to debug (T380271)]] (duration: 13m 16s) [16:07:03] T380271: GrowthExperiments\ExperimentUserDefaultsManager::shouldAssignBucket failed to get a central user ID - https://phabricator.wikimedia.org/T380271 [16:07:10] Done my deployments [16:09:36] (03PS4) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 
(https://phabricator.wikimedia.org/T378724) [16:09:36] (03CR) 10Ssingh: [V:03+1] "This will fail until Id5fe57103c786528feb69a6850901b5829d450c2 is merged." [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:09:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2137.codfw.wmnet with OS bookworm [16:10:08] (03PS5) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [16:10:42] (03CR) 10Ssingh: "See I4e5896f99c32b26646cd7d9943d050f8b1ef1996 on the use case for this." [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:12:37] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:13:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2138.codfw.wmnet with OS bookworm [16:13:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1313.eqiad.wmnet with OS bookworm [16:13:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1314.eqiad.wmnet with OS bookworm [16:14:00] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm [16:14:04] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
wikikube-worker1314.eqiad.wmnet with OS bookworm [16:15:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1315.eqiad.wmnet with OS bookworm [16:15:30] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm [16:15:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1316.eqiad.wmnet with OS bookworm [16:15:36] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm [16:15:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1317.eqiad.wmnet with OS bookworm [16:15:42] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336599 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm [16:16:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2141.codfw.wmnet with OS bookworm [16:17:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1318.eqiad.wmnet with OS bookworm [16:17:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:17:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1320.eqiad.wmnet with OS bookworm [16:17:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1319.eqiad.wmnet with OS bookworm 
[16:19:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336606 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm [16:19:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm [16:19:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm [16:19:13] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336610 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm [16:19:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2139.codfw.wmnet with OS bookworm [16:22:53] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10336641 (10jcrespo) We just extrated a disk and put it back in, the host was able to keep writing at all moments. I will force... 
[16:24:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2142.codfw.wmnet with OS bookworm [16:26:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:26:40] (03CR) 10Vgutierrez: P:hardware::check: add profile to check HW configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:26:56] (03CR) 10Brouberol: "I'm guessing you meant "this is *now* good to deploy"? Just to make sure :)" [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:28:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:29:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:29:47] (03CR) 10Ssingh: P:hardware::check: add profile to check HW configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:30:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1313.eqiad.wmnet with reason: host reimage [16:31:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1314.eqiad.wmnet with reason: host reimage [16:32:33] (03CR) 10JHathaway: "in testing it appears that `lshw` gives back bogus json on bullseye hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:33:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker1315.eqiad.wmnet with reason: host reimage [16:33:11] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1317.eqiad.wmnet with reason: host reimage [16:33:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1316.eqiad.wmnet with reason: host reimage [16:34:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1313.eqiad.wmnet with reason: host reimage [16:34:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1319.eqiad.wmnet with reason: host reimage [16:34:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1318.eqiad.wmnet with reason: host reimage [16:35:01] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1321.eqiad.wmnet with reason: host reimage [16:36:01] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 4278 MB (3% inode=60%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [16:36:25] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [16:36:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1320.eqiad.wmnet with reason: host reimage [16:37:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1315.eqiad.wmnet with reason: host reimage [16:39:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:40:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1317.eqiad.wmnet with reason: host reimage 
[16:43:06] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10336729 (10Fabfur) As preliminary test before tomorrow's work, we reimaged cp7007 and verified all runs fine. It ran fine. [16:43:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1316.eqiad.wmnet with reason: host reimage [16:44:24] (03CR) 10Fabfur: "Had confirmation from DE that an unattended haproxykafka restart (like for new cert/keys) isn't an issue anymore. I think we can proceed o" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [16:46:37] 06SRE-OnFire, 10Incident Tooling: corto: only operate on applicable phabricator issues - https://phabricator.wikimedia.org/T380293 (10Eevans) 03NEW [16:46:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1321.eqiad.wmnet with reason: host reimage [16:50:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1320.eqiad.wmnet with reason: host reimage [16:52:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:52:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:52:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1313.eqiad.wmnet with OS bookworm [16:52:37] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host 
wikikube-worker1313.eqiad.wmnet with OS bookworm completed: - w... [16:53:10] (03CR) 10Vgutierrez: [C:03+1] haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [16:53:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1319.eqiad.wmnet with reason: host reimage [16:55:38] (03CR) 10Ssingh: "Seems like it broken in the version of lshw in bullseye, which is what the cp hosts are on :( https://bugs.debian.org/cgi-bin/bugreport.cg" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [16:55:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:56:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:56:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1315.eqiad.wmnet with OS bookworm [16:56:38] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm completed: - w... 
[16:56:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1318.eqiad.wmnet with reason: host reimage [16:56:54] (03CR) 10Bking: [C:03+2] Fixing an improper merge of values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092340 (owner: 10Aleksandar Mastilovic) [16:58:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:58:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:58:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1317.eqiad.wmnet with OS bookworm [16:58:42] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336795 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm completed: - w... [17:00:03] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2110'] [17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1700). [17:00:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1314.eqiad.wmnet with reason: host reimage [17:00:05] MatmaRex: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2110'] [17:00:17] hi [17:00:22] (03PS1) 10Daimona Eaytoy: Prevent ce_event_wikis query when feature flag is off [extensions/CampaignEvents] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092875 (https://phabricator.wikimedia.org/T380288) [17:00:37] MatmaRex: hello! is this tested on Beta already? [17:00:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2110.codfw.wmnet with OS bullseye [17:00:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1322.eqiad.wmnet with OS bookworm [17:00:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1324.eqiad.wmnet with OS bookworm [17:00:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1323.eqiad.wmnet with OS bookworm [17:00:51] (canonically I think by cherrypicking to the Beta puppetmaster, but I'm not super familiar) [17:00:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10336804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2110.co... [17:00:55] rzl: no. 
but it should only affect beta [17:01:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm [17:01:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336807 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm [17:01:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336808 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm [17:01:37] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.024e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [17:01:39] MatmaRex: yeah, from the hiera path I agree with that -- I'm happy to merge it blind if you're confident and that's what you'd like [17:01:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:02:14] rzl: yeah, as confident as i can be without actually testing it. 
i'm not really familiar with how that'd be done either [17:02:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:02:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1316.eqiad.wmnet with OS bookworm [17:02:23] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm completed: - w... [17:02:47] rzl: i'd rather just have it merged and test afterwards, seems easier this way. is that okay? [17:03:02] no skin off my nose :) going ahead [17:03:12] (i would be more thoughtful about it if it wasn't beta-only) [17:03:13] (03CR) 10RLazarus: [C:03+2] Rename sso.wikimedia.beta.wmflabs.org to auth.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1091843 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [17:03:24] thanks [17:04:28] merge complete at the prod puppetserver [17:04:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:04:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:04:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1321.eqiad.wmnet with OS bookworm [17:04:54] go forth and do whatever beta stuff happens next :P if you need a followup for any reason, ping me, I'll be around (though in meetings starting at :30 and maybe slower to respond) [17:04:54] 10ops-eqiad, 
06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm completed: - wikikube-worker1321... [17:05:36] rzl: i guess i need to wait for a DNS update or something? i'll be able to test things once i can access the domain [17:05:54] (03CR) 10Gergő Tisza: "Thanks! That does seem like an easier way of handling it." [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [17:06:30] MatmaRex: this isn't DNS, it's an Apache config [17:07:13] oh, so the puppet run just hasn't happened yet, right? [17:07:42] I might have misunderstood, I thought you were going to complete the deploy after I merged it -- that will require running puppet (or waiting for it) and might require an apache config reload [17:08:07] (to be clear: i am looking at https://auth.wikimedia.beta.wmflabs.org and it says "Domain not configured") [17:08:12] right. that's fine [17:08:33] sorry, i'm just not familiar with the process. but i have access to the beta cluster, so i should be able to do the rest [17:08:47] assuming i can find the docs. 
or i'll just wait :) [17:08:55] thanks for merging [17:08:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:09:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:09:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1320.eqiad.wmnet with OS bookworm [17:09:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336833 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm completed: - wikikube-worker1320... [17:10:08] okay, good luck :) I'll be around if you need any help from someone who doesn't know the infrastructure you're working on [17:10:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1325.eqiad.wmnet with OS bookworm [17:10:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1327.eqiad.wmnet with OS bookworm [17:10:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1326.eqiad.wmnet with OS bookworm [17:10:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm [17:10:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm 
[17:11:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm [17:11:07] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:11:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:11:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1319.eqiad.wmnet with OS bookworm [17:11:44] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm completed: - wikikube-worker1319... 
[17:14:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:14:39] (03PS5) 10Ebernhardson: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [17:15:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:15:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1318.eqiad.wmnet with OS bookworm [17:15:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10336882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm completed: - wikikube-worker1318... [17:15:35] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2140.codfw.wmnet with OS bookworm [17:16:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2110.codfw.wmnet with reason: host reimage [17:16:55] (03PS1) 10Ssingh: magru: use eqiad's installserver temporarily for testing [puppet] - 10https://gerrit.wikimedia.org/r/1092876 [17:18:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:18:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1322.eqiad.wmnet with reason: host reimage [17:18:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1324.eqiad.wmnet with reason: host reimage [17:18:28] Hey folks! 
Do backports of train blockers need to go in the normal deployment windows? It's been a while since I last broke the wikis and I can't recall what the process is. [17:18:29] (03PS6) 10Ebernhardson: Migrate package to opensearch [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) [17:18:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:18:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1314.eqiad.wmnet with OS bookworm [17:18:45] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1323.eqiad.wmnet with reason: host reimage [17:19:23] Daimona: nope, train blockers take priority [17:19:36] i can sling out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1092875 [17:19:50] Okay thank you, great! So the next question: would someone be willing to deploy my backport? [17:19:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2110.codfw.wmnet with reason: host reimage [17:20:04] yep, i'll get it started [17:20:22] Ahhhhhh you beat me to it [17:20:26] Thank you! [17:20:33] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-Needs-Improvement: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10336912 (10Aklapper) a:05colewhite→03None @colewhite: Removing task assignee as this open task has been assigned for more than two years - See the email sen... 
[17:21:25] (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092877 (https://phabricator.wikimedia.org/T373037) [17:21:40] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10336937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm completed: - w... [17:22:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/CampaignEvents] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092875 (https://phabricator.wikimedia.org/T380288) (owner: 10Daimona Eaytoy) [17:22:42] (03CR) 10Ssingh: [C:03+2] magru: use eqiad's installserver temporarily for testing [puppet] - 10https://gerrit.wikimedia.org/r/1092876 (owner: 10Ssingh) [17:23:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1324.eqiad.wmnet with reason: host reimage [17:24:08] jouncebot: nowandnext [17:24:09] For the next 0 hour(s) and 35 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1700) [17:24:09] In 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1800) [17:24:19] Daimona: want me to deploy it? 
[17:24:32] (03CR) 10Ladsgroup: [C:03+2] Bump ratio of new parsercache key spec to 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092877 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [17:25:03] Amir1: i'm on it [17:25:11] Amir1: thank you for the offer (that I would generally not refuse), but Brennen is on it [17:25:26] (03PS1) 10Papaul: Change insrallserver in magru to point to eqiad insrall server [homer/public] - 10https://gerrit.wikimedia.org/r/1092878 (https://phabricator.wikimedia.org/T376737) [17:25:34] awesome, is it merging? if so, I will quickly push my change then :D [17:25:44] (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092877 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [17:26:00] 06SRE, 10conftool, 13Patch-Needs-Improvement: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581#10336960 (10Aklapper) a:05RLazarus→03None @RLazarus: Removing task assignee as this open task has been assigned for more than two years - See the em... [17:26:07] Amir1: it's been +2'd [17:26:08] Yep, it's currently in the process of figuring out what random CI failure would be best suited [17:26:14] haha [17:26:16] ah, you're holding the lock. 
That's fine, I can wait [17:26:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1322.eqiad.wmnet with reason: host reimage [17:26:56] It shouldn't take long, we don't have many ext dependencies [17:28:36] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1325.eqiad.wmnet with reason: host reimage [17:28:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1327.eqiad.wmnet with reason: host reimage [17:28:46] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1326.eqiad.wmnet with reason: host reimage [17:29:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1323.eqiad.wmnet with reason: host reimage [17:30:10] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10337014 (10jcrespo) ` root@backup2012:~$ ./storcli64 /c0/e251/s0 show all CLI Version = 007.3103.0000.0000 Aug 22, 2024 Operat... [17:32:49] (03CR) 10Ssingh: [C:03+1] Change insrallserver in magru to point to eqiad insrall server [homer/public] - 10https://gerrit.wikimedia.org/r/1092878 (https://phabricator.wikimedia.org/T376737) (owner: 10Papaul) [17:32:56] (03CR) 10JHathaway: "I would add:" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:32:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1325.eqiad.wmnet with reason: host reimage [17:34:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1183.eqiad.wmnet with OS bullseye [17:34:36] (03CR) 10Ssingh: "Thanks, I will add the additional confine. Also, sorry, do you mean that it's OK to run on all bare metal hosts? Or not?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:34:45] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10337071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye [17:36:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1327.eqiad.wmnet with reason: host reimage [17:36:45] (03CR) 10JHathaway: "sorry for the confusion, I think it is *ok* to add to all bare metal hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [17:37:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:37:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:37:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2110.codfw.wmnet with OS bullseye [17:37:39] (03Merged) 10jenkins-bot: Prevent ce_event_wikis query when feature flag is off [extensions/CampaignEvents] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092875 (https://phabricator.wikimedia.org/T380288) (owner: 10Daimona Eaytoy) [17:37:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10337101 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2110.codfw.... 
[17:38:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10337106 (10Jhancock.wm) [17:38:16] Amir1: ok with your parsercache change going out same time? [17:38:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10337109 (10Jhancock.wm) [17:38:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1326.eqiad.wmnet with reason: host reimage [17:40:34] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:41:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:41:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1324.eqiad.wmnet with OS bookworm [17:41:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10337112 (10Jhancock.wm) 05Open→03Resolved @bking this is complete! [17:41:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337127 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm completed: - wikikube-worker1324... [17:41:20] brennen: sure [17:41:21] thanks! 
[17:41:24] (03CR) 10Papaul: [C:03+2] Change insrallserver in magru to point to eqiad insrall server [homer/public] - 10https://gerrit.wikimedia.org/r/1092878 (https://phabricator.wikimedia.org/T376737) (owner: 10Papaul) [17:41:49] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [17:41:52] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [17:41:54] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1092875|Prevent ce_event_wikis query when feature flag is off (T380288)]] [17:41:58] T380288: Uncaught Wikimedia\Rdbms\DBQueryError: "Table 'testwiki.ce_event_wikis' doesn't exist" on event pages and CampaignEvents special pages - https://phabricator.wikimedia.org/T380288 [17:42:02] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [17:42:33] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wikikube-worker1290.eqiad.wmnet with reason: being moved to new port [17:42:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wikikube-worker1290.eqiad.wmnet with reason: being moved to new port [17:43:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:45:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:45:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1322.eqiad.wmnet with OS bookworm [17:45:16] 06SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634#10337188 (10Aklapper) a:05CDanis→03None 
@CDanis: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on October 11th. Please a... [17:45:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337196 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm completed: - wikikube-worker1322... [17:45:56] 06SRE, 06SRE-OnFire: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355#10337186 (10Aklapper) a:05CDanis→03None @CDanis: Removing task assignee as this open task has been assigned for more than two years - See the email sent to ta... [17:46:03] 06SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634#10337205 (10CDanis) [17:47:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:47:34] !log brennen@deploy2002 daimona, brennen: Backport for [[gerrit:1092875|Prevent ce_event_wikis query when feature flag is off (T380288)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:47:38] T380288: Uncaught Wikimedia\Rdbms\DBQueryError: "Table 'testwiki.ce_event_wikis' doesn't exist" on event pages and CampaignEvents special pages - https://phabricator.wikimedia.org/T380288 [17:47:49] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1290 [17:47:57] Daimona, Amir1: anything to test for either of these? 
[17:47:59] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker1290 [17:47:59] 06SRE, 06Infrastructure-Foundations, 10netops: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#10337207 (10Aklapper) a:05cmooney→03None @cmooney: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task as... [17:48:06] not for my case [17:48:23] only to check if Special:Random doesn't trigger fireworks [17:48:28] I can quickly test that opening those pages no longer results in the huge red error [17:48:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:48:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1323.eqiad.wmnet with OS bookworm [17:48:58] Daimona: kk, i'll wait for your go-ahead. [17:48:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm completed: - wikikube-worker1323... [17:49:11] And it doesn't! All good. 
[17:50:02] syncing (also saw no fireworks from Special:Random) [17:50:06] !log brennen@deploy2002 daimona, brennen: Continuing with sync [17:50:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1183.eqiad.wmnet with OS bullseye [17:50:22] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:50:26] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10337274 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye executed with errors: - an... [17:50:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:50:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1325.eqiad.wmnet with OS bookworm [17:50:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337277 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm completed: - wikikube-worker1325... 
[17:52:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1183.eqiad.wmnet with OS bullseye [17:52:52] Thank you <3 [17:52:56] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10337297 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye [17:53:27] sure thing [17:53:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:53:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:54:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1327.eqiad.wmnet with OS bookworm [17:54:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm completed: - wikikube-worker1327... [17:54:14] Yayyyy thank you! 
[17:55:57] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:56:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:56:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1326.eqiad.wmnet with OS bookworm [17:56:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm completed: - wikikube-worker1326... [17:57:02] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337365 (10Jclark-ctr) [17:57:04] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092875|Prevent ce_event_wikis query when feature flag is off (T380288)]] (duration: 15m 10s) [17:57:07] T380288: Uncaught Wikimedia\Rdbms\DBQueryError: "Table 'testwiki.ce_event_wikis' doesn't exist" on event pages and CampaignEvents special pages - https://phabricator.wikimedia.org/T380288 [17:57:20] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10337371 (10Jclark-ctr) 05Open→03Resolved [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1800) [18:00:58] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 3 others: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10337359 (10cmooney) Ok. 
So I've tested the "[[ https://netbox.wikimed... [18:01:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.695s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:06:10] This one *might* be related to my parsercache patch and should recover automatically. [18:06:10] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307 (10RobH) 03NEW [18:06:12] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10337484 (10RobH) [18:06:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.695s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:06:17] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337485 (10RobH) [18:07:07] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [18:07:34] 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7007.magru.wmnet with... 
[18:11:03] (03PS5) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) [18:11:39] (03CR) 10JMeybohm: kubernetes::master: Don't override sa certificates on reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [18:11:42] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm) [18:12:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: 10Albertoleoncio) [18:13:32] (03PS3) 10Ebernhardson: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 [18:13:32] (03PS3) 10Ebernhardson: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 [18:13:32] (03PS4) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 [18:13:32] (03PS18) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [18:14:51] (03CR) 10CI reject: [V:04-1] [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [18:16:07] (03PS1) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) [18:16:44] (03CR) 10CI reject: [V:04-1] cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [18:19:10] (03PS2) 10Fabfur: cache: 
install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) [18:21:17] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10337662 (10RobH) p:05Triage→03High @Jclark-ctr or @VRiley-WMF if either of you can take a look at this ASAP and install some decom disks in here, it wou... [18:21:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [18:24:54] (03PS5) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [18:29:53] (03PS3) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) [18:32:13] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [18:32:16] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [18:32:17] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [18:34:13] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7007.magru.wmnet with OS bullseye [18:34:26] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye executed with... 
[18:34:40] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:34:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [18:34:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:34:55] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337749 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye [18:36:01] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [18:36:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:39:10] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:41:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [18:42:55] FIRING: MaxConntrack: Max conntrack at 93.23% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:43:23] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 96 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:44:09] (03CR) 10Ssingh: [C:03+1] cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [18:47:37] !log sukhe@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host cp7007.magru.wmnet with OS bullseye [18:47:44] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye executed with... [18:48:08] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [18:48:15] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye [18:51:23] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 78 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:51:47] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1183.eqiad.wmnet with OS bullseye [18:52:56] RESOLVED: MaxConntrack: Max conntrack at 91.51% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:52:56] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10337790 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye executed with errors: - an... 
[18:53:01] !log Import ncmonitor 1.3.0-1 into main apt repo [18:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1183.eqiad.wmnet with OS bullseye [18:53:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:34] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10337791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye [18:53:41] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:55] FIRING: MaxConntrack: Max conntrack at 92.67% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:55:54] (03PS1) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [18:56:37] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10337804 (10VRiley-WMF) @Jclark-ctr has placed 2 900 GB drives for this unit into the spare slots. 
[18:57:40] (03CR) 10CI reject: [V:04-1] Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [18:58:55] RESOLVED: MaxConntrack: Max conntrack at 92.67% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [19:00:04] andre and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T1900). nyaa~ [19:02:18] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10337813 (10jcrespo) So this is the summary of my tests, comparing the //Super Micro Computer Inc AOC-S3908L-H8iR RAID Adapter/... 
[19:05:07] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [19:05:10] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [19:05:17] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [19:05:34] (03PS2) 10Cathal Mooney: Potential script to assign fr-tech server IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [19:08:20] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1183.eqiad.wmnet with reason: host reimage [19:08:26] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7007.magru.wmnet with OS bullseye [19:08:34] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye executed with... 
[19:08:51] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [19:08:59] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye [19:12:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1183.eqiad.wmnet with reason: host reimage [19:12:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10337852 (10jcrespo) 05Open→03Resolved [19:13:15] PROBLEM - Host lsw1-c3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:14:04] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7007.magru.wmnet with OS bullseye [19:14:15] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye executed with... 
[19:14:17] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [19:15:20] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [19:15:28] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye [19:15:43] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:44] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@a4d0954]: mjolnir: T379045 Increase maxResultSize [19:16:48] T379045: mjolnir fails with: Partition not found in table 'labeled_query_page' database 'mjolnir' - https://phabricator.wikimedia.org/T379045 [19:16:49] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 81, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:17:11] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@a4d0954]: mjolnir: T379045 Increase maxResultSize (duration: 00m 26s) [19:24:02] (03PS3) 10CDanis: haproxy: bwlim-by-path: also roll out to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) [19:24:05] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [19:30:49] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:31:28] (03CR) 10CDanis: [C:03+2] "vg verbal approval in meeting last week" [puppet] - 
10https://gerrit.wikimedia.org/r/1087615 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [19:32:47] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:12] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:40:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp7007.magru.wmnet [19:41:15] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7007.magru.wmnet with OS bullseye [19:41:22] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10337988 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7007.magru.wmnet with OS bullseye executed with... [19:42:28] (03CR) 10Ssingh: "Traffic feedback seems to be that we should alert and not fail() Puppet runs. 
So while most of the logic remains, this needs to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:47:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp7007.magru.wmnet [19:56:32] (03CR) 10Muehlenhoff: cache: install lshw from bullseye-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur) [19:58:28] (03PS1) 10Ssingh: sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 [20:03:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:03:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1183.eqiad.wmnet with OS bullseye [20:03:55] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10338057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1183.eqiad.wmnet with OS bullseye completed: - an-worker1183... 
[20:04:49] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 (owner: 10Ssingh) [20:05:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [20:08:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10338069 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm [20:08:41] (03PS2) 10Ssingh: sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 [20:10:23] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [20:10:26] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400 [20:10:28] T371400: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400 [20:11:15] (03PS3) 10Ssingh: sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 [20:17:13] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 (owner: 10Ssingh) [20:20:18] 06SRE, 06MediaWiki-Platform-Team, 10observability, 07Grafana, 13Patch-For-Review: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#10338121 (10Krinkle) [20:20:30] (03CR) 10Bernard Wang: [C:03+1] Reenable non-UI experiement quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (owner: 10Bernard Wang) [20:20:45] (03PS2) 10Bernard Wang: Reenable non-UI experiment quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) [20:23:07] 10ops-eqiad, 
06SRE, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10338131 (10Jclark-ctr) 05Open→03Resolved Installed 2 x960gb ssd into slot 2/3 [20:23:10] 06SRE, 06MediaWiki-Platform-Team, 10observability, 07Grafana, 13Patch-For-Review: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#10338136 (10Krinkle) [20:23:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:23:51] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:24:02] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:24:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10338150 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:24:34] (03PS1) 10Bvibber: Separate cache key space for test & production JsonConfig data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) [20:24:48] 06SRE, 06MediaWiki-Platform-Team, 10observability, 07Grafana, 13Patch-For-Review: Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105#10338117 (10Krinkle) 05Open→03Declined We won't be needing a separate endpoint. Today, to bring data into Graphite, the `stat... 
[20:24:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [20:25:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10338158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed... [20:25:11] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10338159 (10Jclark-ctr) [20:25:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) (owner: 10Bvibber) [20:29:12] (03PS4) 10Ssingh: sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 [20:29:21] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10338175 (10Jclark-ctr) a:03VRiley-WMF [20:29:42] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [20:29:46] (03CR) 10Jdlrobson: [C:03+1] Reenable non-UI experiment quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: 10Bernard Wang) [20:31:11] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10338184 (10xcollazo) Just wanted to pass by and say thank you for taking care of this emergency swiftly. 
[20:32:17] (03Abandoned) 10Ssingh: sre.hosts.reimage: test output of Netbox get_server() [cookbooks] - 10https://gerrit.wikimedia.org/r/1092907 (owner: 10Ssingh) [20:32:28] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7007.magru.wmnet with OS bullseye [20:34:00] (03PS1) 10CDanis: haproxy+requestctl: enable in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1092914 (https://phabricator.wikimedia.org/T370745) [20:34:54] (03CR) 10CDanis: "I picked codfw because I didn't want to interfere with haproxykafka testing in eqsin/ulsfo." [puppet] - 10https://gerrit.wikimedia.org/r/1092914 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis) [20:40:04] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye [20:40:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10338218 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... 
[20:40:26] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:40:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10338225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:42:11] (03PS2) 10Bvibber: Separate cache key space for test & production JsonConfig data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) [20:50:03] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bullseye [20:50:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10338255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [20:56:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [20:56:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10338286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm [20:59:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10338294 (10phaultfinder) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241119T2100). 
[21:00:05] ksarabia, kemayo, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:13] (03CR) 10Eevans: [C:03+2] restbase: commission restbase203[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) (owner: 10Eevans) [21:00:40] Hi. I'm here and would appreciate help in deploying. Thank you [21:01:04] Both my patches don't have anything I can do to test them (they're error-catching for something we've not figured out how to reproduce), so feel free to merge-and-deploy without waiting on me. [21:02:28] I can deploy, with the caveat that I'm going to be trying the new spiderpig deployment, so it might be a little bumpy [21:03:07] o/ [21:03:45] my config patch is for an edge case, i can just test to confirm that it doesn't break the common case ;) [21:05:08] ty [21:05:29] OK, I'm getting setup with spiderpig. I'll ping y'all once I'm getting started. 
[21:05:50] kindrobot: oh neat :D I've played around with it locally, but haven't given it a go yet [21:14:59] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2036.codfw.wmnet with reason: Bootstrapping — T380236 [21:15:05] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236 [21:15:14] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2036.codfw.wmnet with reason: Bootstrapping — T380236 [21:15:21] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2037.codfw.wmnet with reason: Bootstrapping — T380236 [21:15:35] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2037.codfw.wmnet with reason: Bootstrapping — T380236 [21:15:46] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2038.codfw.wmnet with reason: Bootstrapping — T380236 [21:16:00] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2038.codfw.wmnet with reason: Bootstrapping — T380236 [21:16:25] Bah, sorry team. I'm on a freshly imaged laptop and I'm having trouble getting my SSH onto the deployment box. 
RoanKattouw urbanecm cjming or TheresNoTime, any chance you could do it instead (while I'm figuring this out) [21:16:36] i can help [21:16:41] Thank you <3 [21:17:35] (03CR) 10Urbanecm: [C:03+2] Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092850 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:17:41] (03CR) 10Urbanecm: [C:03+2] Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092851 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:19:10] FIRING: [18x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:16] bvibber: question... shouldn't there be an `unset` in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1092912 on the helper variable, to avoid it being distributed as a global var to all of mediawiki? [21:19:45] oops [21:19:48] yes lemme fix that [21:20:13] ty :) [21:20:30] (03PS3) 10Jdlrobson: Promote Vector 2022 as default on 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) [21:20:36] (03PS3) 10Bvibber: Separate cache key space for test & production JsonConfig data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) [21:20:51] urbanecm: updated. 
thx for the catch :D [21:20:59] (03CR) 10Urbanecm: [C:03+2] Promote Vector 2022 as default on 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson) [21:21:43] (03Merged) 10jenkins-bot: Promote Vector 2022 as default on 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson) [21:22:10] (03CR) 10Urbanecm: [C:03+2] Separate cache key space for test & production JsonConfig data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) (owner: 10Bvibber) [21:22:20] \o/ [21:22:25] no problem! this looks much better :) [21:22:55] (03Merged) 10jenkins-bot: Separate cache key space for test & production JsonConfig data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) (owner: 10Bvibber) [21:22:56] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092912 (https://phabricator.wikimedia.org/T380320) (owner: 10Bvibber) [21:23:24] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092296|Promote Vector 2022 as default on 3 wikis (T379765)]], [[gerrit:1092912|Separate cache key space for test & production JsonConfig data (T380320)]] [21:23:29] T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765 [21:23:29] T380320: JsonConfig cache key space overlaps between testcommonswiki & commonswiki - https://phabricator.wikimedia.org/T380320 [21:23:34] I submitted another config change for a backport few mins back .. let me know if it is possible to get that in as well. [21:23:55] subbu: sure thing, thanks for the info! 
[21:24:10] FIRING: [18x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:24:11] ty [21:27:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10338367 (10bking) [21:28:49] RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 0.31 ms [21:29:01] PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.193.1.232 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:29:01] PROBLEM - Juniper alarms on lsw1-c3-codfw.mgmt is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.193.1.232 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:29:13] !log urbanecm@deploy2002 bvibber, jdlrobson, urbanecm: Backport for [[gerrit:1092296|Promote Vector 2022 as default on 3 wikis (T379765)]], [[gerrit:1092912|Separate cache key space for test & production JsonConfig data (T380320)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:26] whee [21:29:33] T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765 [21:29:34] T380320: JsonConfig cache key space overlaps between testcommonswiki & commonswiki - https://phabricator.wikimedia.org/T380320 [21:30:09] kimberly_sarabia: bvibber: can you test, please? [21:30:44] urbanecm: doesn't explode! that's as good as i can test :D [21:30:52] urbanecm: LGTM [21:30:55] ty [21:30:57] good enough for me bvibber! 
[21:30:59] thanks kimberly_sarabia [21:31:00] hehe [21:31:01] !log urbanecm@deploy2002 bvibber, jdlrobson, urbanecm: Continuing with sync [21:35:13] PROBLEM - Host lsw1-c3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:35:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10338394 (10bking) Thank you very much @Jclark-ctr ! Once the drives were added, I created a software RAID from them `parted... [21:35:29] (03PS1) 10Herron: site: add aux-k8s codfw insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/1092922 (https://phabricator.wikimedia.org/T378986) [21:35:39] (03Merged) 10jenkins-bot: Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092850 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:35:40] (03PS19) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:35:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): High priority: Disk space expansion on an-launcher1002 - https://phabricator.wikimedia.org/T380278#10338401 (10bking) [21:38:00] (03PS1) 10Cathal Mooney: WIP: example config for Nokia SR-Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [21:38:00] (03CR) 10Cathal Mooney: "Amazing stuff!! had a quick scan looks really nice.... 
I do sort of prefer creating autodicts and then setting the params in the messy['s" [homer/public] - 10https://gerrit.wikimedia.org/r/1084107 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [21:38:02] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092296|Promote Vector 2022 as default on 3 wikis (T379765)]], [[gerrit:1092912|Separate cache key space for test & production JsonConfig data (T380320)]] (duration: 14m 38s) [21:38:07] T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765 [21:38:07] T380320: JsonConfig cache key space overlaps between testcommonswiki & commonswiki - https://phabricator.wikimedia.org/T380320 [21:38:08] here we go! [21:38:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092341 (https://phabricator.wikimedia.org/T374661) (owner: 10C. Scott Ananian) [21:38:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092850 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:38:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092851 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:38:18] (03Merged) 10jenkins-bot: Revert "editcheck: Remove try/catch around transaction squashing" [extensions/VisualEditor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1092851 (https://phabricator.wikimedia.org/T333710) (owner: 10DLynch) [21:39:08] (03Merged) 10jenkins-bot: Enable experimental Parsoid fragment support on labs and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092341 (https://phabricator.wikimedia.org/T374661) (owner: 10C. 
Scott Ananian) [21:39:39] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092341|Enable experimental Parsoid fragment support on labs and test wikis (T374661)]], [[gerrit:1092850|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]], [[gerrit:1092851|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]] [21:39:47] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [21:39:48] T333710: Fix TransactionSquasher crashes - https://phabricator.wikimedia.org/T333710 [21:39:48] T380234: Publish button hangs indefinitely without saving for some users - https://phabricator.wikimedia.org/T380234 [21:39:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [21:39:55] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:40:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10338423 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed... [21:41:52] (03PS20) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:43:07] (03PS1) 10Cathal Mooney: WIP: wmf-netbox - expose interfaces in a SR-Linux format [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1084105 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [21:43:07] (03CR) 10Cathal Mooney: "Code looks really good. beta +1. I do wonder if we aren't better off putting this inside modules/srlinux/interfaces.py however. 
Or at le" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1084105 (https://phabricator.wikimedia.org/T371088) (owner: 10Ayounsi) [21:44:31] (03PS21) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:44:55] RESOLVED: MaxConntrack: Max conntrack at 96.88% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:45:09] !log urbanecm@deploy2002 cscott, kemayo, urbanecm: Backport for [[gerrit:1092341|Enable experimental Parsoid fragment support on labs and test wikis (T374661)]], [[gerrit:1092850|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]], [[gerrit:1092851|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:45:24] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [21:45:25] T333710: Fix TransactionSquasher crashes - https://phabricator.wikimedia.org/T333710 [21:45:25] T380234: Publish button hangs indefinitely without saving for some users - https://phabricator.wikimedia.org/T380234 [21:46:51] subbu: can you test your patch, please? [21:47:00] (03PS22) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [21:47:22] yes .. will do. [21:48:05] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4555/co" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (owner: 10Ebernhardson) [21:51:45] it works just fine on beta, but it doesn't work on testwiki ... [21:52:09] but, we can sync it ... it helps us test and debug and fix it. 
[21:52:48] it is only meant to be enabled on beta cluster & testwiki. [21:53:07] !log urbanecm@deploy2002 cscott, kemayo, urbanecm: Continuing with sync [21:53:07] (03CR) 10Bking: [C:03+2] dse-k8s: add ingress config for net-new service [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [21:53:10] sounds good [22:00:19] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092341|Enable experimental Parsoid fragment support on labs and test wikis (T374661)]], [[gerrit:1092850|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]], [[gerrit:1092851|Revert "editcheck: Remove try/catch around transaction squashing" (T333710 T380234)]] (duration: 20m 39s) [22:00:30] subbu: Kemayo: should be live! [22:00:31] T374661: Charts are not compatible with Parsoid - show as raw SVG - https://phabricator.wikimedia.org/T374661 [22:00:31] T333710: Fix TransactionSquasher crashes - https://phabricator.wikimedia.org/T333710 [22:00:32] T380234: Publish button hangs indefinitely without saving for some users - https://phabricator.wikimedia.org/T380234 [22:00:34] ty. [22:00:34] just in time [22:00:37] np [22:00:39] :) [22:00:39] urbanecm: Thanks! [22:01:21] whee [22:01:45] must have been something broken with mwdebug plugin on my end ... it works on testwiki as well after sync .. \o/ [22:02:04] yay! 
[22:04:40] (CR) Cwhite: [C:+2] logstash: upgrade phatality version to 2.7.0.1 [puppet] - https://gerrit.wikimedia.org/r/1092343 (https://phabricator.wikimedia.org/T342476) (owner: Cwhite)
[22:07:09] RECOVERY - Disk space on Hadoop worker on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[22:08:10] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930
[22:08:16] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092931
[22:18:25] (CR) Cwhite: [C:+1] "OK by me." [puppet] - https://gerrit.wikimedia.org/r/1091327 (owner: Ebernhardson)
[22:22:18] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[22:23:23] (PS1) Ebernhardson: Repoint .gitreview at new repo [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/1092935
[22:23:25] (CR) BCornwall: "Updated the list to more accurately redirect where the user may have intention to go. Also removed a few bad domains that slipped through " [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[22:24:23] (CR) Pppery: "Most of these seem to be paid editing domains that we agreed it was wiser to dead-park rather than redirecting at https://gerrit.wikimedia" [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[22:24:55] (Abandoned) Ebernhardson: Repoint .gitreview at new repo [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/1092935 (owner: Ebernhardson)
[22:25:35] (PS1) Ebernhardson: Repoint .gitreview at new repo [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1092936
[22:32:26] (PS1) Bking: opensearch: add components for bullseye and bookworm [puppet] - https://gerrit.wikimedia.org/r/1092938 (https://phabricator.wikimedia.org/T372769)
[22:32:41] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1092938 (https://phabricator.wikimedia.org/T372769) (owner: Bking)
[22:38:52] (CR) Ryan Kemper: [C:+1] "puppet 7 pcc succeeded" [puppet] - https://gerrit.wikimedia.org/r/1092938 (https://phabricator.wikimedia.org/T372769) (owner: Bking)
[22:39:01] (CR) Bking: [C:+2] opensearch: add components for bullseye and bookworm [puppet] - https://gerrit.wikimedia.org/r/1092938 (https://phabricator.wikimedia.org/T372769) (owner: Bking)
[22:44:57] (PS1) BCornwall: ncmonitor: Ignore pay-for-edit/scam domains [puppet] - https://gerrit.wikimedia.org/r/1092943
[22:45:28] (CR) BCornwall: "Thanks, I've moved most of those over to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092943 and will remove them here." [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[22:49:02] (PS3) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[22:50:15] (PS2) BCornwall: ncmonitor: Ignore pay-for-edit/scam domains [puppet] - https://gerrit.wikimedia.org/r/1092943
[22:50:51] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@9927a5a] (wcqs): Deploy 0.3.150 to WCQS
[22:52:06] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4556/co" [puppet] - https://gerrit.wikimedia.org/r/1092943 (owner: BCornwall)
[22:54:31] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 193 MB (0% inode=98%): /tmp 193 MB (0% inode=98%): /var/tmp 193 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:56:06] (PS2) BCornwall: ncmonitor: Add "main" WMF domains to ignorelist [puppet] - https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640)
[22:56:06] (PS2) BCornwall: ncmonitor: Add pywikibot.org to domain ignorelist [puppet] - https://gerrit.wikimedia.org/r/1092362
[22:56:06] (PS3) BCornwall: ncmonitor: Ignore pay-for-edit/scam domains [puppet] - https://gerrit.wikimedia.org/r/1092943
[22:59:30] (PS4) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[23:31:07] (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1092930 (owner: Ncmonitor)
[23:32:49] PROBLEM - Disk space on Hadoop worker on an-worker1109 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:33:49] (CR) Pppery: [C:+1] "This also needs a change to dns to point them away from ncredir (since ncredir redirects all domains it doesn't know anything about to wik" [puppet] - https://gerrit.wikimedia.org/r/1092943 (owner: BCornwall)
[23:41:43] (CR) BCornwall: "Indeed, thanks! We're dealing with access issues at the moment but I've made a ticket to follow up with that: https://phabricator.wikimed" [puppet] - https://gerrit.wikimedia.org/r/1092943 (owner: BCornwall)
[23:53:59] (CR) Pppery: "Don't have anything useful to say here." [puppet] - https://gerrit.wikimedia.org/r/1092362 (owner: BCornwall)