[00:00:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:00:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2039.codfw.wmnet with OS bookworm
[00:00:29] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083345 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2039.codfw.wmnet wit...
[00:02:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1298.eqiad.wmnet with OS bullseye
[00:02:14] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083349 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bullseye...
[00:02:44] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:03:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67529 and previous config saved to /var/cache/conftool/dbconfig/20240822-000315-ladsgroup.json
[00:07:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:07:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:07:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2038.codfw.wmnet with OS bookworm
[00:07:31] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083350 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2038.codfw.wmnet wit...
[00:07:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bullseye
[00:08:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1298.eqiad.wmnet with OS bullseye
[00:08:10] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083351 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bull...
[00:08:14] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083352 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bullseye...
[00:10:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2040.codfw.wmnet with OS bookworm
[00:10:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2041.codfw.wmnet with OS bookworm
[00:10:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2042.codfw.wmnet with OS bookworm
[00:10:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2043.codfw.wmnet with OS bookworm
[00:10:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2040.codfw.wmnet with OS bookworm
[00:10:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2041.codfw.wmnet with OS bookworm
[00:10:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2042.codfw.wmnet with OS bookworm
[00:10:28] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083354 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet...
[00:10:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2044.codfw.wmnet with OS bookworm
[00:10:29] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083355 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2041.codfw.wmnet...
[00:10:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2043.codfw.wmnet with OS bookworm
[00:10:31] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083356 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2042.codfw.wmnet...
[00:10:32] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083357 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2043.codfw.wmnet...
[00:10:35] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083358 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet wit...
[00:10:37] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083360 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2042.codfw.wmnet wit...
[00:10:38] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083359 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2041.codfw.wmnet wit...
[00:10:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2044.codfw.wmnet with OS bookworm
[00:10:42] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2044.codfw.wmnet...
[00:10:43] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2043.codfw.wmnet wit...
[00:10:50] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2044.codfw.wmnet wit...
[00:12:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bullseye
[00:12:09] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083364 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bull...
[00:12:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1298.eqiad.wmnet with OS bullseye
[00:12:20] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10083365 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bullseye...
[00:13:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2040.codfw.wmnet with OS bookworm
[00:13:55] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet...
[00:13:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2040.codfw.wmnet with OS bookworm
[00:14:05] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083367 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet wit...
[00:14:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1064495 (owner: TrainBranchBot)
[00:14:27] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2040.codfw.wmnet with OS bookworm
[00:18:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P67530 and previous config saved to /var/cache/conftool/dbconfig/20240822-001823-ladsgroup.json
[00:18:28] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083388 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet...
[00:18:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2040.codfw.wmnet with OS bookworm
[00:18:38] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083389 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet wit...
[00:19:27] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:27] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T371742)', diff saved to https://phabricator.wikimedia.org/P67531 and previous config saved to /var/cache/conftool/dbconfig/20240822-003330-ladsgroup.json
[00:33:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[00:33:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:33:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[00:33:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2129 (T371742)', diff saved to https://phabricator.wikimedia.org/P67532 and previous config saved to /var/cache/conftool/dbconfig/20240822-003352-ladsgroup.json
[00:34:27] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2040.codfw.wmnet with OS bookworm
[00:35:36] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083412 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet...
[00:37:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2041.codfw.wmnet with OS bookworm
[00:38:14] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083416 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2041.codfw.wmnet...
[00:38:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2042.codfw.wmnet with OS bookworm
[00:38:43] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083418 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2042.codfw.wmnet...
[00:39:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2043.codfw.wmnet with OS bookworm
[00:39:28] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083419 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2043.codfw.wmnet...
[00:39:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2044.codfw.wmnet with OS bookworm
[00:39:58] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083420 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2044.codfw.wmnet...
[00:44:27] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:45:44] RESOLVED: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2040.codfw.wmnet with reason: host reimage
[00:50:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2041.codfw.wmnet with reason: host reimage
[00:52:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2040.codfw.wmnet with reason: host reimage
[00:53:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2042.codfw.wmnet with reason: host reimage
[00:53:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2043.codfw.wmnet with reason: host reimage
[00:54:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2044.codfw.wmnet with reason: host reimage
[00:55:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2041.codfw.wmnet with reason: host reimage
[00:58:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2043.codfw.wmnet with reason: host reimage
[00:58:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T371742)', diff saved to https://phabricator.wikimedia.org/P67533 and previous config saved to /var/cache/conftool/dbconfig/20240822-005857-ladsgroup.json
[00:59:01] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[01:01:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2044.codfw.wmnet with reason: host reimage
[01:05:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2042.codfw.wmnet with reason: host reimage
[01:07:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:09:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:09:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2040.codfw.wmnet with OS bookworm
[01:10:07] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083470 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2040.codfw.wmnet wit...
[01:10:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:10:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:10:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2041.codfw.wmnet with OS bookworm
[01:11:03] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083474 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2041.codfw.wmnet wit...
[01:14:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P67534 and previous config saved to /var/cache/conftool/dbconfig/20240822-011405-ladsgroup.json
[01:15:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:20:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:21:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:21:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2042.codfw.wmnet with OS bookworm
[01:21:13] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083486 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2042.codfw.wmnet wit...
[01:21:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:21:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:21:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2044.codfw.wmnet with OS bookworm
[01:21:52] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083487 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2044.codfw.wmnet wit...
[01:29:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P67535 and previous config saved to /var/cache/conftool/dbconfig/20240822-012912-ladsgroup.json
[01:32:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[01:33:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:34:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:34:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2043.codfw.wmnet with OS bookworm
[01:35:11] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083506 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2043.codfw.wmnet wit...
[01:38:39] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10083507 (Jhancock.wm) Open→Resolved
[01:42:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[01:44:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T371742)', diff saved to https://phabricator.wikimedia.org/P67536 and previous config saved to /var/cache/conftool/dbconfig/20240822-014419-ladsgroup.json
[01:44:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[01:44:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[01:44:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[01:44:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T371742)', diff saved to https://phabricator.wikimedia.org/P67537 and previous config saved to /var/cache/conftool/dbconfig/20240822-014441-ladsgroup.json
[01:51:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:53:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:58:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:02:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:03:45] FIRING: [4x] Primary outbound port utilisation over 80% #page: Device asw2-ulsfo.mgmt.ulsfo.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:07:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:08:45] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:09:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T371742)', diff saved to https://phabricator.wikimedia.org/P67538 and previous config saved to /var/cache/conftool/dbconfig/20240822-020930-ladsgroup.json
[02:09:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:13:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:24:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67539 and previous config saved to /var/cache/conftool/dbconfig/20240822-022437-ladsgroup.json
[02:39:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P67540 and previous config saved to /var/cache/conftool/dbconfig/20240822-023944-ladsgroup.json
[02:54:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T371742)', diff saved to https://phabricator.wikimedia.org/P67541 and previous config saved to /var/cache/conftool/dbconfig/20240822-025451-ladsgroup.json
[02:54:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[02:54:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:55:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[02:55:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[02:55:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[02:55:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T371742)', diff saved to https://phabricator.wikimedia.org/P67542 and previous config saved to /var/cache/conftool/dbconfig/20240822-025529-ladsgroup.json
[02:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:44] (PS1) KartikMistry: Content Translation: Revert MT threshold to default for Portuguese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1064510 (https://phabricator.wikimedia.org/T356356)
[03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:53] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10083614 (phaultfinder)
[03:20:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T371742)', diff saved to https://phabricator.wikimedia.org/P67543 and previous config saved to /var/cache/conftool/dbconfig/20240822-032007-ladsgroup.json
[03:20:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:22:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:27:31] (PS1) KartikMistry: Enable Content/Section translation on WPs without MT [mediawiki-config] - https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582)
[03:27:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064510 (https://phabricator.wikimedia.org/T356356) (owner: KartikMistry)
[03:29:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582) (owner: KartikMistry)
[03:35:01] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10083629 (phaultfinder)
[03:35:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67544 and previous config saved to /var/cache/conftool/dbconfig/20240822-033514-ladsgroup.json
[03:50:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P67545 and previous config saved to /var/cache/conftool/dbconfig/20240822-035022-ladsgroup.json
[03:52:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[03:54:51] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10083631 (phaultfinder)
[04:05:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T371742)', diff saved to https://phabricator.wikimedia.org/P67546 and previous config saved to /var/cache/conftool/dbconfig/20240822-040529-ladsgroup.json
[04:05:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[04:05:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:05:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [04:05:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T371742)', diff saved to https://phabricator.wikimedia.org/P67547 and previous config saved to /var/cache/conftool/dbconfig/20240822-040551-ladsgroup.json [04:07:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:30:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T371742)', diff saved to https://phabricator.wikimedia.org/P67548 and previous config saved to /var/cache/conftool/dbconfig/20240822-043013-ladsgroup.json [04:30:17] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:45:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67549 and previous config saved to /var/cache/conftool/dbconfig/20240822-044520-ladsgroup.json [05:00:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P67550 and previous config saved to /var/cache/conftool/dbconfig/20240822-050027-ladsgroup.json [05:15:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T371742)', 
diff saved to https://phabricator.wikimedia.org/P67551 and previous config saved to /var/cache/conftool/dbconfig/20240822-051536-ladsgroup.json [05:15:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:15:40] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:15:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [05:15:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67552 and previous config saved to /var/cache/conftool/dbconfig/20240822-051547-ladsgroup.json [05:26:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67553 and previous config saved to /var/cache/conftool/dbconfig/20240822-052618-ladsgroup.json [05:26:22] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:41:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67554 and previous config saved to /var/cache/conftool/dbconfig/20240822-054125-ladsgroup.json [05:54:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:56:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P67555 and previous config saved to /var/cache/conftool/dbconfig/20240822-055633-ladsgroup.json [06:00:05] Deploy window MediaWiki infrastructure (UTC 
early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T0600) [06:00:05] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T0600). [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T371742)', diff saved to https://phabricator.wikimedia.org/P67556 and previous config saved to /var/cache/conftool/dbconfig/20240822-061140-ladsgroup.json [06:11:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [06:11:44] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:11:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [06:12:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T371742)', diff saved to https://phabricator.wikimedia.org/P67557 and previous config saved to /var/cache/conftool/dbconfig/20240822-061202-ladsgroup.json [06:21:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2193 (T371742)', diff saved to https://phabricator.wikimedia.org/P67558 and previous config saved to /var/cache/conftool/dbconfig/20240822-062146-ladsgroup.json [06:21:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:36:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67559 and previous config saved to /var/cache/conftool/dbconfig/20240822-063653-ladsgroup.json [06:40:32] (03CR) 10Ayounsi: [C:03+1] Allow the selection of any vlan in provision server script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064387 (https://phabricator.wikimedia.org/T365651) (owner: 10Cathal Mooney) [06:41:48] (03PS2) 10Anzx: knwikisource : Create flood flag and add file importer right to Admin user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064556 (https://phabricator.wikimedia.org/T373073) [06:41:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 40317 [06:42:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064556 (https://phabricator.wikimedia.org/T373073) (owner: 10Anzx) [06:42:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40317 [06:46:39] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 17.1 [puppet] - 10https://gerrit.wikimedia.org/r/1064638 (https://phabricator.wikimedia.org/T373074) [06:49:27] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:52:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P67560 and previous config saved to /var/cache/conftool/dbconfig/20240822-065201-ladsgroup.json [06:55:37] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 4637 [06:57:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4637 [07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T0700). [07:00:05] kart_ and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] o/ [07:00:58] Here. I'll start my deployments.. [07:01:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064510 (https://phabricator.wikimedia.org/T356356) (owner: 10KartikMistry) [07:02:24] (03Merged) 10jenkins-bot: Content Translation: Revert MT threshold to default for Portuguese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064510 (https://phabricator.wikimedia.org/T356356) (owner: 10KartikMistry) [07:03:01] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1064510|Content Translation: Revert MT threshold to default for Portuguese Wikipedia (T356356)]] [07:03:05] T356356: Set the threshold of translation to 85% in the Portuguese Wikipedia. 
- https://phabricator.wikimedia.org/T356356 [07:05:19] !log kartik@deploy1003 kartik: Backport for [[gerrit:1064510|Content Translation: Revert MT threshold to default for Portuguese Wikipedia (T356356)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:06:30] !log kartik@deploy1003 kartik: Continuing with sync [07:07:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T371742)', diff saved to https://phabricator.wikimedia.org/P67561 and previous config saved to /var/cache/conftool/dbconfig/20240822-070708-ladsgroup.json [07:07:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [07:07:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:07:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [07:11:03] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064510|Content Translation: Revert MT threshold to default for Portuguese Wikipedia (T356356)]] (duration: 08m 01s) [07:11:06] T356356: Set the threshold of translation to 85% in the Portuguese Wikipedia. 
- https://phabricator.wikimedia.org/T356356 [07:12:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582) (owner: 10KartikMistry) [07:14:34] (03PS2) 10KartikMistry: Enable Content/Section translation on WPs without MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582) [07:15:49] (03CR) 10TrainBranchBot: "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582) (owner: 10KartikMistry) [07:17:03] (03Merged) 10jenkins-bot: Enable Content/Section translation on WPs without MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064514 (https://phabricator.wikimedia.org/T361582) (owner: 10KartikMistry) [07:17:22] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1064514|Enable Content/Section translation on WPs without MT (T361582)]] [07:17:25] T361582: Enable Content and Section translation on Wikipedias without current machine translation support to facilitate the support in the future - https://phabricator.wikimedia.org/T361582 [07:17:42] 06SRE, 06DBA, 06serviceops, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10083764 (10Marostegui) >>! In T372943#1... 
[07:18:56] (03PS1) 10Marostegui: installserver: Do not reimage db2228 [puppet] - 10https://gerrit.wikimedia.org/r/1064647 [07:19:30] !log kartik@deploy1003 kartik: Backport for [[gerrit:1064514|Enable Content/Section translation on WPs without MT (T361582)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:20:47] !log kartik@deploy1003 kartik: Continuing with sync [07:22:04] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db2228 [puppet] - 10https://gerrit.wikimedia.org/r/1064647 (owner: 10Marostegui) [07:22:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:22:42] anzx: are you deploying your patch? [07:23:31] I need to go afk after my patch is deployed (not out of town, as the documentation mentions!) [07:25:08] kart_: I need someone to deploy my patch; I can't do it myself. I can schedule it for later if no one is around to deploy it [07:25:14] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064514|Enable Content/Section translation on WPs without MT (T361582)]] (duration: 07m 51s) [07:25:17] T361582: Enable Content and Section translation on Wikipedias without current machine translation support to facilitate the support in the future - https://phabricator.wikimedia.org/T361582 [07:25:39] Check if Amir1 or urbanecm are around..
[07:28:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [07:28:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [07:28:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T371742)', diff saved to https://phabricator.wikimedia.org/P67562 and previous config saved to /var/cache/conftool/dbconfig/20240822-072836-ladsgroup.json [07:28:40] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:33:08] (03CR) 10Slyngshede: [C:03+2] P:idp Remove CAS 6.6 test hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1064335 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [07:51:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T371742)', diff saved to https://phabricator.wikimedia.org/P67563 and previous config saved to /var/cache/conftool/dbconfig/20240822-075144-ladsgroup.json [07:51:48] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:00:05] andre and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T0800). 
[08:01:31] (03PS1) 10KartikMistry: Section Translation: Fix some language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064696 [08:06:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P67564 and previous config saved to /var/cache/conftool/dbconfig/20240822-080651-ladsgroup.json [08:07:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:08:57] I will now start promoting group2 wikis to 1.43.0-wmf.19 [08:09:13] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064697 (https://phabricator.wikimedia.org/T366964) [08:09:15] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064697 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [08:09:59] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064697 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [08:12:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:44] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.19 refs T366964 [08:16:47] T366964: 1.43.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T366964 [08:17:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:17] is https://gerrit.wikimedia.org/ unreachable, or is it just me? [08:21:55] aaand back [08:21:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P67565 and previous config saved to /var/cache/conftool/dbconfig/20240822-082158-ladsgroup.json [08:32:25] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:31] (03CR) 10Filippo Giunchedi: Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [08:36:33] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [08:36:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:37:01] (03PS1) 10Marostegui: Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064702 [08:37:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T371742)', diff saved to https://phabricator.wikimedia.org/P67566 and previous config saved to /var/cache/conftool/dbconfig/20240822-083706-ladsgroup.json [08:37:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:37:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:29] (03CR) 10Marostegui: [C:03+2] Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064702 (owner: 10Marostegui) [08:37:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:57] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test1002.wikimedia.org [08:40:18] (03PS1) 10Marostegui: Revert "db1236: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064703 [08:41:13] (03CR) 10Marostegui: [C:03+2] Revert "db1236: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064703 (owner: 10Marostegui) [08:41:29] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner package to 17.1 [puppet] - 10https://gerrit.wikimedia.org/r/1064638 (https://phabricator.wikimedia.org/T373074) (owner: 10Jelto) [08:41:36] (03PS1) 10Marostegui: Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064706 [08:43:15] (03CR) 10Marostegui: [C:03+2] Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1064706 (owner: 10Marostegui) [08:44:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10083924 (10ABran-WMF) after a quick chat with @cmooney, I've taken inventory of the 87 servers to handle: |**rack**|**node**|**cluster**| |`C1`|db2207|... [08:44:33] (03CR) 10LSobanski: "A no-op comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064638 (https://phabricator.wikimedia.org/T373074) (owner: 10Jelto) [08:44:42] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [08:47:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:54] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [08:48:34] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [08:48:34] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:48:34] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1002.wikimedia.org [08:48:55] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:41] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts idp-test2002.wikimedia.org [08:52:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:52:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:53:24] (03CR) 10LSobanski: [C:03+1] aptrepo: upgrade gitlab-ce and gitlab-runner package to 17.1 
[puppet] - 10https://gerrit.wikimedia.org/r/1064638 (https://phabricator.wikimedia.org/T373074) (owner: 10Jelto) [08:53:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:53:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:54:28] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [08:55:43] (03PS1) 10JMeybohm: site.pp: Split node blocks of new kafka nodes into two [puppet] - 10https://gerrit.wikimedia.org/r/1064714 (https://phabricator.wikimedia.org/T363210) [08:55:45] (03PS1) 10JMeybohm: kafka-main: Replace kafka-main2001 with kafka-main2006 [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) [08:57:11] (03CR) 10JMeybohm: [C:03+2] Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:57:14] (03CR) 10JMeybohm: [C:03+2] Add policy to allow only SYS_PTRACE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:57:18] (03CR) 10JMeybohm: [C:03+2] Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:57:35] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [08:57:54] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" 
[08:57:54] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:57:55] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2002.wikimedia.org [08:58:30] (03Merged) 10jenkins-bot: Initial commit validating-admission-policies chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:58:31] (03Merged) 10jenkins-bot: Add policy to allow only SYS_PTRACE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:58:33] (03Merged) 10jenkins-bot: Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [08:59:45] (03CR) 10JMeybohm: [C:03+2] Prometheus: Add recording rules computing commonly used envoy histograms [puppet] - 10https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [08:59:47] (03CR) 10JMeybohm: [C:03+2] Prometheus: Add recording rules for istio ingress metrics [puppet] - 10https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [09:05:53] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081 (10LSobanski) 03NEW [09:08:16] (03PS1) 10JMeybohm: Revert "Prometheus: Add recording rules computing commonly used ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064716 (https://phabricator.wikimedia.org/T369607) [09:08:20] (03PS1) 10JMeybohm: Revert "Prometheus: Add recording rules for istio ingress metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1064717 (https://phabricator.wikimedia.org/T369607) [09:13:20] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:17:28] (03PS1) 10JMeybohm: Prometheus: Add missing rules key for rules added in... [puppet] - 10https://gerrit.wikimedia.org/r/1064719 (https://phabricator.wikimedia.org/T369607) [09:17:56] (03Abandoned) 10JMeybohm: Revert "Prometheus: Add recording rules computing commonly used ..." [puppet] - 10https://gerrit.wikimedia.org/r/1064716 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [09:17:56] (03Abandoned) 10JMeybohm: Revert "Prometheus: Add recording rules for istio ingress metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1064717 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [09:19:39] (03CR) 10Filippo Giunchedi: [C:03+1] Prometheus: Add missing rules key for rules added in... [puppet] - 10https://gerrit.wikimedia.org/r/1064719 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [09:20:06] (03CR) 10JMeybohm: [C:03+2] Prometheus: Add missing rules key for rules added in... 
[puppet] - 10https://gerrit.wikimedia.org/r/1064719 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [09:20:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:23:43] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:24:26] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:26:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [09:26:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance [09:26:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T370903)', diff saved to https://phabricator.wikimedia.org/P67567 and previous config saved to /var/cache/conftool/dbconfig/20240822-092631-ladsgroup.json [09:26:32] (03CR) 10Brouberol: kafka-main: Replace kafka-main2001 with kafka-main2006 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:26:35] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:29:23] (03PS2) 10JMeybohm: kafka-main: Replace kafka-main2001 with kafka-main2006 [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) [09:29:49] (03CR) 10JMeybohm: kafka-main: Replace kafka-main2001 with kafka-main2006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:30:00] (03CR) 10JMeybohm: kafka-main: Replace kafka-main2001 with kafka-main2006 (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:32:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [09:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:29] !log start prometheus2006 bookworm upgrade - T326657 [09:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:32] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [09:39:23] (03CR) 10Brouberol: [C:03+1] kafka-main: Replace kafka-main2001 with kafka-main2006 [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:40:19] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Cannot move Commons File:Dhruve_Sehgal_in_2021.png - https://phabricator.wikimedia.org/T372924#10084073 (10MatthewVernon) I've looked at this on swift now. The existing object is present and correct in both DCs: ` root@ms-fe2009:/home/mvernon# swif... 
[09:41:33] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Hardware refresh [09:41:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Hardware refresh [09:44:27] FIRING: [4x] ProbeDown: Service puppetmaster2001:8141 has failed probes (http_puppetmaster2001_codfw_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:44] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:46:54] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064721 [09:49:08] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [09:49:22] Emperor, hnowlan: FYI: I'm going to start replacing nodes in kafka-main (T363210) [09:49:23] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210 [09:49:38] (03CR) 10JMeybohm: [C:03+2] site.pp: Split node blocks of new kafka nodes into two [puppet] - 10https://gerrit.wikimedia.org/r/1064714 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:49:41] (03CR) 10JMeybohm: [C:03+2] kafka-main: Replace kafka-main2001 with kafka-main2006 [puppet] - 10https://gerrit.wikimedia.org/r/1064715 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [09:51:39] jayme: ack, thanks [09:51:47] in codfw that is [09:53:04] !log filippo@cumin1002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [09:53:35] 06SRE, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#10084113 (10ayounsi) Jelto made me aware of that task. I cleared the report's error by de-attaching the IP from the interface in Netbox, so it matches what we currently have confi... [09:54:27] RESOLVED: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:57:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T370903)', diff saved to https://phabricator.wikimedia.org/P67568 and previous config saved to /var/cache/conftool/dbconfig/20240822-095730-ladsgroup.json [09:57:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:59:44] (03PS1) 10Hnowlan: Remove role::common::core_platform, change Core Platform references to ServiceOps [puppet] - 10https://gerrit.wikimedia.org/r/1064725 [09:59:48] (03PS1) 10Clément Goubert: httpbb: Add /api/ to appservers tests [puppet] - 10https://gerrit.wikimedia.org/r/1064724 (https://phabricator.wikimedia.org/T373048) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1000) [10:00:44] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:47] 
(03CR) 10Ladsgroup: [C:03+1] "very rich" [puppet] - 10https://gerrit.wikimedia.org/r/1064724 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:01:18] (03CR) 10Hnowlan: [C:03+1] httpbb: Add /api/ to appservers tests [puppet] - 10https://gerrit.wikimedia.org/r/1064724 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:02:21] (03PS1) 10Clément Goubert: mediawiki: Get rid of obsolete extract2.php redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064723 (https://phabricator.wikimedia.org/T373048) [10:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#10084141 (10Jelto) Thanks @ayounsi ! So the netbox report is unblocked again. I think the easiest way is to fix the netmask when we switched GitLab to... 
[10:03:35] (03CR) 10CI reject: [V:04-1] Remove role::common::core_platform, change Core Platform references to ServiceOps [puppet] - 10https://gerrit.wikimedia.org/r/1064725 (owner: 10Hnowlan) [10:03:54] (03CR) 10Clément Goubert: [C:03+2] httpbb: Add /api/ to appservers tests [puppet] - 10https://gerrit.wikimedia.org/r/1064724 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:06:08] Expect some sad httpbb noises, they're expected [10:06:31] (03PS1) 10Btullis: cephosd: Remove MD RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) [10:06:38] expect^2 [10:06:55] (03CR) 10CI reject: [V:04-1] cephosd: Remove MD RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:07:15] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3719/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:07:31] (03CR) 10Btullis: cephosd: Remove MD RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:08:17] (03PS2) 10Hnowlan: Remove role::common::core_platform, s/Core Platform/ServiceOps/g [puppet] - 10https://gerrit.wikimedia.org/r/1064725 [10:09:00] jayme: paged for kafka-main2006 [10:09:09] jayme: darn it, hnowlan is still quicker than me [10:09:12] sorry [10:09:19] seriously, though, is that expected/OK ? [10:09:20] all good? 
[10:09:26] yep [10:09:32] lemme downtime it [10:09:37] <3 [10:09:57] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main2006.codfw.wmnet with reason: Hardware refresh [10:10:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main2006.codfw.wmnet with reason: Hardware refresh [10:10:16] !sirenbot incidents [10:10:31] jouncebot: nowandnext [10:10:31] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1000) [10:10:31] In 1 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1200) [10:10:40] !incidents [10:10:40] 5097 (ACKED) kafka-main2006/Kafka Broker Server (paged) [10:10:41] 5092 (RESOLVED) [2x] Primary outbound port utilisation over 80% (paged) global noc () [10:10:41] 5093 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [10:10:41] 5091 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [10:10:41] 5089 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [10:10:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:11:06] lol, thanks [10:11:11] already acked I see [10:11:14] !log cr1-eqiad> request vmhost snapshot recovery re0 - T372781 [10:11:16] jayme: yeah, I did that. 
[10:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:17] downtime is set [10:11:18] T372781: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781 [10:11:19] thanks [10:11:33] I'll leave it ack'd and it'll auto-resolve once you're done (hopefully ;) ) [10:11:36] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Get rid of obsolete extract2.php redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064723 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:11:39] (03CR) 10Hnowlan: [C:03+1] mediawiki: Get rid of obsolete extract2.php redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064723 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:11:52] (03PS2) 10Btullis: cephosd: Remove MD RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) [10:12:32] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3720/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:12:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67569 and previous config saved to /var/cache/conftool/dbconfig/20240822-101237-ladsgroup.json [10:14:04] (03Merged) 10jenkins-bot: mediawiki: Get rid of obsolete extract2.php redirect [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064723 (https://phabricator.wikimedia.org/T373048) (owner: 10Clément Goubert) [10:15:28] !log cr1-eqiad> request vmhost snapshot recovery partition re0 - T372781 [10:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:29] (03PS2) 10Hnowlan: Use shellbox-video for videoscaling on group2 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) [10:16:44] !log cr1-eqiad> request vmhost power-off other-routing-engine - T372781 [10:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:52] T372781: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781 [10:17:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [10:18:35] !log cgoubert@deploy1003 Started scap sync-world: mediawiki: Get rid of obsolete extract2.php redirect - 1064723 - T373048 [10:18:38] T373048: https://en.wikipedia.org/api/ 404 Not Found - https://phabricator.wikimedia.org/T373048 [10:18:49] (03PS1) 10JMeybohm: Add replacement kafka nodes to kafka_brokers_main [puppet] - 10https://gerrit.wikimedia.org/r/1064730 (https://phabricator.wikimedia.org/T363210) [10:18:56] !log cr1-eqiad> request vmhost power-on other-routing-engine - T372781 [10:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:15] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064730 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [10:19:39] !log cgoubert@deploy1003 cgoubert: mediawiki: Get rid of obsolete extract2.php redirect - 1064723 - T373048 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:20:25] httpbb checks ok on mwdebug, proceeding [10:20:39] !log cgoubert@deploy1003 cgoubert: Continuing with sync [10:22:17] (03PS1) 10Slyngshede: P:idp Remove old CAS 6.6 hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) [10:24:09] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3721/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [10:24:15] (03CR) 10JMeybohm: [C:03+2] Add replacement kafka nodes to kafka_brokers_main [puppet] - 10https://gerrit.wikimedia.org/r/1064730 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [10:24:18] !log cgoubert@deploy1003 Finished scap sync-world: mediawiki: Get rid of obsolete extract2.php redirect - 1064723 - T373048 (duration: 05m 43s) [10:24:22] T373048: https://en.wikipedia.org/api/ 404 Not Found - https://phabricator.wikimedia.org/T373048 [10:24:53] (03CR) 10CI reject: [V:04-1] P:idp Remove old CAS 6.6 hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [10:25:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:26:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:26:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T371742)', diff saved to https://phabricator.wikimedia.org/P67570 and previous config saved to /var/cache/conftool/dbconfig/20240822-102613-ladsgroup.json [10:26:19] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:26:35] (03CR) 10Slyngshede: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [10:27:00] (03PS1) 10Jelto: Revert "gerrit: re-enable throttling over 1000 packets per minute" [puppet] - 
10https://gerrit.wikimedia.org/r/1064732 (https://phabricator.wikimedia.org/T365259) [10:27:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P67571 and previous config saved to /var/cache/conftool/dbconfig/20240822-102744-ladsgroup.json [10:27:57] (03CR) 10Jelto: [C:03+2] Revert "gerrit: re-enable throttling over 1000 packets per minute" [puppet] - 10https://gerrit.wikimedia.org/r/1064732 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [10:29:35] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10084199 (10ayounsi) Disk is gone : `name=show vmhost hardware re0 re0: [...] Item Capacity Part number... [10:31:37] (03CR) 10Clément Goubert: [C:03+1] Remove role::common::core_platform, s/Core Platform/ServiceOps/g [puppet] - 10https://gerrit.wikimedia.org/r/1064725 (owner: 10Hnowlan) [10:31:44] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10084200 (10ayounsi) a:03ayounsi [10:32:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:32:32] hmmm [10:33:28] (03CR) 10Hnowlan: [C:03+1] mobileapps: Configure caching for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [10:35:39] hnowlan: heads up [10:36:06] claime: ack, ty [10:37:15] RESOLVED: [2x] 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:37:28] "Mcrouter never breaks™️, Memcached never breaks too™️. Except from when they do. " thanks, docs ~_~ [10:37:39] yeaah [10:40:35] not seeing anything specific in logstash for mw-mcrouter [10:42:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T370903)', diff saved to https://phabricator.wikimedia.org/P67572 and previous config saved to /var/cache/conftool/dbconfig/20240822-104252-ladsgroup.json [10:42:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:42:56] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:43:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [10:43:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T370903)', diff saved to https://phabricator.wikimedia.org/P67573 and previous config saved to /var/cache/conftool/dbconfig/20240822-104314-ladsgroup.json [10:43:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:43:25] (03CR) 10FNegri: [C:03+1] cephosd: Remove MD RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:44:23] (03CR) 10Btullis: [V:03+1 C:03+2] cephosd: Remove MD 
RAID metadata from devices prior to install [puppet] - 10https://gerrit.wikimedia.org/r/1064727 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [10:48:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:49:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [11:05:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T371742)', diff saved to https://phabricator.wikimedia.org/P67574 and previous config saved to /var/cache/conftool/dbconfig/20240822-110526-ladsgroup.json [11:05:30] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:10:21] (03PS1) 10Btullis: cephosd: Disable swap devices prior to removing MD RAID metadata [puppet] - 10https://gerrit.wikimedia.org/r/1064735 (https://phabricator.wikimedia.org/T372783) [11:11:40] (03CR) 10Btullis: [C:03+2] cephosd: Disable swap devices prior to removing MD RAID metadata [puppet] - 10https://gerrit.wikimedia.org/r/1064735 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [11:12:05] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [11:14:40] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2011.codfw.wmnet [11:15:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2011.codfw.wmnet [11:15:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T370903)', diff saved to 
https://phabricator.wikimedia.org/P67575 and previous config saved to /var/cache/conftool/dbconfig/20240822-111531-ladsgroup.json [11:15:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:19:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2011.codfw.wmnet with OS bullseye [11:19:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [11:19:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [11:20:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P67576 and previous config saved to /var/cache/conftool/dbconfig/20240822-112033-ladsgroup.json [11:21:26] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:22:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye [11:22:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull... [11:23:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye... 
[11:24:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bullseye [11:24:54] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2011 - cgoubert@cumin1002" [11:24:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2011 - cgoubert@cumin1002" [11:24:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:24:58] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2011.codfw.wmnet 64.0.192.10.in-addr.arpa 4.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:25:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2011.codfw.wmnet 64.0.192.10.in-addr.arpa 4.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:25:02] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2011 [11:25:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bull... 
[11:27:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2011 [11:28:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [11:30:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67577 and previous config saved to /var/cache/conftool/dbconfig/20240822-113038-ladsgroup.json [11:35:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P67578 and previous config saved to /var/cache/conftool/dbconfig/20240822-113540-ladsgroup.json [11:36:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10084369 (10cmooney) a:05cmooney→03None [11:37:12] (03PS4) 10Cathal Mooney: lvs2013: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) [11:38:14] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [11:41:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084371 (10cmooney) [11:42:12] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage [11:45:02] FIRING: ProbeDown: Service eventgate-main:4492 has failed probes 
(http_eventgate-main_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#eventgate-main:4492 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage [11:45:11] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [11:45:19] * Emperor here [11:45:29] here [11:45:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P67579 and previous config saved to /var/cache/conftool/dbconfig/20240822-114546-ladsgroup.json [11:46:07] looks like eventgate_main is sad in codfw [11:46:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:46:40] https://wikitech.wikimedia.org/wiki/Runbook#eventgate-main:4492 not so helpful [11:46:53] jinxer-wm: I suspect this might be related to your work if you're about [11:47:05] Encountered rdkafka error event: broker transport failure [11:47:05] jayme ^ [11:47:11] sigh [11:47:11] thanks [11:47:24] timeouts connecting to kafka-main2006.codfw.wmnet:9093 [11:47:35] hmm [11:47:59] where is this from ...ah eventhate [11:48:07] eheh *gate [11:48:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [11:48:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:49:06] 2006 missing from the broker list in eventgate config [11:49:33] yeah...I suspect it get's the list of brokers from kafka [11:49:48] Is this a quick-fix, or should I open an incident doc? [11:49:49] and it does not (yet) have a network policy to connect to 2006 [11:50:11] Emperor: give me 30s, I think this is quick [11:50:17] 👍 [11:50:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:50:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T371742)', diff saved to https://phabricator.wikimedia.org/P67580 and previous config saved to /var/cache/conftool/dbconfig/20240822-115047-ladsgroup.json [11:50:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:50:51] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:51:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:51:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T371742)', diff saved to https://phabricator.wikimedia.org/P67581 and previous config saved to /var/cache/conftool/dbconfig/20240822-115108-ladsgroup.json [11:52:41] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[11:52:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [11:53:15] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:54:02] it'll need to be added to kafka_clusters in hiera to get picked up for network rules afaik [11:54:37] oh it's there [11:54:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [11:54:56] Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:57:34] eventgate-main probes in codfw still dead AFAICS [11:58:29] jayme: anything we can help with or look at? [11:59:09] hmm, I've deployed the new rule to codfw...oooh [11:59:23] maybe eventgate does not use the fancy new things [11:59:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1200) [12:00:13] which fancy new things? [12:00:16] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:00:27] hnowlan: the external-services helper [12:00:29] checking [12:00:39] (and deploying rules to all clusters) [12:00:46] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:00:48] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:00:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T370903)', diff saved to https://phabricator.wikimedia.org/P67582 and previous config saved to /var/cache/conftool/dbconfig/20240822-120053-ladsgroup.json [12:00:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:00:59] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:01:00] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:01:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:01:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:01:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:01:16] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:01:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T370903)', diff saved to https://phabricator.wikimedia.org/P67583 and previous config saved to /var/cache/conftool/dbconfig/20240822-120122-ladsgroup.json [12:01:33] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:01:34] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[12:02:06] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:02:08] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:02:14] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [12:02:21] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:02:22] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:02:24] hnowlan: according to deployment-charts it does. Where did you see the actual error? [12:02:32] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:02:33] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:02:45] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:02:46] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:02:49] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:02:50] jayme: the logs for eventgate-main (`kubectl logs eventgate-production-68576446dc-hk76t eventgate-main` for example) [12:02:50] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:02:53] in codfw [12:02:57] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:03:31] I assume it wouldn't need a roll_restart to pick up new rules? [12:04:43] no, it should not. 
But I think the issue is that it tries to connect to 2001 and fails [12:04:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:05:19] it may need both the external services and the change in configuration [12:06:29] I've killed one pod to see if it comes up again. Might already have a big exponential backoff [12:07:06] IMHO it should not need a config change immediately as it has 2001-2005 listed as brokers and 2002-2005 still work [12:07:16] but then...what do I know [12:07:36] jayme: in which case, would a roll-restart get things going again? [12:07:50] <-- knows very little about our k8s setups [12:07:51] last errors were at 11:47 [12:07:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095 (10cmooney) 03NEW p:05Triage→03Medium [12:07:56] so who knows what's going on [12:08:04] the pods are still in a bad state, just zero other log output [12:08:14] Emperor: yes, if the pod jayme killed comes back up, we would probably do a roll_restart [12:08:15] the new pod hasn't yet logged any errors [12:08:31] new pod is still unhealthy though [12:08:37] is it going to *hurt* to add the new broker? [12:08:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2011.codfw.wmnet with OS bullseye [12:08:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084514 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[12:08:58] hnowlan: unhealthy> OOI, how are you asking about its health? [12:09:05] (03PS1) 10Hnowlan: eventgate-main: add new kafka-main2006 instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 [12:09:14] Emperor: `kubectl get pods` [12:09:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10084515 (10cmooney) [12:09:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084516 (10cmooney) [12:09:29] and you can do `kubectl describe pod eventgate-production-68576446dc-bj726` to see more about the failures [12:09:33] You can see it has 1/2 Ready [12:10:00] eventgate-production-68576446dc-bj726 is the new one and has not spit out errors yet [12:10:11] hnowlan: i don't think it would hurt to change the brokers list, no [12:10:19] getting on the patch [12:10:24] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064740 [12:10:27] oh thanks [12:10:28] but I don't recall how long it takes to load the schema. something rings in the back of my head that it sometimes takes ages [12:10:56] 2001 is gone jayme right? 
[12:11:21] yes [12:11:45] and 2006-2010 will be added...although I might have to reconsider the process because of this [12:11:46] (03CR) 10Clément Goubert: "Removing 2001, it's gone" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 (owner: 10Hnowlan) [12:12:14] there are probably a gazillion other places with hardcoded connection strings for kafka [12:12:56] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1005.eqiad.wmnet with OS bookworm [12:13:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 (10cmooney) 03NEW p:05Triage→03Medium [12:13:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084536 (10cmooney) [12:13:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084537 (10cmooney) [12:13:21] 7 minutes stuck loading the schema [12:13:30] I don't know if that's normal but it seems like a long time [12:14:09] (03PS2) 10Hnowlan: eventgate-main: add new kafka-main2006 instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 [12:14:10] indeed [12:14:20] i just checked another pod [12:14:32] and it's not logging anything after Loading schema until the kafka error... 
[12:14:41] gr8 [12:14:49] consistent with other issues with eventgate though [12:14:52] ah..but still it's not fine [12:14:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:15:24] (03CR) 10Clément Goubert: [C:03+1] eventgate-main: add new kafka-main2006 instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 (owner: 10Hnowlan) [12:15:55] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [12:16:04] (03CR) 10Hnowlan: [C:03+2] eventgate-main: add new kafka-main2006 instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 (owner: 10Hnowlan) [12:16:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 (10cmooney) 03NEW p:05Triage→03Medium [12:16:48] brouberol: shouldn't the kafka client not care about one broker in its connection string missing? 
[12:16:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084559 (10cmooney) [12:16:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10084558 (10cmooney) [12:17:04] (03Merged) 10jenkins-bot: eventgate-main: add new kafka-main2006 instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064740 (owner: 10Hnowlan) [12:17:11] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1005.eqiad.wmnet with OS bookworm [12:17:22] in modern librdkafka client versions, yes. In ours, I wouldn't bet on it [12:17:34] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [12:18:02] I see expected config changes but nothing else fwiw [12:18:04] I'd need to see what the actual error is, but if the issue is a missing networkpolicy, the issue isn't the connection string I think [12:18:20] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [12:18:41] the issue is that the client connects to the first broker it can to ask for the cluster topology, which might contain 2006 in there, for which there's no networkpolicy enabling egress [12:18:48] yeah new pods are unhealthy too [12:18:48] brouberol: the network policy is up to date (allows connections to 2006 which is the new broker) [12:18:53] well, at least I think it is [12:19:21] brouberol: and the client does not know about 2006 because it was not in the connection string initially [12:19:40] so it won't try to go there for the first connection, but later on [12:19:50] right, but the connection string is only used to contact _one_ broker in the cluster (called the bootstrap broker) to ask for the whole cluster topology [12:20:15] yeah, got that 
[12:20:19] after which, the client will try to establish connection to all brokers being partition leaders on the consumed/produced to topics [12:20:32] which might contain 2006 if you've already started to replicate data to it [12:20:51] understood. And yes I did. [12:21:17] so, as it stands, is eventgate still crashing on communicating with 2006? Sorry, I had to take care of the kiddo and was afk for a bit [12:21:29] my impression was that a single missing broker is not a big deal, but maybe that's wrong with old library indeed [12:21:41] we're currently dropping all jobs from mediawiki in codfw, might need more hands if there's anyone who might know this better [12:21:44] brouberol: eventgate has been re-deployed with updated connection string [12:22:00] 2001 out, 2006 in [12:22:04] networkpolicy in place [12:22:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084589 (10Clement_Goubert) [12:22:26] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2011.codfw.wmnet [12:22:26] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2011.codfw.wmnet [12:23:04] the new pods are still unhappy, though, I think? They're Ready 1/2 in kubectl get pods output [12:23:31] yep [12:23:51] hnowlan: I think this is complex (and ongoing) enough we should open an incident. Agree? [12:23:58] I'm looking at https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All, 2006 seems to be catching up ok at least. What's the error y'all are seeing on eventgate? 
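A minimal sketch of the bootstrap behaviour described in the messages above, assuming a toy model rather than librdkafka's real logic: the connection string is only used to reach one bootstrap broker, and the client then has to reach every partition leader named in the returned topology. The function names, the data layout and the example leader assignment are hypothetical; only the host names come from the log.

```python
# Toy model of Kafka client bootstrapping; NOT librdkafka's actual code.
# Function names and data layout are hypothetical, for illustration only.

def first_reachable(bootstrap, reachable):
    """Return the first bootstrap broker that can be contacted, or None."""
    for host in bootstrap:
        if host in reachable:
            return host
    return None


def blocked_leaders(bootstrap, reachable, topic_leaders, egress_allowed):
    """List partition leaders the client needs but is not allowed to reach.

    topic_leaders maps partition number -> leader host, standing in for the
    cluster topology returned by whichever bootstrap broker answered.
    """
    if first_reachable(bootstrap, reachable) is None:
        raise RuntimeError("no bootstrap broker reachable")
    needed = set(topic_leaders.values())
    return sorted(needed - set(egress_allowed))


# Old connection string: 2001-2005. Host 2001 is gone, 2002-2005 still answer.
old_bootstrap = [f"kafka-main{n}" for n in range(2001, 2006)]
alive = {f"kafka-main{n}" for n in range(2002, 2007)}
# Assumed leader assignment: one partition already led by the new host.
leaders = {0: "kafka-main2002", 1: "kafka-main2006"}

# Egress policy that predates the new broker blocks the leader on 2006.
print(blocked_leaders(old_bootstrap, alive, leaders, old_bootstrap))
# Egress policy that includes 2006 blocks nothing.
print(blocked_leaders(old_bootstrap, alive, leaders, sorted(alive)))
```

In this model the stale connection string alone does no harm as long as one listed broker answers; what matters is whether egress to every leader is permitted, which matches the question being asked about the networkpolicy.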
[12:24:03] Emperor: yeah [12:24:12] brouberol: timeouts connecting to 2006 [12:24:12] +1 [12:24:12] Ack, I'll start a doc, [12:24:27] thank you [12:25:17] brouberol: we had timeouts connecting to 2006, like [12:25:18] {"name":"eventgate-wikimedia","hostname":"eventgate-production-68576446dc-qqpht","pid":1,"producer_type":"GuaranteedProducer","level":"WARN","rdkafka_facility":"FAIL","rdkafka_thread":"ssl://kafka-main2001.codfw.wmnet:9093/bootstrap","msg":"ssl://kafka-main2006.codfw.wmnet:9093/2001: Connection setup timed out in state CONNECT (after 30030ms in state CONNECT, 1 identical error(s) suppressed)","time":"2024-08-22T11:45:47.112Z","v":0} [12:25:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 (10cmooney) 03NEW p:05Triage→03Medium [12:25:41] https://docs.google.com/document/d/1hzbGhB_LoeenWX8UMVfuTaL3i8wMf-cfnIZtvLisnvg/edit#heading=h.95p2g5d67t9q <-- doc, I'll start backfilling [12:25:42] where do the external_services networkpolicy changes end up? I don't see them in the service networkpolicy [12:25:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10084619 (10cmooney) [12:25:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084620 (10cmooney) [12:25:59] hnowlan: kubectl -n external-services get ep kafka-main-codfw [12:26:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084622 (10Clement_Goubert) [12:26:08] jayme: cool, thanks! 
[12:26:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084623 (10Clement_Goubert) [12:26:31] hnowlan: the netpol in the system namespace just references that [12:26:32] kubectl -n eventgate-main get networkpolicies.crd.projectcalico.org eventgate-production-egress-external-services-kafka -o yaml [12:26:48] Oh, wait. I wonder if eventgate is using external_services and that we need to redeploy admin_ng to update the Calico NetworkPolicy controlling egress to kafka-main [12:26:54] quick summary, it populates a calico networkpolicy from the external-services endpoint [12:27:04] brouberol: j.ayme did that already [12:27:37] brouberol: that's done [12:27:42] ok, gotcha. And indeed I'm not seeing any diff [12:28:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 (10cmooney) 03NEW p:05Triage→03Medium [12:28:25] no actual timeouts coming from the new pods (yet) [12:28:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10084644 (10cmooney) [12:28:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T371742)', diff saved to https://phabricator.wikimedia.org/P67584 and previous config saved to /var/cache/conftool/dbconfig/20240822-122841-ladsgroup.json [12:28:43] but the logging on eventgate is always pretty terse so not sure if that means anything [12:28:45] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:29:03] what's the user-facing impact of this outage? [12:29:08] but the new pods are still failing readiness probes, right? [12:29:18] yep [12:29:47] the heck... 
[12:30:11] Emperor: pretty far-reaching, jobs and events aren't going to arrive consistently which hits a lot of stuff [12:30:19] so stranger still [12:30:24] the health checks are *timing out* [12:30:29] not even failing [12:30:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 (10cmooney) 03NEW p:05Triage→03Medium [12:30:57] hnowlan: I still see an old connection string in the configmap [12:31:01] just checked from the network namespace of one of eventgate's pause containers and I can openssl s_client -connect kafka-main2006.codfw.wmnet:9093 [12:31:27] jayme: the new pods have the new one [12:31:36] deploy is still in progress and will eventually fail [12:31:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10084667 (10cmooney) [12:31:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084668 (10cmooney) [12:32:59] would the lag on 2006 somehow cause other issues? [12:33:03] that's quite specific to eventgate though [12:33:11] the pods can connect to 2006 AFICS [12:33:22] *as far as I can see [12:33:24] root@mw2370:~# nsenter-container eventgate-production-748d5f457c-mvjp8 eventgate-main telnet 10.192.5.9 9093 [12:33:24] Trying 10.192.5.9... [12:33:24] Connected to 10.192.5.9. [12:33:24] Escape character is '^]'. [12:33:37] once this deploy fails I'll redeploy to try to up the log level [12:33:51] how is the readiness evaluated? [12:34:11] hnowlan: I should probably update statuspage, then: can you condense that to a short summary I can put in statuspage? [12:35:07] Emperor: hmm... changes may be delayed in appearing? inconsistencies. 
Some operations may fail [12:35:13] brouberol: just found this [12:35:14] # If test_events is set, EventGate will set up a /v1/_test/events [12:35:17] # route that will process these test_events as if they were POSTed [12:35:20] # to /v1/events. This is used for the k8s readinessProbe. [12:36:02] hnowlan: I don't get how new pods can have a different config than the configmap that is in k8s...am I missing something? [12:36:08] I'm inside a new pod, and the configmap isn't up to date [12:36:10] so does that mean that the readiness of the pod assumes that kafka is working well, as we're actually sending messages to it? [12:36:31] claime: ah [12:36:34] readiness specifically does a GET to /v1/_test/events [12:36:35] I was looking at kubectl -n eventgate-main get cm eventgate-production-config -o yaml |grep 2001 [12:36:36] runuser@eventgate-canary-78c5548965-25c76:/srv/service$ grep kafka /etc/eventgate/config.yaml [12:36:38] kafka: [12:36:40] metadata.broker.list: kafka-main2001.codfw.wmnet:9093,kafka-main2002.codfw.wmnet:9093,kafka-main2003.codfw.wmnet:9093,kafka-main2004.codfw.wmnet:9093,kafka-main2005.codfw.wmnet:9093 [12:36:41] claime: huh [12:36:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 (10cmooney) 03NEW p:05Triage→03Medium [12:36:59] maybe rollback is already in progress and helm cleaned up? 
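A small sketch of the check being done by hand above, comparing the `metadata.broker.list` a pod actually mounted with the list the deploy was meant to ship. The helper names are made up for illustration; the old list is the one grepped from the canary pod, and the new list is built from the "2001 out, 2006 in" description earlier in the log.

```python
# Hypothetical helpers for spotting a stale Kafka broker list in a pod config.

def parse_broker_list(value):
    """Split a comma-separated metadata.broker.list into host:port entries."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def broker_list_diff(mounted, intended):
    """Return (stale, missing): entries only in mounted, entries only in intended."""
    mounted_set = set(parse_broker_list(mounted))
    intended_set = set(parse_broker_list(intended))
    return sorted(mounted_set - intended_set), sorted(intended_set - mounted_set)


# List seen in /etc/eventgate/config.yaml inside the pod (from the log).
mounted = ",".join(f"kafka-main{n}.codfw.wmnet:9093" for n in range(2001, 2006))
# List the patch was meant to deploy: 2001 out, 2006 in.
intended = ",".join(f"kafka-main{n}.codfw.wmnet:9093" for n in range(2002, 2007))

stale, missing = broker_list_diff(mounted, intended)
print("stale:", stale)
print("missing:", missing)
```

A non-empty `stale` or `missing` result would mean the pod is still running with the pre-patch configuration, which is what was observed inside the new canary pod.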
[12:37:04] maybe [12:37:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10084691 (10cmooney) [12:37:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084692 (10cmooney) [12:37:10] https://manage.statuspage.io/pages/nnqjzz7cd4tj/incidents/58zltngkd6rm [12:37:13] ..2024_08_22_12_28_18.877610020 is the timestamp [12:37:27] I think we should try to update the configmap manually and roll_restart the pods [12:37:30] I can see the updated list in the diff [12:37:32] can't be worse [12:37:45] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [12:37:49] there we go [12:37:52] hi, I'm catching up [12:38:08] claime: yeah, let's do (I'm doing it) [12:38:14] ack [12:38:45] while you're at it, would you set log `level: debug` in logging please? [12:38:57] cdanis: long story short, eventgate borked following a change to kafka topology in codfw, it's been down for about an hour [12:39:10] hnowlan: ack [12:39:21] As far as I'm aware, 2006 is currently catching up as fast as possible, as it tends to happen when you replace the hardware but keep the broker id. 
This might negatively impact the producers, which might be causing the readiness failure, if they involve sending messages to kafka [12:39:55] that the heck...old pods are coming back [12:39:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 (10cmooney) 03NEW p:05Triage→03Medium [12:39:57] RESOLVED: ProbeDown: Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#eventgate-main:4492 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:58] jayme: did you do it? event produce rest picking up [12:40:08] s/rest/rate/ [12:40:11] rest is what I need [12:40:12] I did change the configmap, I did not restart pods [12:40:17] huh [12:40:23] lemme check something [12:40:28] some pods are healthy [12:40:35] jayme: something started scaling the replicaset [12:40:44] FIRING: [7x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:47] does someone have a helmfile running in another tab somewhere lol [12:40:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084727 (10cmooney) [12:40:49] canary is also back, I have not changed that config [12:40:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10084726 (10cmooney) [12:40:51] uh [12:41:04] it didn't change the kafka config in /etc/eventgate/config.yaml [12:41:06] wtf [12:41:10] 2006 looks to be almost caught up (in terms of disk usage % compared to 2001), so we might 
see the system slowing down and pods readiness coming back up [12:41:12] cdanis: that is probably from h.nowlans deploument later on [12:41:17] emphasis on _might_ [12:41:48] cdanis: *deployment earlier -sorry [12:42:20] brouberol: "o we might see the system slowing down and pods readiness coming back up" <- I don't get that, can you elaborate? [12:42:22] is eventgate smart enough to reload the config as is? [12:42:29] jayme: but... 2m59s Normal ScalingReplicaSet deployment/eventgate-production Scaled up replica set eventgate-production-68576446dc to 10 [12:42:37] ^ [12:42:38] 3 minutes ago hnowlan's helmfile run had stopped [12:43:08] pods all showing Ready 2/2 now [12:43:15] graphs coming back to nominal levels [12:43:28] https://phabricator.wikimedia.org/P67585 claime jayme I was watching live with `kubectl get events -w --all-namespaces | ts` [12:43:32] claime: yeah...dunno. But it's still the old pods coming back up - with the old config [12:43:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [12:43:44] Deployment eventgate-production in eventgate-main at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=eventgate-main&var-deployment=eventgate-production - ... [12:43:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:43:45] jayme: 2006 seems to be almost done with its data replication (when comparing its disk usage % with 2001). 
My assumption is that the full-speed replication is negatively affecting the system, and causing the eventgate readiness probe to fail, as it seems to send actual messages to kafka [12:43:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P67586 and previous config saved to /var/cache/conftool/dbconfig/20240822-124348-ladsgroup.json [12:43:49] yeah I have no idea what's happening [12:43:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventgate-main.svc.codfw.wmnet:4492 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:44:10] if the replication finishes and the system slows down, the readiness probe / sending event to kafka might start to work again [12:44:18] again, these are hypotheses [12:44:31] From a status POV, shall I move to "monitoring", given things look to be recovering? [12:44:39] yes [12:44:40] Emperor: yes [12:44:53] brouberol: thanks...I tripped on "slowing down" :D [12:44:54] mediawiki has recovered [12:44:56] we're talking about eventgate-main namespace, right? [12:44:59] cdanis: yep [12:44:59] yes [12:45:01] in codfw [12:45:02] I hate that it's eventgate-production pods in five different namespaces :P [12:45:04] yeah [12:45:11] jayme: for later. 
Let's sync on how we can make sure a replacement broker does not try to catch up at full speed [12:45:15] there's no eventgate-production replicaset that dates from today, which, seems odd [12:45:19] brouberol: ack [12:45:22] (kafka has a throttling mechanism) [12:45:28] cdanis: not really [12:45:45] there were no changes to the replicaset, only to a configmap [12:46:01] when you do that, helmfile will scale up and down replicasets, not remove and recreate them [12:46:04] {{done}} [12:46:08] hm okay [12:46:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:46:33] ok since we're all out of fire, I'm going to get lunch now x) [12:46:42] same :) [12:46:47] is there a task open or any document or anything? [12:46:49] can I close the incident ? [12:46:52] Emperor: yeah [12:46:58] cdanis: doc is https://docs.google.com/document/d/1hzbGhB_LoeenWX8UMVfuTaL3i8wMf-cfnIZtvLisnvg/edit [12:47:02] main thing is the readiness check for eventgate is kinda testing kafka rather than eventgate [12:47:05] * Emperor would welcome an expert review of that incident doc [12:47:09] cdanis: I think that's not correct. A change in spec (hash over the config) will force a new replicaset (replicaset/eventgate-production-748d5f457c in this case) to be scaled up [12:47:28] hnowlan: exactly. that check is broken by design [12:47:34] jayme: well the newest rs here is 23 days old [12:47:35] hnowlan: this, and also, if it would be possible to at least have a log after "Loading schema" that says it's done loading the schema it would be awesome [12:47:48] "Ready to serve" log messages are amazing [12:47:48] hnowlan: agreed [12:48:03] that's a good recipe for outage propagation [12:48:14] claime: that is because the new one is created, and the old one is scaled down (slowly) until it is 0. 
then it will be deleted [12:48:30] cdanis: *happy warcraft peon noises* [12:48:33] claime: yep, eventgate logging has been an issue in a few other outages [12:48:42] in today's case, it never reached 0. helm rolled back (e.g. deleted the new replicaset, scaled up the old one) [12:48:48] jayme: ah bad comprehension on my part then [12:49:04] that's why we still see the very old replicaset now [12:49:24] claime: what, you don't understand the operation mode of a tool that orchestrates the actions of a different orchestration tool six times removed? [12:49:47] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-main/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:49:50] aha [12:50:15] Aug 22 12:50:02 eventgate-main 0s Warning Unhealthy pod/eventgate-production-68576446dc-dqwbh Readiness probe failed: Get "http://10.194.190.201:8192/v1/_test/events": context deadline exceeded (Client.Timeout exceeded while awaiting headers) [12:50:17] Aug 22 12:50:02 eventgate-main 0s Warning Unhealthy pod/eventgate-production-68576446dc-vhhpc Readiness probe failed: Get "http://10.194.161.155:8192/v1/_test/events": context deadline exceeded (Client.Timeout exceeded while awaiting headers) [12:50:19] also [12:50:32] argh, those pods are all READY 1/2 again [12:50:38] yes [12:51:14] Sigh, shall I reopen the incident? 
[12:51:40] put it back in monitoring [12:51:45] it's kinda flapping afaict [12:51:47] {"name":"eventgate-wikimedia","hostname":"eventgate-production-68576446dc-xhkx5","pid":1,"producer_type":"GuaranteedProducer","level":"WARN","rdkafka_facility":"FAIL","rdkafka_thread":"ssl://kafka-main2001.codfw.wmnet:9093/bootstrap","msg":"ssl://kafka-main2006.codfw.wmnet:9093/2001: Connection setup timed out in state CONNECT (after 30030ms in state CONNECT, 1 identical error(s) suppressed)","time":"2024-08-22T11:47:10.182Z","v" [12:52:04] oh, that's old. sorry [12:52:32] the pods failing readiness checks are the old ones [12:52:47] * Emperor tries to work out how to do that [set a resolved incident back to monitoring] [12:53:15] cdanis: there's one of the new ones that got restarted [12:53:20] Last State: Terminated [12:53:22] Reason: Error [12:53:24] thanks. [12:53:37] recovering again a little? [12:53:47] do we want to roll forward the helm release? [12:53:49] it is pending again [12:53:55] debug logging isn't giving us much https://logstash.wikimedia.org/goto/004af28657d003ccf4e8a8e5f91c9639 [12:54:07] - metadata.broker.list: kafka-main2001.codfw.wmnet:9093,kafka-main2002.codfw.wmnet:9093,kafka-main2003.codfw.wmnet:9093,kafka-main2004.codfw.wmnet:9093,kafka-main2005.codfw.wmnet:9093 [12:54:09] + metadata.broker.list: kafka-main2002.codfw.wmnet:9093,kafka-main2003.codfw.wmnet:9093,kafka-main2004.codfw.wmnet:9093,kafka-main2005.codfw.wmnet:9093,kafka-main2006.codfw.wmnet:9093 [12:54:11] we need to apply that right [12:54:15] yeah, worth trying [12:54:18] hnowlan: you mean except spamming that it accepted events? 
x) [12:54:22] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [12:54:33] last time the condition never got accepted because the healthcheck never fully passed [12:54:49] [you can't: https://community.atlassian.com/t5/Statuspage-questions/Re-opening-Resolved-Incidents/qaq-p/1733356 ] [12:55:32] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [12:55:40] might be okay to keep it resolved for now Emperor it's ony some of the pods that sometimes fail readiness [12:55:47] eventgate-production-748d5f457c 10 10 9 37m [12:55:51] we're getting there [12:55:51] so we should be processing events [12:55:52] Emperor: let it go then, we'll see if it falls back down [12:55:56] ack [12:56:01] ok, we successfully deployed, this time [12:56:03] I've moved the incident doc back to monitoring/flapping [12:56:12] as of this writing all 10 *new* pods are healthy [12:56:13] cdanis: agree [12:56:33] well, flapping [12:56:37] great. [12:56:38] and immediately working...not like hanging for minutes after schema fetch [12:56:49] jayme: 7 of them just went unhealthy [12:56:53] and are now back [12:56:59] ah, nice [12:57:01] yeah they're flapping [12:57:27] brouberol: could it be that kafka still "knows" about 2001 (the host, not the broker-id)? [12:57:37] Readiness: http-get http://:8192/v1/_test/events delay=2s timeout=1s period=10s #success=1 #failure=3 [12:57:40] timeout 1s really? [12:57:44] aiui we can unset test_events which will disable the posting behaviour [12:57:51] timeout 1s is so aggressive [12:57:53] but I have zero idea if that will just break the readiness check [12:58:25] I've edited the readiness probe timeout to 10s [12:58:27] can we temporarily either relax the timeout or replace it with a tcpSocket check? [12:58:28] on the live rs [12:58:29] can/should we just toggle the readiness check off for now? 
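To make the `timeout=1s ... #failure=3` discussion above concrete, here is a rough simulation of how a readiness probe with those thresholds flips a container between ready and unready. It is a simplified model of the documented Kubernetes probe semantics (consecutive-failure and consecutive-success counters), not kubelet code, and the latency values are invented purely for illustration.

```python
# Simplified model of a Kubernetes readiness probe; not kubelet's real code.
# The probe latencies used below are made-up example values.

def readiness_timeline(latencies, timeout_s=1.0, failure_threshold=3,
                       success_threshold=1, initially_ready=True):
    """Return the ready/unready state after each probe.

    A probe counts as failed when the response takes longer than timeout_s.
    The container turns unready after failure_threshold consecutive failures
    and ready again after success_threshold consecutive successes.
    """
    ready = initially_ready
    failures = successes = 0
    timeline = []
    for latency in latencies:
        if latency < timeout_s:
            successes += 1
            failures = 0
            if successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if failures >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


# Hypothetical response times (seconds) for /v1/_test/events, one per probe.
latencies = [0.2, 1.5, 2.0, 3.0, 0.4, 1.2]

print(readiness_timeline(latencies, timeout_s=1.0))   # flaps with a 1s timeout
print(readiness_timeline(latencies, timeout_s=10.0))  # stays ready with 10s
```

With slow but eventually successful responses, a 1s timeout is enough to make pods flap, whereas a 10s timeout keeps them in rotation. As noted in the channel, relaxing the timeout only hides the symptom if real event production to Kafka is still slow.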
[12:58:31] thanks cdanis [12:58:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P67587 and previous config saved to /var/cache/conftool/dbconfig/20240822-125855-ladsgroup.json [12:59:02] jayme: let me think about how we could investigate this [12:59:19] because pods are immutable (lol), this `edit` of course actually creates a new rs, which is so far scaling up successfully [12:59:26] eventgate-production-5d55469d8d 10 10 10 66s [12:59:41] a good/clean way to ensure they don't is a cluster rolling-restart, but that might cause some turbulence, which might not be what we want atm [12:59:47] RESOLVED: [2x] HelmReleaseBadStatus: Helm release eventgate-main/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=eventgate-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:59:49] they are still flapping [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1300) [13:00:05] hnowlan and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] hello! 
I have 15 minutes until a meeting [13:00:11] just skimmed backscroll, looking at doc now [13:00:13] :) [13:00:21] o/ [13:00:36] actually going to lunch now that we have other people up to date on the issue [13:00:46] claime: ack [13:00:48] back in a bit [13:00:53] I'm going to do the following, I'm going to relax the readiness check timeout again, *and* scale up the rs [13:01:13] sgtm [13:01:22] i'm surprised also that the incorrect broker list would cause this, but perhaps it is as brouberol suggested and the replication to the new broker is causing kafka to be overloaded and not work? [13:01:37] re readiness check: you can disable it, but it will just slightly hide the outage , no? [13:01:41] anzx: deployment might be delayed until this outage is over with [13:01:47] Ok [13:01:48] the pods will come up, but real events will fail production [13:02:04] oh, I'm a dumbass, I edited the rs and not the deployment [13:02:48] also, looking at the logs it takes minutes to load all the schemata [13:03:07] not sure what the state of the thing is while it does that [13:03:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [13:03:44] Deployment eventgate-production in eventgate-main at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=eventgate-main&var-deployment=eventgate-production - ... [13:03:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:03:59] ottomata: the theory was that the replication delay meant that test_events wasn't accessible on 2006 rather than the service itself being unhealthy, but it seems like that's not the main factor [13:03:59] in the meantime, I'll try to throttle the replication for kafka-main2006. jayme: do we agree that this host has in fact the broker id 2001? 
[13:04:04] (will a deployer be needed for this window, or is someone else going to handle it after the outage?) [13:04:09] > also, looking at the logs it takes minutes to load all the schemata [13:04:10] do you mean the event schemas?! [13:04:13] brouberol: yes [13:04:14] that would be very surprising [13:04:14] ottomata: yes [13:04:19] jayme: ack [13:04:25] they are loaded from inside the image... [13:04:29] in eventgate main [13:04:31] (IIRC) [13:04:51] msg":"Loading schema at /test/event/1.0.0","time":"2024-08-22T12:58:20.256Z","v":0} [13:04:55] > test_events wasn't accessible on 2006 [13:04:55] wouldn't the leader just be on another broker? [13:04:59] "Loading schema at /mediawiki/revision/score/3.0.0","time":"2024-08-22T13:02:15.531Z","v":0} [13:05:08] i don't think it is doing ACKS=all...or is it? [13:05:09] the first and the last line I see in the logs [13:05:17] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1005.eqiad.wmnet with OS bookworm [13:05:42] oh jayme that is okay i think, there is a list of schemas to preload, otherwise they are loaded on demand when the first event for it comes in [13:06:09] ah, it really needs a "ready to serve requests" log message then :) [13:06:15] i think it has one? lemme see [13:07:19] cdanis: you did edit the deployment now, relaxing the readiness timeout? [13:07:49] jayme: yes [13:07:57] ah, okay [13:08:11] replicas: 20 and readinessProbe.timeoutSeconds: 10 [13:08:25] not sure why helm doesn't show diffs [13:08:41] codfw.eventgate-main.test.event leader is kafka-main2004 [13:08:43] `kubectl -n eventgate-main edit deployments.apps eventgate-production` still shows my edits there [13:09:39] cdanis: helm does not diff against the state of the world (unfortunately), we can discuss later [13:09:45] ... ! 
[13:09:53] yeah...don't make me cry [13:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:13] anyway, with the increased readinessprobe timeout, the canary pods are still flapping from time to time, but it looks like the edited ones aren't [13:11:33] readinessProbes with timeouts <5s are a bad idea IMO [13:11:53] probes checking backend systems are as well :) [13:11:56] okay, i think the problem with the readiness probe is that the default ACKS is -1, which is all. which means that all replicas must ACK the produce request. [13:12:00] hmm but [13:12:00] no [13:12:14] jayme: that one depends, but mostly I agree [13:12:14] wait no. 2006 isn't in the list of replicas for topic codfw.eventgate-main.test.event [13:12:28] oh is 2006 == 2001 broker id now? [13:12:55] yes, as the broker was provisioned as a replacement, with the old broker id [13:12:56] kafka-main2001 was broker id 2001, now kafka-main2006 is broker id 2001 [13:13:03] okay [13:13:21] and 2001 is in the ISR for codfw.eventgate-main.test.event then. 
[13:13:43] on my end, I keep getting ACL-denied when attempting to set a broker replication throttle on 2001 [13:13:52] I'll go around and fix all other connection strings I can find before things get ugly there as well with a restart or so [13:13:59] if 2006 is overloaded and not responding to the ACK, even though it is in the ISR, this would be a problem [13:14:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T371742)', diff saved to https://phabricator.wikimedia.org/P67588 and previous config saved to /var/cache/conftool/dbconfig/20240822-131402-ladsgroup.json [13:14:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:14:06] i think we need to change to ACKS=1 or ACKS=2 [13:14:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:14:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:14:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T371742)', diff saved to https://phabricator.wikimedia.org/P67589 and previous config saved to /var/cache/conftool/dbconfig/20240822-131425-ladsgroup.json [13:14:33] oh, was it using acks=all? [13:14:38] i think that is the default, yes. 
[13:14:43] and i don't see it set anywhere [13:14:46] (maybe I missed it) [13:15:09] acks=all with a low timeout and a loaded broker seems like a plausible explanation [13:15:12] indeed [13:15:22] brouberol: i have to do meeting now but maybe try changing that, it will be in the eventgate-main values.yaml file [13:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:35] let me have a look [13:15:36] well, the high timeout seems to have fixed it, ottomata [13:15:36] in conf.kafka.conf [13:15:44] short-term I'm not sure we're in a state where we need to block the backport window any further, agreed? [13:15:45] okay [13:15:45] the pods aren't flapping anymore [13:15:49] hnowlan: agreed [13:15:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:15:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1298.eqiad.wmnet with OS bullseye [13:15:57] jouncebot: now [13:15:57] For the next 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1300) [13:15:58] that could help too, cuz maybe 2006 is just takign too long to ACK [13:16:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1298.eqiad.wmnet with OS bullseye... 
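[annotation] The hypothesis discussed above (acks=all, a 1s probe timeout, and one loaded in-sync replica) can be illustrated with a toy latency model. This is a sketch under stated assumptions, not a description of how eventgate or its Kafka client actually behaves: the function names and every latency number are hypothetical, and the model ignores retries, batching, and min.insync.replicas.

```python
def produce_ack_latency(leader_latency_s, follower_latencies_s, acks):
    """Toy model of how long a producer waits for a produce request to be acknowledged.

    leader_latency_s:     time for the partition leader to persist the message
    follower_latencies_s: time for each in-sync follower to replicate it
    acks:                 0, 1, or "all" (-1 is accepted as an alias for "all")
    """
    if acks == 0:
        # Fire and forget: the producer does not wait for any broker.
        return 0.0
    if acks == 1:
        # Only the leader has to confirm the write.
        return leader_latency_s
    if acks in ("all", -1):
        # Every in-sync replica has to confirm, so the slowest one dominates.
        return max([leader_latency_s, *follower_latencies_s])
    raise ValueError(f"unsupported acks value: {acks!r}")


def probe_passes(ack_latency_s, probe_timeout_s):
    """A readiness probe that posts a test event passes only if the ack arrives in time."""
    return ack_latency_s <= probe_timeout_s


# Hypothetical numbers: a healthy leader, one healthy follower, and one
# follower that is busy catching up on replication.
leader = 0.05
followers = [0.08, 2.4]

for acks in (1, "all"):
    latency = produce_ack_latency(leader, followers, acks)
    print(
        f"acks={acks}: ack after {latency:.2f}s,",
        "1s probe passes:", probe_passes(latency, 1.0),
        "| 10s probe passes:", probe_passes(latency, 10.0),
    )
```

In this model the slow follower only matters with acks=all, and even then a 10s probe timeout absorbs it, which is consistent with both remedies floated in the channel (lowering acks or raising the timeout).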
[13:16:16] TheresNoTime: think we're okay to go ahead with the backport! [13:16:19] TheresNoTime: are you around to deploy? [13:16:24] efb [13:16:26] can do! [13:16:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084843 (10Jclark-ctr) [13:16:39] hnowlan: Hm, does that mean we can close the incident, or are y'all still working on it? [13:16:49] hnowlan: will start with yours [13:16:53] ottomata: imo a readiness probe timeout of 1s was broken from the start :) [13:16:53] Emperor: I'd say monitoring is okay for now [13:17:00] cdanis: that makes sense [13:17:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye [13:17:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:17:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull... [13:17:16] hnowlan: OK, we should either resolve by COB or make sure to pass onto the Americas oncallers [13:17:16] is anyone familiar with the kafka-main ACLs? I can't seem to be able to apply a config change to the cluster, even as the root user [13:17:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye [13:17:17] TheresNoTime: thanks! 
it only takes effect once it hits prod jobrunners so it doesn't need to be tested on mwdebug [13:17:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye... [13:17:22] ack [13:17:33] Emperor: don't worry 😎 i'm one of those [13:17:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye [13:17:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull... [13:17:53] cdanis: heh. Do you want to take IC at some point, then? [13:18:12] (03PS1) 10EoghanGaffney: gitlab: Allow backup script metrics call to fail [puppet] - 10https://gerrit.wikimedia.org/r/1064755 (https://phabricator.wikimedia.org/T371222) [13:18:13] (03Merged) 10jenkins-bot: Use shellbox-video for videoscaling on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064389 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:18:27] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]] [13:18:31] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:18:38] (03PS1) 10Ayounsi: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) [13:18:45] Emperor: sure, I am IC [13:18:53] <3 [13:19:04] (03PS2) 10Ayounsi: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 
(https://phabricator.wikimedia.org/T348036) [13:19:45] (ping me if you need anything) [13:20:10] the doc was missing an impact statement, I made a guess at one but I wasn't around for the early part of it to have good context [13:20:51] (03CR) 10CI reject: [V:04-1] Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi) [13:21:04] I wasn't entirely sure on the impact, either, hence the somewhat vague statuspage updates [13:21:31] ack [13:21:44] (03PS3) 10Ayounsi: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) [13:23:02] !log samtar@deploy1003 hnowlan, samtar: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:07] !log samtar@deploy1003 hnowlan, samtar: Continuing with sync [13:23:21] (03CR) 10CI reject: [V:04-1] Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi) [13:23:58] (03PS1) 10JMeybohm: Update various kafka-main connection strings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) [13:24:27] (03PS4) 10Ayounsi: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) [13:25:15] (03PS5) 10Ayounsi: Add devicetype validator [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064756 (https://phabricator.wikimedia.org/T348036) [13:26:25] (03PS3) 10Anzx: knwikisource : Create flood flag and add file importer right to Admin user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064556 (https://phabricator.wikimedia.org/T373073) [13:26:40] somebody has the courage for 
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064758 ? [13:27:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T370903)', diff saved to https://phabricator.wikimedia.org/P67590 and previous config saved to /var/cache/conftool/dbconfig/20240822-132717-ladsgroup.json [13:27:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:27:30] (03CR) 10Hnowlan: [C:03+2] shellbox-video, admin_ng: bump resource limits and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060104 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:27:38] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]] (duration: 09m 10s) [13:27:42] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:27:53] hnowlan: did you want to test now? [13:28:10] TheresNoTime: yep, thank you! [13:28:29] should I hold on incase I need to rollback or can I continue with another deployment? [13:28:46] TheresNoTime: I'd proceed - there's no guarantee I'll get a scaling job in the time we're waiting [13:28:49] thank you! [13:29:05] (03PS3) 10Slyngshede: D:apereo_cas::service allow exclusion of LDAP groups. [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [13:29:11] ack, starting yours now anzx [13:29:16] TheresNoTime: ok [13:29:19] andre: I suspect the tasks you're sorting through right now all have to do with the issues with eventgate earlier? 
[13:29:20] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10084880 (10ayounsi) a:03ayounsi Taking the task to create the validator [13:29:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064556 (https://phabricator.wikimedia.org/T373073) (owner: 10Anzx) [13:29:45] cdanis: sorry, I need more context I'm afraid [13:29:57] https://www.wikimediastatus.net/incidents/58zltngkd6rm [13:30:07] I think failing edits was one of the ways that manifested [13:30:23] (03Merged) 10jenkins-bot: knwikisource : Create flood flag and add file importer right to Admin user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064556 (https://phabricator.wikimedia.org/T373073) (owner: 10Anzx) [13:30:35] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1064556|knwikisource : Create flood flag and add file importer right to Admin user group (T373073)]] [13:30:39] T373073: knwikisource : Create flood usergroup and add file importer right to Admin user group - https://phabricator.wikimedia.org/T373073 [13:30:54] (03Merged) 10jenkins-bot: shellbox-video, admin_ng: bump resource limits and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060104 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:32:20] cdanis: ah, thanks! There's also a lot of stuff in Logstash which is hard to interpret... 
[13:32:38] !log samtar@deploy1003 anzx, samtar: Backport for [[gerrit:1064556|knwikisource : Create flood flag and add file importer right to Admin user group (T373073)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:41] TheresNoTime: checking [13:33:17] ack [13:33:39] (03CR) 10Brouberol: Update various kafka-main connection strings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [13:34:08] TheresNoTime: looks good [13:34:19] (03PS3) 10Brouberol: datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) [13:34:19] (03PS3) 10Brouberol: spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) [13:34:19] (03PS3) 10Brouberol: superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) [13:34:19] (03PS3) 10Brouberol: airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) [13:34:20] (03PS3) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) [13:34:21] (03PS2) 10Brouberol: cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) [13:34:25] (03PS2) 10Brouberol: cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 
(https://phabricator.wikimedia.org/T373000) [13:34:26] !log samtar@deploy1003 anzx, samtar: Continuing with sync [13:34:29] (03PS1) 10Brouberol: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) [13:34:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10084905 (10cmooney) [13:34:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [13:35:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084907 (10cmooney) [13:35:37] (03CR) 10JMeybohm: Update various kafka-main connection strings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [13:35:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084911 (10cmooney) [13:36:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10084920 (10cmooney) [13:36:16] (03CR) 10Brouberol: [C:03+1] Update various kafka-main connection strings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [13:36:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10084921 (10cmooney) [13:36:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate 
servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10084925 (10cmooney) [13:36:52] cdanis: i'm going to deploy the things in https://gerrit.wikimedia.org/r/1064758 - just in case [13:36:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10084926 (10cmooney) [13:37:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10084928 (10cmooney) [13:37:20] jayme: ack sounds good, I have seen some failing healthchecks for a few of those other services [13:37:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10084929 (10cmooney) [13:37:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [13:38:21] (03CR) 10JMeybohm: [C:03+2] Update various kafka-main connection strings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [13:38:26] (03PS2) 10Brouberol: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) [13:38:55] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064556|knwikisource : Create flood flag and add file importer right to Admin user group (T373073)]] (duration: 08m 20s) [13:38:59] T373073: knwikisource : Create flood usergroup and add file importer right to Admin user group - https://phabricator.wikimedia.org/T373073 [13:39:06] anzx: live on prod [13:39:29] TheresNoTime: thank you [13:39:36] cdanis: ack. 
There also is a mirrormaker alert still which I don't fully understand (but I have not looked in detail because of the eventgate mess) [13:39:55] (03Merged) 10jenkins-bot: Update various kafka-main connection strings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064758 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm) [13:40:21] might as well be because of the replication lag of the new broker [13:40:43] (03PS6) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [13:41:35] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:41:38] (03PS4) 10Slyngshede: D:apereo_cas::service allow exclusion of LDAP groups. [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [13:41:53] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:42:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67591 and previous config saved to /var/cache/conftool/dbconfig/20240822-134224-ladsgroup.json [13:42:29] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3723/co" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [13:42:29] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:42:36] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:42:38] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:42:44] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:43:12] (03PS5) 10Slyngshede: D:apereo_cas::service 
allow exclusion of LDAP groups. [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [13:43:27] !log UTC afternoon backport window closed [13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3724/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [13:43:57] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) (owner: 10Ahmon Dancy) [13:44:30] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:44:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3725/co" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [13:45:25] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:45:42] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:46:28] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:47:00] (03PS6) 10Slyngshede: D:apereo_cas::service allow exclusion of LDAP groups. 
[puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [13:47:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3726/co" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [13:48:57] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:50:10] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:51:05] (03PS1) 10Hnowlan: shellbox-video: remove emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064763 (https://phabricator.wikimedia.org/T357309) [13:51:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T371742)', diff saved to https://phabricator.wikimedia.org/P67592 and previous config saved to /var/cache/conftool/dbconfig/20240822-135111-ladsgroup.json [13:51:25] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:51:29] jayme: kafka2006 is now fully caught up and network traffic has massively gone down. Should we redeploy eventgate as it was? [13:52:18] brouberol: thanks! 
I'll deploy the other connection string changes I have in the queue and then come back to eventgate [13:52:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:53:12] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [13:54:14] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [13:54:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1296.eqiad.wmnet with OS bullseye [13:54:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10084970 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye... 
[13:55:51] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:57:14] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:57:23] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [13:57:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P67593 and previous config saved to /var/cache/conftool/dbconfig/20240822-135731-ladsgroup.json [13:58:00] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [13:58:08] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [13:59:29] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:05:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:06:03] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:06:14] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:06:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P67594 and previous config saved to /var/cache/conftool/dbconfig/20240822-140618-ladsgroup.json [14:06:24] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:06:35] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:07:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:09:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10085064 (10cmooney) [14:10:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:10:54] cdanis: I'm 
rolling back your manual changes to the eventgate-main production deployment (I'm changing replicas from 20 to 10, readiness probe timeout from 10 to 1) [14:10:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:11:03] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:11:05] jayme: ack [14:11:06] fwiw the 1s timeout is the k8s default [14:11:10] guh [14:11:15] (03PS2) 10Hnowlan: shellbox-video: remove emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064763 (https://phabricator.wikimedia.org/T357309) [14:11:15] I hate it [14:11:19] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:11:25] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:11:31] I read that between the lines :D 
[14:11:35] RESOLVED: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:11:41] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:12:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T370903)', diff saved to https://phabricator.wikimedia.org/P67595 and previous config saved to /var/cache/conftool/dbconfig/20240822-141239-ladsgroup.json [14:12:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance [14:12:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance [14:13:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T370903)', diff saved to https://phabricator.wikimedia.org/P67596 and previous config saved to /var/cache/conftool/dbconfig/20240822-141300-ladsgroup.json [14:13:12] :D [14:15:51] I'd like to run a modified version of a maintenance script in dry-run mode on testwiki, and if all looks well, on dewiki. I don't expect any issues, an almost identical version of this script runs against these wikis every day too. Can I do that or is there something going on right now, and I should come back later? 
[14:16:44] jouncebot: nowandnext [14:16:44] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [14:16:44] In 0 hour(s) and 43 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1500) [14:16:56] MichaelG_WMF: sure [14:17:05] thanks! [14:17:23] (03PS4) 10Brouberol: datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) [14:17:23] (03PS4) 10Brouberol: spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) [14:17:23] (03PS4) 10Brouberol: superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) [14:17:24] (03PS4) 10Brouberol: airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) [14:17:25] !log T372333, with I431d2aba14db9ab8931e21260cb2005c7276e2b8 checked out, running mwscript /home/migr/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=testwiki --dry-run --search-index --db-table [14:17:25] (03PS4) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) [14:17:27] (03PS3) 10Brouberol: cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) [14:17:31] (03PS3) 10Brouberol: cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 
(https://phabricator.wikimedia.org/T373000) [14:17:35] (03PS3) 10Brouberol: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) [14:19:13] that run looks much more sensible (though still interesting), let's run it against de-wiki, still as dry-run [14:19:20] !log T372333, with I431d2aba14db9ab8931e21260cb2005c7276e2b8 checked out, running mwscript /home/migr/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=dewiki --dry-run --search-index --db-table [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] T372333: de.wikipedia: Add Link unavailable due to a high-number of dangling records - https://phabricator.wikimedia.org/T372333 [14:19:39] (03CR) 10Btullis: [C:03+1] datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:21:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P67597 and previous config saved to /var/cache/conftool/dbconfig/20240822-142126-ladsgroup.json [14:23:04] Alright, my scripts are done ✅ [14:24:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10085124 (10Jclark-ctr) [14:24:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10085125 (10Jclark-ctr) 05Open→03Resolved [14:25:32] (03CR) 10Brouberol: [C:03+2] datahub: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064338 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:28:17] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:29:51] (03PS7) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [14:31:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:32:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [14:33:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10085178 (10Jclark-ctr) 05Open→03Resolved verified cable and link lights [14:36:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [14:36:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T371742)', diff saved to https://phabricator.wikimedia.org/P67598 and previous config saved to /var/cache/conftool/dbconfig/20240822-143633-ladsgroup.json [14:36:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [14:36:37] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:36:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [14:36:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T371742)', diff saved to https://phabricator.wikimedia.org/P67599 and previous config saved to /var/cache/conftool/dbconfig/20240822-143655-ladsgroup.json [14:37:05] (03PS1) 10Andrew Bogott: Openstack eqiad1: upgrade to 2024.1 'caracal' [puppet] - 10https://gerrit.wikimedia.org/r/1064771 
(https://phabricator.wikimedia.org/T369044) [14:37:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:37:58] (03PS2) 10Andrew Bogott: Openstack eqiad1: upgrade to 2024.1 'caracal' [puppet] - 10https://gerrit.wikimedia.org/r/1064771 (https://phabricator.wikimedia.org/T369044) [14:38:39] (03CR) 10Btullis: [C:03+1] spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:38:55] (03CR) 10Btullis: [C:03+1] superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:39:05] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:39:16] (03CR) 10Btullis: [C:03+1] growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:39:50] (03CR) 10Brouberol: [C:03+2] spark-history: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064339 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:40:25] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:40:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1238 (T370903)', diff saved to https://phabricator.wikimedia.org/P67600 and previous config saved to /var/cache/conftool/dbconfig/20240822-144036-ladsgroup.json [14:40:40] (03CR) 10Btullis: [C:03+1] cloudnative-pg-operator: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064373 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:40:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:40:54] (03CR) 10Btullis: [C:03+1] ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:40:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [14:41:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064771 (https://phabricator.wikimedia.org/T369044) (owner: 10Andrew Bogott) [14:41:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [14:42:38] (03PS1) 10Btullis: cephosd: Don't fail if /proc/mdstat doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) [14:42:44] (03PS1) 10Ayounsi: IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) [14:43:54] (03CR) 10Brouberol: [C:03+2] superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:44:31] (03CR) 10FNegri: [C:03+1] "I feel optimistic that this will be the last tweak required 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1064773 
(https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [14:44:37] (03CR) 10Brouberol: [C:03+1] cephosd: Don't fail if /proc/mdstat doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [14:44:58] (03CR) 10CI reject: [V:04-1] IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) (owner: 10Ayounsi) [14:45:07] (03Merged) 10jenkins-bot: superset: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064340 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:45:40] (03CR) 10Btullis: [C:03+2] cephosd: Don't fail if /proc/mdstat doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [14:46:31] (03CR) 10Btullis: [C:03+2] "Your optimism is contagious. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [14:46:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [14:47:20] (03PS2) 10Ayounsi: IP validator: don't allow empty dns on active mgmt interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064775 (https://phabricator.wikimedia.org/T339121) [14:47:25] 10SRE-tools, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10085226 (10ayounsi) a:03ayounsi [14:47:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [14:48:21] 10SRE-tools, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10085225 (10ayounsi) Added a relevant check in the IP validator. I used the following nbshell code on Netbox next to confirm that it wo... 
[14:49:16] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:49:45] (03PS5) 10Brouberol: airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) [14:50:45] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow-test-k8s: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064341 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:50:58] (03PS5) 10Brouberol: growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) [14:51:22] (03CR) 10JHathaway: [C:03+1] P:idp Remove old CAS 6.6 hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1064731 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [14:52:24] (03CR) 10Brouberol: [C:03+2] growthbook: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064342 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [14:53:38] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission an-coord1001.eqiad.wmnet and an-coord1002.eqiad.wmnet - https://phabricator.wikimedia.org/T373121#10085259 (10BTullis) [14:54:03] (03PS1) 10Brouberol: deployment_server: change the PG image tag to timestamp-sha@checksum [puppet] - 10https://gerrit.wikimedia.org/r/1064779 (https://phabricator.wikimedia.org/T373000) [14:54:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:55:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P67601 and 
previous config saved to /var/cache/conftool/dbconfig/20240822-145543-ladsgroup.json [14:59:54] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [15:00:02] (03PS4) 10Brouberol: cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) [15:00:04] andre and jeena: #bothumor I � Unicode. All rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1500). [15:01:22] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bookworm [15:01:54] (03CR) 10Brouberol: [V:03+2 C:03+2] cloudnative-pg-cluster: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064372 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [15:04:50] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:04:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:09:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:10:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', 
diff saved to https://phabricator.wikimedia.org/P67602 and previous config saved to /var/cache/conftool/dbconfig/20240822-151050-ladsgroup.json [15:12:46] cdanis: I think we can resolve the issue now fwiw. I'll follow up with action item task creation tomorrow. [15:13:40] I think the main culprit for this was indeed the readiness probe from eventgate-main [15:14:51] maybe ottomata can follow up about if it actually requires ACK from all - which we should absolutely change then [15:15:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T371742)', diff saved to https://phabricator.wikimedia.org/P67603 and previous config saved to /var/cache/conftool/dbconfig/20240822-151530-ladsgroup.json [15:15:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:17:38] (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [15:20:43] (03PS1) 10Tiziano Fogli: curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) [15:21:41] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main2006.codfw.wmnet [15:21:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2006.codfw.wmnet [15:22:44] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Hardware refresh [15:22:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kafka-main2001.codfw.wmnet with reason: Hardware refresh [15:25:55] (03PS2) 10Tiziano Fogli: curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) [15:25:58] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T370903)', diff saved to https://phabricator.wikimedia.org/P67604 and previous config saved to /var/cache/conftool/dbconfig/20240822-152558-ladsgroup.json [15:26:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance [15:26:05] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:26:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance [15:26:13] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: 10Tiziano Fogli) [15:26:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T370903)', diff saved to https://phabricator.wikimedia.org/P67605 and previous config saved to /var/cache/conftool/dbconfig/20240822-152620-ladsgroup.json [15:26:44] (03PS1) 10CDobbins: varnish: Fix "%error_body_content%" in error pages [puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) [15:27:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085381 (10cmooney) [15:29:38] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lsw1-c2-codfw.mgmt with reason: move lvs2013 from asw to lsw [15:29:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on lsw1-c2-codfw.mgmt with reason: move lvs2013 from asw to lsw [15:29:57] jayme: cdanis I added two action items to the incident report, and agree with one already there. so: [15:29:57] - stop using GET /_test/events for livenessProbe (but keep for readinessProbe) [15:29:57] - increase timeout on readinessProbe [15:29:57] - set kafka producer request.required.acks=2 [15:29:57] does that sound right? [15:30:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: move lvs2013 from asw to lsw [15:30:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085386 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4888a1d9-ee36-415c-a204-98c84040effe) set by... [15:30:14] we were using it for liveness probe??? [15:30:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085351 (10Papaul) 05Open→03Resolved a:03Papaul [15:30:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: move lvs2013 from asw to lsw [15:30:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085387 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=028cbb12-db86-4824-9084-463287cc8911) set by...
[15:30:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P67606 and previous config saved to /var/cache/conftool/dbconfig/20240822-153037-ladsgroup.json [15:31:09] !log disabling BGP on cr1-codfw and cr2-codfw towards lvs2013 in advance of host move to new switch T370927 [15:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:15] T370927: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927 [15:34:28] (03CR) 10Ottomata: [C:03+1] eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [15:35:20] ottomata: I disagree on calling out to kafka for the readiness probe, but we can maybe better discuss that on a task [15:35:48] (03PS2) 10CDobbins: varnish: Fix "%error_body_content%" in error pages [puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) [15:36:22] jayme: yeah this google doc discussion differaaant but I like it.
yeah task is good place to discuss [15:36:22] !log upgrading A:cp-ulsfo to ATS 9.2.5: T339134 [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:25] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [15:36:33] i'm not too opinionated, we can work it out for sure [15:36:49] !log add vlans to trunk port on lsw1-c2-codfw facing new lvs2013 link T370927 [15:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:52] T370927: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927 [15:37:02] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [15:37:14] eheh, yeah...like chatting in etherpad in the old days [15:37:22] heheh [15:41:59] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:45:08] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2013 - cmooney@cumin1002" [15:45:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2013 - cmooney@cumin1002" [15:45:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:20] (03PS3) 10CDobbins: varnish: Fix "%error_body_content%" in error pages [puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) [15:45:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P67607 and previous config saved to /var/cache/conftool/dbconfig/20240822-154544-ladsgroup.json [15:46:00] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2014.codfw.wmnet on 
all recursors [15:46:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2014.codfw.wmnet on all recursors [15:46:21] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2013.codfw.wmnet on all recursors [15:46:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2013.codfw.wmnet on all recursors [15:46:40] (03CR) 10Cathal Mooney: [C:03+2] lvs2013: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) (owner: 10Cathal Mooney) [15:48:21] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1005.eqiad.wmnet with OS bookworm [15:49:36] (03PS4) 10CDobbins: varnish: Fix "%error_body_content%" in error pages [puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) [15:50:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host lvs2013.codfw.wmnet with OS bullseye [15:50:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2013... [15:59:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T370903)', diff saved to https://phabricator.wikimedia.org/P67608 and previous config saved to /var/cache/conftool/dbconfig/20240822-155921-ladsgroup.json [15:59:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:59:36] (03CR) 10Ssingh: [C:03+1] "Looks good, nice work finally cleaning it up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) (owner: 10CDobbins) [16:00:05] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:06] (03PS1) 10Brouberol: Change mongodb image tag to one that .. includes mongodb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064794 (https://phabricator.wikimedia.org/T373000) [16:00:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T371742)', diff saved to https://phabricator.wikimedia.org/P67609 and previous config saved to /var/cache/conftool/dbconfig/20240822-160052-ladsgroup.json [16:00:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:01:01] 06SRE, 10SRE-Access-Requests: ngkountas user has same SSH key for cloud/prod - https://phabricator.wikimedia.org/T371372#10085537 (10jhathaway) @ngkountas were you able to create a new key? 
[16:01:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:01:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:01:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:01:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:01:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T371742)', diff saved to https://phabricator.wikimedia.org/P67610 and previous config saved to /var/cache/conftool/dbconfig/20240822-160131-ladsgroup.json [16:01:35] (03CR) 10Brouberol: [C:03+2] Change mongodb image tag to one that .. 
includes mongodb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064794 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [16:02:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:03:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:05:22] !log cdobbins@cumin1002:~$ sudo cumin 'A:cp' 'disable-puppet' 'merging CR 1064782' [16:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:09] (03PS1) 10Isabelle Hurbain-Palatin: Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) [16:07:38] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage [16:08:27] (03CR) 10Subramanya Sastry: [C:03+1] Rollout Parsoid Kartographer support on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [16:09:42] (03CR) 10CDobbins: [C:03+2] varnish: Fix "%error_body_content%" in error pages [puppet] - 10https://gerrit.wikimedia.org/r/1064782 (https://phabricator.wikimedia.org/T372473) (owner: 10CDobbins) [16:09:48] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=1) Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [16:10:00] ha [16:11:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage [16:14:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67611 and previous config saved to /var/cache/conftool/dbconfig/20240822-161429-ladsgroup.json [16:17:30] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Monday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064795 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [16:25:25] (03PS1) 10Cathal Mooney: Add new WME AWS IPs to wikimedia_nets varnish acl [puppet] - 10https://gerrit.wikimedia.org/r/1064797 (https://phabricator.wikimedia.org/T370294) [16:26:54] (03CR) 10Brouberol: [C:03+2] profile::kubernetes::deployment_server::mariadb_master_ips: Handle no match [puppet] - 10https://gerrit.wikimedia.org/r/1064416 (https://phabricator.wikimedia.org/T373040) (owner: 10Ahmon Dancy) [16:27:02] !log cdobbins@cumin1002:~$ sudo cumin -b11 'A:cp' 'run-puppet-agent --enable "merging CR 1064782"' [16:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] (03CR) 10Ssingh: [C:03+1] "Looks good thanks! Puppet is currently disabled on A:cp to merge another change; will merge shortly after that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064797 (https://phabricator.wikimedia.org/T370294) (owner: 10Cathal Mooney) [16:29:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67612 and previous config saved to /var/cache/conftool/dbconfig/20240822-162936-ladsgroup.json [16:32:38] (03PS1) 10JHathaway: clinic-duty: offboard gtzatchkova [puppet] - 10https://gerrit.wikimedia.org/r/1064800 (https://phabricator.wikimedia.org/T372767) [16:32:42] (03PS2) 10Cathal Mooney: Add new WME AWS IPs to wikimedia_nets varnish acl [puppet] - 10https://gerrit.wikimedia.org/r/1064797 (https://phabricator.wikimedia.org/T370294) [16:35:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2013.codfw.wmnet with OS bullseye [16:36:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085669 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2013.cod... [16:36:52] (03CR) 10Andrew Bogott: [C:03+2] Openstack eqiad1: upgrade to 2024.1 'caracal' [puppet] - 10https://gerrit.wikimedia.org/r/1064771 (https://phabricator.wikimedia.org/T369044) (owner: 10Andrew Bogott) [16:37:11] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10085667 (10Zabe) >>! In T372767#10076004, @Dzahn wrote: > - removed both use... 
[16:38:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T371742)', diff saved to https://phabricator.wikimedia.org/P67613 and previous config saved to /var/cache/conftool/dbconfig/20240822-163819-ladsgroup.json [16:38:24] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:44:27] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T370903)', diff saved to https://phabricator.wikimedia.org/P67614 and previous config saved to /var/cache/conftool/dbconfig/20240822-164443-ladsgroup.json [16:44:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance [16:44:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:44:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance [16:45:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T370903)', diff saved to https://phabricator.wikimedia.org/P67615 and previous config saved to /var/cache/conftool/dbconfig/20240822-164505-ladsgroup.json [16:45:32] 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#10085694 (10cmooney) If anyone is rebooting lvs hosts and has to deal with this the below shell script will make sure a working route is in use: ` #!/... 
[16:53:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P67616 and previous config saved to /var/cache/conftool/dbconfig/20240822-165328-ladsgroup.json [16:58:10] (03CR) 10Btullis: [C:03+2] "Sadly, our optimism was misplaced. It turns out that the mdraid devices and the LVM devices on them are not initialised by the time that t" [puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [16:59:03] (03CR) 10Ssingh: [C:03+2] Add new WME AWS IPs to wikimedia_nets varnish acl [puppet] - 10https://gerrit.wikimedia.org/r/1064797 (https://phabricator.wikimedia.org/T370294) (owner: 10Cathal Mooney) [16:59:33] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10085728 (10jhathaway) [17:03:05] something is wrong with puppetserver1002 [17:03:15] and by something I mean the extent of what I have seen or looked into: [17:03:20] ERROR: puppet-merge on puppetserver1002.eqiad.wmnet (ops) failed [17:03:27] and then failure on agent run as well [17:04:22] Error: Connection to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3 failed, trying next route: Request to https://puppetserver1002.eqiad.wmnet:8140/puppet/v3 failed after 7.232 seconds: Failed to open TCP connection to puppetserver1002.eqiad.wmnet:8140 (Connection timed out - connect(2) for "puppetserver1002.eqiad.wmnet" port 8140) [17:04:45] yeah host is down [17:05:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:05:56] oh boy [17:05:58] yeah [17:06:09] these are all failures connecting to puppetserver1002 [17:06:31] is it behind lvs2013? 
[17:06:49] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10085747 (10jhathaway) [17:06:49] topranks: in eqiad [17:06:51] ignore me - eqiad [17:06:53] yea [17:07:04] sukhe: the host responds to ping but ssh never answers, that usually means a userspace lockup [17:08:22] https://grafana.wikimedia.org/goto/Me4IDRqIR?orgId=1 uh oh [17:08:35] trying over virtual serial it doesn't return a password prompt after the login: [17:08:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P67617 and previous config saved to /var/cache/conftool/dbconfig/20240822-170835-ladsgroup.json [17:09:01] topranks: based on the dashboard sukhe posted, it's just hard swap thrashing [17:09:10] yea.... [17:10:07] I'm going to force-reboot it via mgmt [17:10:10] memory usage seems always to be maxed on it [17:10:10] I guess [17:10:13] cdanis: +! [17:10:16] +1 [17:10:54] !log removing no-longer-required vlans from ssw1-a1-codfw after lvs move T370927 [17:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:58] T370927: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927 [17:11:07] !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕐☕ sudo ipmitool -I lanplus -H "puppetserver1002.mgmt.eqiad.wmnet" -U root -E chassis power cycle [17:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:34] in case this doesn't help, I was looking at how to remove puppetmaster1002 [17:12:39] is it as simple as editing the relevant hiera? 
[17:12:44] basically profile::puppetmaster::backend::puppetservers [17:13:00] and frontend [17:14:15] keeping an eye on the console as it boots, nothing out-of-the-ordinary to report [17:14:39] back up [17:14:42] back up [17:14:43] yep [17:15:41] https://www.irccloud.com/pastebin/eao4mtnE/ [17:16:46] Aug 22 17:14:29 puppetserver1002 rsync[1240]: rsync: getaddrinfo: puppetserver1001.eqiad.wmnet 873: Temporary failure in name resolution [17:16:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T370903)', diff saved to https://phabricator.wikimedia.org/P67618 and previous config saved to /var/cache/conftool/dbconfig/20240822-171646-ladsgroup.json [17:16:50] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:17:06] name resolution? [17:18:16] sukhe: lol so [17:18:17] that can cause ssh to be blocked, and login in general (i.e. me not getting 'password' prompt on serial) [17:18:22] topranks: no it works now [17:18:32] I think systemd ran the timer before the network was actually online [17:18:49] sorry yeah that's related to the units that didn't start [17:18:54] ok [17:19:06] the swapping unlikely triggered by a dns problem [17:19:13] yeah that's after it rebooted, too [17:19:20] I reran sync-puppet-ca manually and it worked fine [17:19:31] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:39] can use network-online.target I think :> [17:20:00] yep I think so [17:20:09] yeah [17:20:12] !log sudo cumin -b11 "A:cp" "run-puppet-agent" rolling out CR 1064797: T370294 [17:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:15] sync-puppet-volatile also re-ran successfully just 
now [17:20:22] thanks for resolving it! [17:20:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:20:54] hooray [17:20:56] happy Thursday ^ :P [17:21:29] man it is just always something these past weeks [17:21:46] yeah we had this nice period of peace and quiet [17:21:47] and then [17:22:15] well, the "nice period of peace and quiet" was actually "s4 exploding 1-2x/week instead of 2-3x/day" [17:22:19] 🙃 [17:22:50] (03PS1) 10Btullis: cephosd: Assemble the MD RAID arrays, so that they can be removed [puppet] - 10https://gerrit.wikimedia.org/r/1064807 (https://phabricator.wikimedia.org/T372783) [17:23:04] yeah before that I meant [17:23:11] ha [17:23:33] (03CR) 10FNegri: [C:03+1] "😞" [puppet] - 10https://gerrit.wikimedia.org/r/1064773 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [17:23:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T371742)', diff saved to https://phabricator.wikimedia.org/P67619 and previous config saved to /var/cache/conftool/dbconfig/20240822-172342-ladsgroup.json [17:23:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:23:46] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:23:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:24:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085767 (10cmooney) a:05cmooney→03None All work completed, no issues to report. 
@Jhancock.wm @Papaul these two cross... [17:24:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T371742)', diff saved to https://phabricator.wikimedia.org/P67620 and previous config saved to /var/cache/conftool/dbconfig/20240822-172404-ladsgroup.json [17:24:27] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:31] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp40[37-40]* or cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [17:27:15] (03CR) 10FNegri: [C:03+1] "Hmm not sure if this will work, but worth a try?" [puppet] - 10https://gerrit.wikimedia.org/r/1064807 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [17:31:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67621 and previous config saved to /var/cache/conftool/dbconfig/20240822-173153-ladsgroup.json [17:34:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [reason: cookbook had failed as Puppet was disabled so pooling manually] [17:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:37:42] (03PS1) 10Hnowlan: shellbox-video, admin-ng: big increase in resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064811 (https://phabricator.wikimedia.org/T356241) [17:39:27] jouncebot: now 
[17:39:27] For the next 0 hour(s) and 20 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1700) [17:39:27] For the next 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1700) [17:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:40:57] ah, the jouncebot was sick when the window opened. I was wondering why I didn't see a ping. nothing for me to ship today anyway so all good. [17:42:27] (03CR) 10Scott French: [C:03+1] "Does what it says on the tin! LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064763 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [17:46:56] (03CR) 10Dzahn: [C:03+1] clinic-duty: offboard gtzatchkova [puppet] - 10https://gerrit.wikimedia.org/r/1064800 (https://phabricator.wikimedia.org/T372767) (owner: 10JHathaway) [17:47:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67622 and previous config saved to /var/cache/conftool/dbconfig/20240822-174701-ladsgroup.json [17:52:35] (03PS1) 10Scott French: php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) [17:52:36] (03PS1) 10Scott French: php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) [17:52:38] (03PS1) 10Scott French: 
php8.1-fpm-multiversion-base: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) [17:54:43] (03CR) 10Hnowlan: [C:03+2] shellbox-video: remove emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064763 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [17:55:42] (03Merged) 10jenkins-bot: shellbox-video: remove emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064763 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [17:56:58] jouncebot: nowandnext [17:56:58] For the next 0 hour(s) and 3 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1700) [17:56:58] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1700) [17:56:58] In 0 hour(s) and 3 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1800) [17:58:11] (03CR) 10JHathaway: [C:03+2] clinic-duty: offboard gtzatchkova [puppet] - 10https://gerrit.wikimedia.org/r/1064800 (https://phabricator.wikimedia.org/T372767) (owner: 10JHathaway) [18:00:05] andre and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T1800) [18:01:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T371742)', diff saved to https://phabricator.wikimedia.org/P67623 and previous config saved to /var/cache/conftool/dbconfig/20240822-180106-ladsgroup.json [18:01:14] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:02:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T370903)', diff saved to https://phabricator.wikimedia.org/P67624 and previous config saved to /var/cache/conftool/dbconfig/20240822-180208-ladsgroup.json [18:02:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [18:02:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:02:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [18:02:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P67625 and previous config saved to /var/cache/conftool/dbconfig/20240822-180230-ladsgroup.json [18:11:03] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:16:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P67626 and previous config saved to /var/cache/conftool/dbconfig/20240822-181613-ladsgroup.json [18:18:46] FIRING: [2x] ProbeDown: Service wdqs2024:443 has failed probes (http_wdqs_scholarly_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs2024:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:29] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2024.codfw.wmnet with reason: needs a data transfer [18:19:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on wdqs2024.codfw.wmnet with reason: needs a data transfer [18:20:25] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:21:07] (03PS1) 10Dzahn: prometheus: create text file export for nft throttling denylist length [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) [18:21:47] (03CR) 10CI reject: [V:04-1] prometheus: create text file export for nft throttling denylist length [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [18:24:19] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542#10085973 (10bking) a:05Jhancock.wm→03None [18:24:32] (03PS2) 10Dzahn: prometheus: create text file export for nft throttling denylist length [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) [18:24:40] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542#10085975 (10bking) 05Open→03Resolved a:03bking Per IRC conversation with @Papaul , I was able to get in to this host via `install-con... 
[18:28:35] (03PS1) 10Andrea Denisse: alert: Allow connections from the alert[12]002 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1064818 (https://phabricator.wikimedia.org/T372418) [18:28:49] (03PS1) 10Andrea Denisse: alert: Allow Apache2 connections for the alert[12]002 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1064821 (https://phabricator.wikimedia.org/T372418) [18:29:03] (03PS1) 10Andrea Denisse: alert: Add the alert[12]002 hosts as Icinga and AM partners [puppet] - 10https://gerrit.wikimedia.org/r/1064820 (https://phabricator.wikimedia.org/T372418) [18:30:25] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P67627 and previous config saved to /var/cache/conftool/dbconfig/20240822-183120-ladsgroup.json [18:31:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2024:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [18:35:25] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:28] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [18:35:39] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 11s) [18:36:20] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling 
upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp40[37-40]* or cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [18:36:21] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [18:36:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling neither afterwards [18:36:29] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [18:36:29] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=97) Rolling upgrade/restart of Apache Traffic Server on A:cp-ulsfo and not P{cp4044* or cp4052*} and A:cp for 9.2.5-1wm2 [18:57:08] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-main [18:58:19] (03CR) 10Ssingh: wdqs graph split: new A, PTR, and DYNA records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:01:22] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-scholarly [19:01:50] !log T364368 Pooled all wdqs main/scholarly hosts except wdqs2024, which won't be ready for another hour [19:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:02] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [19:02:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P67629 and previous config saved to /var/cache/conftool/dbconfig/20240822-190247-ladsgroup.json [19:03:04] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from 
cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:03:40] (03PS8) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) [19:03:58] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142 (10RMurthy) 03NEW [19:03:58] (03CR) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:05:05] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:05:21] (03CR) 10Ssingh: [C:03+1] wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:08:34] (03PS1) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) [19:10:29] (03CR) 10Ssingh: [C:03+1] "Looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:10:38] (03PS9) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) [19:10:38] (03PS2) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) [19:11:19] (03CR) 10Ssingh: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [19:17:39] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add wdqs2024 to scholarly pool [puppet] - 10https://gerrit.wikimedia.org/r/1064829 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [19:17:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67630 and previous config saved to /var/cache/conftool/dbconfig/20240822-191754-ladsgroup.json [19:17:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling neither afterwards [19:18:02] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [19:18:46] RESOLVED: [4x] ProbeDown: Service wdqs2023:443 has failed probes (http_wdqs_scholarly_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:25] FIRING: [6x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:25] RESOLVED: [6x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:42] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: cluster=wdqs-scholarly [19:31:13] !log T364368 Pooled wdqs2024 (its data transfer has completed successfully) [19:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:17] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [19:33:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67631 and previous config saved to /var/cache/conftool/dbconfig/20240822-193301-ladsgroup.json [19:35:11] (03PS1) 10JHathaway: puppet8: remove unused scap config file [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) [19:35:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:39:11] (03PS1) 10Ryan Kemper: wdqs: -main and -scholarly are different services [puppet] - 10https://gerrit.wikimedia.org/r/1064840 (https://phabricator.wikimedia.org/T364368) [19:43:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:46:28] (03PS1) 10Ryan 
Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [19:48:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P67632 and previous config saved to /var/cache/conftool/dbconfig/20240822-194808-ladsgroup.json [19:48:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1244.eqiad.wmnet with reason: Maintenance [19:48:13] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:48:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1244.eqiad.wmnet with reason: Maintenance [19:48:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T370903)', diff saved to https://phabricator.wikimedia.org/P67633 and previous config saved to /var/cache/conftool/dbconfig/20240822-194830-ladsgroup.json [19:49:06] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team, 13Patch-For-Review: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10086219 (10Dzahn) >>! In T372767#10085667, @Zabe wrote: >>>! In T372767#1007... 
[19:50:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:52:34] (03PS1) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [19:54:43] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [19:55:04] (03PS2) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T2000). [20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] hi. i saw some weird failures in CI when merging patches just a few minutes ago ("No space left on device"). hopefully this doesn't happen to the backports, but if it does, we might have to override the CI results. [20:00:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:01:43] any deployers around? 
[20:03:54] (03CR) 10Ahmon Dancy: [C:03+1] puppet8: remove unused scap config file [puppet] - 10https://gerrit.wikimedia.org/r/1064839 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:04:14] (03PS2) 10Ryan Kemper: wdqs: add service entries for -main and -scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1064841 (https://phabricator.wikimedia.org/T373145) [20:04:15] (03PS3) 10Ryan Kemper: wdqs: Prepare to configure the load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1064843 (https://phabricator.wikimedia.org/T364368) [20:07:53] MatmaRex: sure, I can deploy [20:08:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064827 (https://phabricator.wikimedia.org/T373100) (owner: 10Bartosz Dziewoński) [20:08:32] thanks cdanis [20:14:19] (03PS2) 10JHathaway: puppet8: mtail, check if notify is defined [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) [20:14:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:14:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:14:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:15:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T370903)', diff saved to https://phabricator.wikimedia.org/P67634 and previous config saved to /var/cache/conftool/dbconfig/20240822-201503-ladsgroup.json [20:15:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:17:20] !log imported php-luasandbox_4.1.2-1+wmf11u2 into component/php81 
- T372507 [20:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:24] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507 [20:18:28] !log imported php-wmerrors_2.0.0-1+wmf11u2 into component/php81 - T372507 [20:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:01] (03Merged) 10jenkins-bot: Revert "Invert logic on empty talk page" [extensions/DiscussionTools] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064827 (https://phabricator.wikimedia.org/T373100) (owner: 10Bartosz Dziewoński) [20:19:15] !log imported wikidiff2_1.14.1-2+wmf11u2 into component/php81 - T372507 [20:19:16] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1064827|Revert "Invert logic on empty talk page" (T373100)]] [20:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:21] T373100: On wiki talk pages using the Talkpageheader, emptystate message may be incorrectly added twice - https://phabricator.wikimedia.org/T373100 [20:20:35] MatmaRex: it's synced to k8s testservers [20:21:32] !log cdanis@deploy1003 matmarex, cdanis: Backport for [[gerrit:1064827|Revert "Invert logic on empty talk page" (T373100)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:49] cdanis: thanks, works as expected (testing at https://zh.wikipedia.org/wiki/Talk:黑神话:悟空) [20:21:54] !log cdanis@deploy1003 matmarex, cdanis: Continuing with sync [20:25:01] (03PS1) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [20:25:34] (03CR) 10CI reject: [V:04-1] wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) (owner: 10Ryan Kemper) [20:25:46] (03PS3) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 
10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) [20:26:16] (03PS2) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [20:26:21] (03CR) 10CI reject: [V:04-1] wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [20:26:33] !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064827|Revert "Invert logic on empty talk page" (T373100)]] (duration: 07m 16s) [20:26:39] T373100: On wiki talk pages using the Talkpageheader, emptystate message may be incorrectly added twice - https://phabricator.wikimedia.org/T373100 [20:27:52] (03PS4) 10Ryan Kemper: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) [20:28:59] (03PS1) 10Dwisehaupt: iginga: add fran2001 and frdb2004 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1064853 (https://phabricator.wikimedia.org/T369920) [20:29:16] (03PS2) 10Scott French: php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) [20:29:16] (03PS2) 10Scott French: php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) [20:29:16] (03PS2) 10Scott French: php8.1-fpm-multiversion-base: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) [20:30:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67635 and previous config saved to 
/var/cache/conftool/dbconfig/20240822-203010-ladsgroup.json [20:34:31] (03CR) 10Dzahn: [C:03+2] iginga: add fran2001 and frdb2004 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1064853 (https://phabricator.wikimedia.org/T369920) (owner: 10Dwisehaupt) [20:37:11] (03PS3) 10Ryan Kemper: wdqs: move -main and -scholarly to production [puppet] - 10https://gerrit.wikimedia.org/r/1064848 (https://phabricator.wikimedia.org/T364368) [20:44:27] FIRING: [6x] ProbeDown: Service puppetmaster2001:8140 has failed probes (http_puppetmaster2001_codfw_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67636 and previous config saved to /var/cache/conftool/dbconfig/20240822-204518-ladsgroup.json [20:46:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10086394 (10Dzahn) 05Resolved→03Open [20:46:59] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: name=ml-serve2002.codfw.wmnet [20:47:18] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: name=ml-serve2002.codfw.wmnet T365291 [20:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:21] T365291: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291 [20:48:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10086393 (10Dzahn) ml-serve2002 went down a couple hours ago. noticed in Icinga web UI but by pure chance. It feels like maybe nobody gets an actual notification about... 
[20:50:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#10086401 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097162 [20:50:34] (03CR) 10Dzahn: [C:03+2] "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=fran2" [puppet] - 10https://gerrit.wikimedia.org/r/1064853 (https://phabricator.wikimedia.org/T369920) (owner: 10Dwisehaupt) [20:53:23] are folks still doing backports [20:53:40] I have a quick config change i'd like to deploy if possible [20:54:28] cdanis ^ [20:54:47] cscott: sorry, been a long day and it's family time now [20:54:58] ok, no worries. [20:58:54] (03PS1) 10C. Scott Ananian: Turn on Parsoid read views for cswikivoyage and rowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064888 (https://phabricator.wikimedia.org/T371353) [21:00:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T370903)', diff saved to https://phabricator.wikimedia.org/P67637 and previous config saved to /var/cache/conftool/dbconfig/20240822-210025-ladsgroup.json [21:00:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [21:00:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:00:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1245.eqiad.wmnet with reason: Maintenance [21:01:50] (03CR) 10Ssingh: wdqs graph split: add discovery for active/active [dns] - 10https://gerrit.wikimedia.org/r/1064831 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [21:15:16] cscott: if you'd still like to get that out, i can deploy. 
[21:18:15] brennen: oh that would be fantastic: https://gerrit.wikimedia.org/r/1064888 [21:18:27] it will make our OKR hypothesis reports for this week a little nicer :) [21:18:40] cscott: ok, one moment [21:18:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064888 (https://phabricator.wikimedia.org/T371353) (owner: 10C. Scott Ananian) [21:19:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064888 (https://phabricator.wikimedia.org/T371353) (owner: 10C. Scott Ananian) [21:19:59] I added it to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240822T2000 retroactively [21:22:45] (03Merged) 10jenkins-bot: Turn on Parsoid read views for cswikivoyage and rowikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064888 (https://phabricator.wikimedia.org/T371353) (owner: 10C. Scott Ananian) [21:22:57] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1064888|Turn on Parsoid read views for cswikivoyage and rowikivoyage (T371353)]] [21:23:00] T371353: Deploy Parsoid Read Views for cs, hi, shn, ps wikivoyage - https://phabricator.wikimedia.org/T371353 [21:25:06] !log brennen@deploy1003 brennen, cscott: Backport for [[gerrit:1064888|Turn on Parsoid read views for cswikivoyage and rowikivoyage (T371353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:26] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:26:32] cscott: anything to test here? 
[21:27:02] yeah, i just tested that ro.wikivoyage.org and cs.wikivoyage.org now get the nice little "rendered with parsoid" indicator at the top of the page [21:27:07] so good to go, thanks [21:27:12] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:27:51] cscott: ack, thanks, continuing. (i did the same fwiw, but wasn't totally sure if there was anything else...) [21:27:52] and verified just for a sanity check that en.wikipedia.org does *not* get that, ie we switch ro/cs to parsoid without changing the universe. [21:28:02] !log brennen@deploy1003 brennen, cscott: Continuing with sync [21:28:19] brennen: nope, that's the check. thanks! [21:29:14] sure thing. [21:32:33] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064888|Turn on Parsoid read views for cswikivoyage and rowikivoyage (T371353)]] (duration: 09m 36s) [21:32:36] T371353: Deploy Parsoid Read Views for cs, tr, hi, shn, ps wikivoyage - https://phabricator.wikimedia.org/T371353 [21:33:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1247.eqiad.wmnet with reason: Maintenance [21:33:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1247.eqiad.wmnet with reason: Maintenance [21:34:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T370903)', diff saved to https://phabricator.wikimedia.org/P67638 and previous config saved to /var/cache/conftool/dbconfig/20240822-213406-ladsgroup.json [21:34:10] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:37:48] (03CR) 10Dwisehaupt: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1064853 (https://phabricator.wikimedia.org/T369920) (owner: 10Dwisehaupt) [21:38:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [21:39:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [21:39:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T371742)', diff saved to https://phabricator.wikimedia.org/P67639 and previous config saved to /var/cache/conftool/dbconfig/20240822-213909-ladsgroup.json [21:39:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:03:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T370903)', diff saved to https://phabricator.wikimedia.org/P67640 and previous config saved to /var/cache/conftool/dbconfig/20240822-220337-ladsgroup.json [22:03:46] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:08:38] (03PS1) 10C. 
Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064963 (https://phabricator.wikimedia.org/T372789) [22:18:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67641 and previous config saved to /var/cache/conftool/dbconfig/20240822-221844-ladsgroup.json [22:33:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67642 and previous config saved to /var/cache/conftool/dbconfig/20240822-223351-ladsgroup.json [22:48:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T370903)', diff saved to https://phabricator.wikimedia.org/P67643 and previous config saved to /var/cache/conftool/dbconfig/20240822-224859-ladsgroup.json [22:49:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1248.eqiad.wmnet with reason: Maintenance [22:49:03] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:49:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1248.eqiad.wmnet with reason: Maintenance [22:49:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T370903)', diff saved to https://phabricator.wikimedia.org/P67644 and previous config saved to /var/cache/conftool/dbconfig/20240822-224921-ladsgroup.json [23:05:05] FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:26:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2127 (T371742)', diff saved to https://phabricator.wikimedia.org/P67645 and previous config saved to /var/cache/conftool/dbconfig/20240822-232656-ladsgroup.json [23:27:00] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:27:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:12] (03PS1) 10Andrew Bogott: pdns.conf.erb: secondary=yes [puppet] - 10https://gerrit.wikimedia.org/r/1065037 [23:32:17] (03PS3) 10Scott French: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) [23:36:12] (03CR) 10Scott French: "And it seems I once again completely forgot about this patch :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [23:39:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1065038 [23:39:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1065038 (owner: 10TrainBranchBot) [23:42:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P67646 and previous config saved to /var/cache/conftool/dbconfig/20240822-234203-ladsgroup.json [23:52:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T370903)', diff saved to https://phabricator.wikimedia.org/P67647 and previous config saved to /var/cache/conftool/dbconfig/20240822-235231-ladsgroup.json [23:52:35] 
T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:57:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P67648 and previous config saved to /var/cache/conftool/dbconfig/20240822-235711-ladsgroup.json