[08:23:10] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) [08:23:28] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) [08:24:04] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) a:05Clement_Goubert→03Joe [08:43:17] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10JMeybohm) Couldn't we just add another mobileapps deployment (like a canary) that connects to mw-api-int and scale that up slowly while scaling the exis... [11:07:58] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) >>! In T350846#9318457, @JMeybohm wrote: > Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale th... 
[12:35:26] 10netops, 10Infrastructure-Foundations, 10SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) [13:54:48] 10netops, 10Infrastructure-Foundations: cr2-eqiad xe-3/2/2 has errors for the past ~week - https://phabricator.wikimedia.org/T350869 (10BBlack) [13:59:42] 10netops, 10Infrastructure-Foundations: cr2-eqiad xe-3/2/2 has errors for the past ~week - https://phabricator.wikimedia.org/T350869 (10ayounsi) 05Open→03Declined Following up on T342502#9319222 Not ideal, but that should work out :) [14:24:21] 10Traffic, 10serviceops: MW returns uncacheable responses for en.wikipedia.org when specific XFF values are sent - https://phabricator.wikimedia.org/T350861 (10Fabfur) [14:44:12] 10netops, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [14:54:32] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Weight set to 1 for new hosts: ` $ sudo confctl select dc=eqiad,service=cdn,name='cp11.*' set/weight=1 The selector you chose has selected the following object... [14:56:54] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Weight set to "100" for new hosts (ats-be): ` $ sudo confctl select dc=eqiad,service=ats-be,name='cp11.*' set/weight=100 The selector you chose has selected the... [15:01:30] 10netops, 10Infrastructure-Foundations, 10SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Open→03Resolved >>! In T350479#9312405, @Volans wrote: > The code is not checking if the autoselection of the parent is None or not. Indeed. Why... 
[15:01:45] (VarnishHighThreadCount) firing: Varnish's thread count on cp5018:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5018 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:05:03] 10netops, 10Infrastructure-Foundations, 10SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Resolved→03Open [15:08:52] 10netops, 10Infrastructure-Foundations, 10SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10Volans) >>! In T350479#9319519, @cmooney wrote: > For now I'll update //customscripts/_common.py// so that it fails cleanly if this should occur. Not sure what... [15:11:45] (VarnishHighThreadCount) resolved: Varnish's thread count on cp5018:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5018 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:23:36] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) As it's clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :) [15:57:05] 10Traffic, 10serviceops: MW returns uncacheable responses for en.wikipedia.org when specific XFF values are sent - https://phabricator.wikimedia.org/T350861 (10Joe) [16:37:05] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:42:10] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:43:45] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 
10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [16:51:56] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye executed with erro... [16:52:12] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [16:56:12] 10Traffic: Provide a TCP MSS clamping mechanism for real servers - https://phabricator.wikimedia.org/T350462 (10Vgutierrez) [16:57:53] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye executed with erro... 
[16:58:12] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [17:28:50] 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:34:35] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**) - Remov... [18:10:12] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [18:18:20] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:24:52] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL*... 
[18:25:07] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:30:41] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL*... [18:30:51] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:31:39] 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host acmechief-test2001.codfw.wmnet with OS bookworm [18:32:48] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:41:05] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... 
[18:41:11] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:43:56] 10Traffic, 10SRE, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [18:52:00] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... [18:52:11] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:56:33] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... 
[18:56:55] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [19:04:19] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [19:04:48] 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host acmechief-test2001.codfw.wmnet with OS bookworm completed: - acmechief-test2001 (**PASS**) - Downtimed o... [19:07:03] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye completed: - cp1103 (**PASS**) - Remo... [19:07:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:15:17] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... 
[19:15:43] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:18:31] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [19:20:17] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... [19:20:37] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:22:39] sukhe, fabfur: I'm momentarily around if you need a hand/eye for T350179 [19:22:40] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [19:23:15] I'm going to eat, don't know if sukhe is available [19:23:44] volans: I am here but I will be leaving in 15 mins sorry [19:23:54] ack no prob [19:23:57] bad timing :d [19:24:12] volans: have to shovel the snow so I can be back again but :) [19:24:24] let's talk now, I updated the task [19:24:34] same issue persists as before, repeated runs fix the problem [19:24:41] I am happy to debug but I don't know where [19:25:10] this is definitely not the hardware since a) the initial reimage to insetup was fine, b) I have triple-checked the hardware settings [19:25:15] and so has jclark [19:25:30] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... [19:25:48] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:26:01] one thing that is not fully clear to me is if the DHCP is ok or not [19:26:20] have you checked for example things like https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues [19:26:40] or also check the dhcp logs first that is quicker [19:27:01] I checked the DHCP logs [19:27:05] but I didn't do any pcaps [19:27:15] and checked "If using 10G nics, is the bios configured to PXE boot on those?" [19:27:30] I can check the install server and follow up again [19:27:42] do you have a host I can try without causing issues? [19:27:49] many :) [19:27:52] try to reimage or provision, etc... [19:27:55] I just need one :D [19:28:08] so cp1107 is what I am trying right now and is failing [19:28:14] you can pick cp1108 [19:28:20] ok, which OS? [19:28:24] bullseye [19:32:37] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... 
[19:32:56] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:33:02] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**) - Remov... [19:34:37] sukhe: cp1108 is a text HAProxy/Varnish/ATS cache server (cache::text) [19:34:43] isn't that already correct/ [19:34:49] ? [19:35:06] yeah because it was cache::upload earlier and agent ran and changed it to cache::text [19:35:10] ah ok [19:35:13] but needs reimage [19:35:13] but you know what we feel about agent runs vs clean reimages :) [19:35:16] got it [19:35:17] the OCD [19:35:29] insetup->final role can be ok [19:35:33] role1->role2 surely not :D [19:35:55] is not pooled right? [19:35:56] hah! given that both text and upload share the same config now (dual disks), the change was fairly simple [19:36:07] no, no host in cp1100-15 is pooled [19:36:11] er sorry wait [19:36:12] ack [19:36:15] 1100 and 1101 are [19:36:17] but nothing else [19:36:44] puppet 5 or 7? [19:36:57] 5 [19:36:57] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [19:39:56] volans: I will brb in 10-15min, thanks! [19:40:00] k [19:53:54] bac [19:53:54] k [19:55:35] the repro worked as it got stuck for me too [19:55:42] there is no dhcp request at all [19:55:45] ok, I got a pcap too [19:55:47] I checked all MACs [19:55:53] ok, so that matches up then [19:56:26] where do we go from here? as in, what have you done in the previous such cases? 
[19:56:46] re [19:56:52] got netops to check for dhcp request directly in the switches [19:57:13] and see if the dhcp forwarder is at fault there for some reason [19:57:13] ok. do you want to leave a comment summarizing it? if not, happy to since it's late for you [19:57:20] yeah I'm writing it [19:57:24] thanks volans! [19:57:24] so if it's the dhcp, why does it work on the 3rd or 4th try? [19:57:46] fabfur: if I vaguely remember, we have seen this in the past but I don't know what the resolution was [19:58:02] not sure, race? different paths? [19:58:33] it's quite consistent, for pretty much all hosts it NEVER works on the first try [19:58:49] are those NICs different from any other host? [19:58:54] or their firmware [19:59:01] if it's some sort of truly random thing, on 16 hosts it should work for at least 1 on the first try [19:59:02] has the version been checked/upgraded/downgraded? [19:59:16] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [19:59:43] volans: 21.85, the same recommended one at least on the few hosts I checked [19:59:51] k [19:59:51] but all hosts are in puppetdb so I can do a query [20:01:30] picked a random one that was failing, cp1107, 21.85.21.92 [20:01:51] and I think one other thing is that in the previous cases where the NIC firmware was at fault, we were not getting any PXE boot at all [20:02:13] yeah [20:02:15] which I think then makes us come back to a network issue of some sort [20:02:20] mine is going through on second attempt [20:03:02] you lucky [20:05:00] there is one more thing worth pointing out here that might be relevant [20:05:06] we observed this on cp4052 as well [20:05:09] so a different install server [20:05:15] different site [20:05:37] took 3-4 attempts to get cp4052 to reimage, but that was bookworm (not that the OS makes any difference at that stage) 
[20:06:27] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... [20:06:49] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:07:47] updated the task [20:07:59] sukhe: ack, thx for the info [20:08:22] volans: thanks for the help! I will add some of our own notes to the task [20:08:44] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**PASS**) - Remov... [20:08:58] thanks! [20:09:08] no prob, sorry for not being helpful in a conclusive way so far [20:09:26] but yeah I would look at the traffic on the switch at this point [20:09:53] sorry I used the debugging task for the reimage instead of your tracking one [20:09:56] yeah I think that makes the most sense [20:10:02] we have seen hardware failures and this is not that [20:10:10] in that you don't even get a single boot :) [20:11:13] I'll let you know once the reimage finishes, should I mark cp1108 in some way in T349244 ? 
[20:11:14] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [20:11:34] 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [20:11:48] please do, at the very end or just let us know here [20:12:46] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... [20:13:17] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:13:33] volans: it's late for you, you can leave cp1108 to us [20:13:47] in on MY tmux, don't you dare :-P [20:13:57] hah! [20:13:59] no worries started first puppet run now [20:14:13] will end on its own hopefully [20:14:20] yeah no issues beyond this stage [20:16:49] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [20:17:29] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... 
[20:17:39] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:18:20] 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [20:40:04] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Volans) cp1108 completed: see T350179#9321006 [20:40:09] sukhe: all done, commented on task, the actual run is on T350179#9321006 ^^^ [20:40:10] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [20:40:18] going afk :D [20:40:26] thanks! [20:40:30] np [20:54:47] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**PASS**) - Remov... 
[22:35:44] 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [22:42:19] 10Traffic, 10SRE, 10Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/1 Basic retry mechanism for specific kafka errors [22:48:26] 10Traffic, 10SRE, 10Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (10CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/2 Add version print option [23:28:24] 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [23:53:09] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (10cmooney) In terms of the config when we have 2 IPs on the interface with the VGA setup, there is some behaviour we need to be careful of....