[12:43:20] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091725 (ABran-WMF) preparation job with the first few critical instances on the path is done for now. I'll have a few host to mo...
[12:56:17] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091780 (ABran-WMF) this task depends on: T373175
[12:57:18] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091785 (ABran-WMF)
[12:57:28] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091786 (ABran-WMF)
[12:59:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10091794 (ABran-WMF)
[12:59:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091795 (ABran-WMF)
[13:36:09] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10091977 (Clement_Goubert) Open→In progress
[13:53:22] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:02:52] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:08:11] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092103 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:17:00] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092168 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:18:01] FYI I'm upgrading prometheus3003 to bookworm, and it is rebooting now, just in case you see artifacts on the graphs
[14:27:58] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10092235 (ayounsi)
[14:39:58] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092313 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:44:40] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092328 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:46:35] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092348 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:54:19] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092370 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:00:24] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092408 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:16:46] sukhe: brett: one of you available to pair on the wdqs lvs stuff in ~2 hours or so?
[15:17:26] ryankemper: wfm
[15:19:41] I'm booked for the next ~90m if that matters
[15:19:54] ...which I don't think it does, but just FYI
[15:32:54] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:49:14] Traffic: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134#10092818 (ssingh)
[15:49:23] Traffic: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134#10092819 (ssingh)
[16:01:33] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:01:55] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092864 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[16:05:21] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092871 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:52:55] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093070 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[17:07:54] Doing a final glance over patches, then ready to get started soon
[17:11:49] cool
[17:24:07] Here too
[17:24:42] brett: thanks
[17:27:40] brett: sukhe: alright we're in https://meet.google.com/fde-tbpf-wqh now. got some very broad order of operations notes in https://etherpad.wikimedia.org/p/wdqs_graph_split_production_deploy
[17:44:32] Traffic: Error message says "%error_body_content%" - https://phabricator.wikimedia.org/T371424#10093420 (CDobbins) Open→Resolved Closing this now, since a fix was deployed last week and there haven't been any more reports of this happening. Please reopen if this occurs again.
[18:07:26] Traffic: Error publishing edits on wiki pages: "%error_body_content%" - https://phabricator.wikimedia.org/T373108#10093494 (CDobbins) Open→Resolved a:CDobbins Closing this now, since a fix was deployed last week and there haven't been any more reports of this happening. Please reopen if this...
[18:23:24] sukhe: Back when I ran the rolling lvs restarts for wcqs back in late 2021, there was a primary and secondary for each given lvs class (in that case, `low-traffic`). Am I correct in my reading of the current docs that now there's just a general `secondary` class corresponding to all the lvs classes?
[18:24:01] in other words, there's no longer a primary and secondary for each lvs class/type but rather a primary for each lvs class and then a secondary that's shared between the various lvs classes (low-traffic, high-traffic[1,2] etc)?
[18:24:39] ryankemper: the only thing that has changed (and what might be confusing you) is that there are aliases for low-traffic and the secondary
[18:24:58] so basically instead of looking up what the backup LVS is in Puppet, you just say A:lvs-secondary-eqiad
[18:25:20] nothing else has changed architecturally
[18:26:13] the secondary/backup is shared among all instances (basically, you can fail over high-traffic[12], low-traffic to it) as we announce the relevant IPs from there
[18:26:32] so there is only one of each class (and has been) and a secondary that is shared among them
[18:26:55] does that help?
[18:27:33] sukhe: mostly! so does that mean that there's always only been one shared secondary, even 2 years ago?
[18:27:38] yep
[18:27:47] it sounds like yes but i think I've got a bit of a https://en.wikipedia.org/wiki/False_memory#Mandela_effect going on
[18:27:50] ack, ty
[18:28:40] no worries, the naming hardly does justice -- "low-traffic" is a misnomer in that sense and so is the misused (by everyone including us) definition of the LVS class
[18:28:57] also fwiw, of course feel free to directly call the LVS instance, so lvs1020.eqiad.wmnet instead of A:lvs-secondary-eqiad
[18:30:17] sukhe: The LVS article has a step for deploying a new service. On the secondary section: "Ensure you use the correct LVS class, high-traffic1, high-traffic2, low-traffic, corresponding to where you service is."
[18:30:28] Is that to mean "check ipvsadm to make sure it's there?"
[18:31:03] brett: it means to ensure that you are using the correct class for the service. some services are high-traffic1 (such as text-http, https) but I think WDQS is all low-traffic
[18:31:51] In the context of the secondary, it's not really relevant whether it's high/low, right?
[18:32:10] Since it's all shoved on the same secondary
[18:32:20] I think that's where the confusion came from
[18:32:27] yeah, the secondary context is just for the site
[18:32:38] brett: yeah, totally fair. let me fix it
[18:33:08] done
[18:33:39] Okay, cool. Thanks :)
[19:13:43] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093775 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[19:14:48] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[19:19:04] sukhe: restarted eqiad. working on codfw now, but one thing we were a little surprised by is we didn't see ipvs diff check alerts
[19:19:30] my spidey senses tell me that that means something isn't quite right...but the ipvsadm and curl checks we've run thus far have shown what we expect
[19:19:48] ryankemper: I did see the alert on lvs1019 fwiw
[19:19:54] it cleared up as expected
[19:19:58] so at least one data point!
[19:20:11] not sure about lvs1020
[19:20:19] Where'd you see the alert?
[19:20:21] but yeah, if the output looks good, feel free to proceed
[19:20:24] brett: icinga UI
[19:20:27] sukhe: oh, I thought it would surface in IRC, i guess not though
[19:20:40] yeah, fair enough, it should have alerted there too
[19:20:44] :?
[19:20:49] it probably did but only in -operations where it's drowned in noise
[19:21:04] I see no such alerts in -operations. Hm
[19:21:08] mutante: I have a highlight for lvs* :P
[19:21:12] nothing in my scrollback
[19:21:21] ryankemper: brett: manually check lvs2014 this time?
[19:21:26] oh, cool. but also odd
[19:21:38] I have a suspicion that something might be up with the alerting, unrelated to this
[19:22:03] maybe icinga-wm is down?
[19:22:45] usually everything that alerts in web also does on IRC
[19:22:51] yeah, something is up
[19:23:28] I keep thinking the important bots should have a dedicated bot machine in prod.
[19:24:57] mutante: bounced the bot, let's see
[19:25:30] sukhe: which host was it?
[19:26:05] alert1001
[19:26:07] I have another suspicion which is a pet peeve. disabled notifications that are forgotten.
[19:26:17] not an lvs machine?
[19:26:18] mutante: I thought so too but definitely not in this case
[19:26:39] mutante: lvs2013 for example is alerting on Icinga as it should
[19:26:41] but not to IRC
[19:26:44] icinga-wm that is
[19:27:07] journal output of ircecho also looks clean
[19:27:21] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:27:24] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 81 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal
[19:27:32] but not actually being sent out to IRC?
[19:27:55] it's back
[19:29:28] :) lvs1013 - lvs1016 are old and to be decom'ed or so, I assume?
[19:30:23] yep, they are used for Liberica testing
[19:30:30] gotcha
[19:30:53] ryankemper: brett:
[19:30:54] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-scholarly_80
[19:30:54] wdqs1024.eqiad.wmnet: enabled/up/pooled
[19:30:54] wdqs1023.eqiad.wmnet: disabled/up/not pooled
[19:30:55] intentional?
[19:31:11] wdqs1023
[19:31:27] looking
[19:31:27] sukhe: intentional
[19:31:30] thanks
[19:31:51] just pooled wdqs1023
[19:32:00] sukhe: well, semi-intentional. it can be pooled now, I depooled it because there were puppet cert failures but in the meantime I think inflatador has fixed those
[19:32:08] yep pooled now, looks good
[19:34:37] sukhe: as far as the `Add discovery/DNS resources` step is concerned (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064848), is the order of operations `merge patch to change state to production` -> `run puppet on lvs*`, `merge dns disc dns patch` -> `run puppet on authdns hosts`?
[19:34:53] i.e. do i need to have puppet finished on lvs before merging the dns patch, or does the order of operations between those two not matter
[19:35:47] ryankemper: I think it's best to do it in order
[19:36:32] basically first do the lvs change and then do the DNS change (and follow that up with agent on A:dnsbox)
[19:37:42] the order between these two certainly matters
[19:41:40] cool, updated docs slightly to clarify
[19:42:29] thanks
[19:59:55] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub...
[21:09:53] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:17:25] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094193 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[21:35:05] ryankemper do you remember if we ran `authdns-update` after merging https://gerrit.wikimedia.org/r/c/operations/dns/+/1064831 ? I don't see DNS records for `wdqs-main.discovery.wmnet` or `wdqs-scholarly.discovery.wmnet`
[21:40:36] That one just wanted a puppet run not the full authdns cmd. But since we’d been missing that earlier patch at the time we ran the authdns update maybe we need to again
[21:46:53] it is required for that step
[21:47:08] basically this step below towards the end
[21:47:10] > Merge the DNS change. Choose one authoritative DNS (eg: dns1004.wikimedia.org), and run sudo -i authdns-update. That script will deploy your change to all our DNS servers.
[21:47:25] since it is a change to a zonefile (wmnet)
[21:48:05] if you still get a nxdomain, we might need to clear the caches
[21:48:20] sudo cookbook sre.dns.wipe-cache in that case
[22:00:00] sukhe cool, ryan-kemper is away so I'm going to run the authdns shortly
[22:03:16] thanks. and then you should see the eqiad IP endpoint from an eqiad host, codfw from codfw host
[22:04:29] ACK, confirmed working from a codfw host
[22:05:48] nice!
[22:05:56] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094339 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:53:13] ryankemper: yeah because Puppet creates the various state files that gdnsd needs that are not managed by authdns-update
[22:53:18] see https://puppetboard.wikimedia.org/report/dns1004.wikimedia.org/5e37cbd6297550b9e3b41dde467e2d203adbed35
[22:53:34] so first they need to be created and then authdns-update for the zone file (managed via Git)
[22:54:59] and it's best that you pool it in confd to see the changes (and that something is pooled) and hence the different steps