[11:50:26] If a change is made to conftool-data a puppet-merge will update conftool data as well, right?
[11:50:56] yep, be sure to run it with sudo -i :)
[11:51:06] hnowlan: yes, but there are other trickities
[11:51:11] like the default weight
[11:51:17] that needs to be set
[11:52:41] Is that in conftool-data or is that something that happens manually? For now I don't necessarily care about the weight as long as the hosts are depooled to begin with
[11:53:29] yes the hosts are depooled to begin with but when you'll pool them they must have a positive weight
[11:53:35] the default is zero
[11:53:48] cool
[11:53:54] for context this is because of some changes in the last few months on the conftool side, before there was a default per cluster
[11:54:04] but that part has been ditched for various reasons
[11:54:22] for some specific hosts there is a quick script that takes care of this part
[11:54:41] but it is not yet generic IIRC, cc _joe_ who has done it
[11:55:51] <_joe_> it exists, it just needs people to include it in their puppet classes
[11:56:22] <_joe_> conftool::scripts::initialize
[11:56:39] <_joe_> only current user is profile::cache::base
[12:09:08] how can I check if a specific package is on our buster-wikimedia repo?
[12:10:50] you can e.g. run "reprepro ls puppet" on install1002.wikimedia.org
[12:12:09] but you need sudo -i (as it grabs some setting from the home of the root user)
[12:13:44] ok, thanks!
[13:31:54] XioNoX: there is also https://tools.wmflabs.org/apt-browser/
[13:32:47] cdanis: no objection if I add it to the doc?
[13:32:48] :)
[13:33:08] it's linked to from the [[Apt]] page on wikitech
[13:33:59] ah yeah! I was only looking at the reprepro page
[13:34:38] seems worth linking to from the reprepro page as well
[13:35:21] thx
[13:40:59] akosiaris: mark: are you both around?
[13:41:41] I wanted to talk about T245058 and T245060
[13:41:42] T245058: Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058
[13:41:42] T245060: Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060
[13:55:46] cdanis: I am around
[13:56:33] ehlos
[13:56:51] just replied on the task(s)
[13:57:01] 2.5.0 :)
[13:57:09] yeah just saw it
[13:57:40] as an aside I think we need some better nouns here -- 'pooled' is getting very overloaded
[13:59:45] (since there's "configuration indicates should be pooled" vs "pybal believes healthy and asks LVS to include in actual-pool")
[14:00:20] ok, I think I am understanding now what you want to see as an alert. pybal /alerts alerts on the "operational" side of it, you want an alert based on the "configuration" side of it
[14:00:25] corrrect?
[14:00:27] yes
[14:00:33] s/rrr/rr/
[14:00:41] cool, yeah, makes sense
[14:00:55] in general I see the preventative actions from that postmortem as being about avoiding obviously-bad configurations
[14:00:58] there's a number of alerts I also want to create for conftool
[14:01:19] e.g. having in an alert the "desired state" of discovery RRs
[14:02:05] and issue a warning when we are diverging from that. So that we have another way of knowing a) that something has happened b) what the "desired state" of discovery RRs is
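
[Editor's note: for readers following the weight discussion above, a minimal confctl sketch of the check-then-pool workflow. The hostname and weight are made up, the output line is only illustrative, and the selector syntax should be double-checked against the conftool version in use.]

    # on a cluster management host, as root
    confctl select 'name=mw1234.eqiad.wmnet' get
    # -> something like {"mw1234.eqiad.wmnet": {"weight": 0, "pooled": "no"}, ...}
    confctl select 'name=mw1234.eqiad.wmnet' set/weight=10
    confctl select 'name=mw1234.eqiad.wmnet' set/pooled=yes
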
[14:02:24] I think both are related in that they are alerts that should be created on the conftool level of the infra
[14:02:34] sure
[14:02:58] (another thing I want to see is conftool itself at least printing a warning when you ask it to depool 'too much' capacity)
[14:03:42] there is the question of defining what "too much" is ofc, but agreed
[14:03:59] dbctl has this functionality and it lets you define it per-database-section ;)
[14:04:05] pybal has a ratio config variable, I don't think we have something similar in conftool, but I guess we can add it
[14:04:14] sadly we recently removed the service objects from conftool
[14:04:19] which would be a natural place to put it
[14:04:24] yes ...
[14:04:53] cdanis: there is an additional problem with this, as depools happen on the single hosts too
[14:04:58] some as part of shutdown for example
[14:05:01] yes
[14:05:08] as happened exactly in this case
[14:05:19] i have written this in the incident doc :P
[14:06:42] (https://wikitech.wikimedia.org/wiki/Incident_documentation/20200211-caching-proxies)
[14:18:31] cdanis: here now for a bit
[14:18:49] so right now pybal does not reject any configurations, but it does have a feature to keep a minimum percentage of its servers pooled
[14:19:13] right
[14:19:15] so if you have say 4 servers, you can configure it to keep at least 2 of them pooled even though it thinks they're down
[14:19:30] because you know you need at least 2 otherwise you can't sustain the load, there's no point depooling more at that point
[14:19:35] (and that feature has saved us many times)
[14:19:40] yep
[14:19:47] so similarly, we could create a feature that rejects configurations with fewer than N hosts I suppose
[14:20:08] but i wouldn't want to hardcode that in for sure
[14:26:06] mark: yeah, by 'rejects configurations' i mean the keep-pooled feature, basically
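
[Editor's note: the "keep a minimum percentage pooled" behaviour referred to above is PyBal's depool threshold. A rough sketch of a per-service stanza follows; the service name, IP and values are invented, and the exact key name/format should be verified against the PyBal version actually deployed.]

    [text_80]
    protocol = tcp
    ip = 10.2.2.1
    port = 80
    # keep at least half of the configured servers pooled, whatever monitoring says
    depool-threshold = .5
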
[14:27:33] one thing I haven't checked is if the depool invocation on CP shutdown does pooled=inactive
[14:29:01] it does pooled=no I believe
[14:29:15] ok that's good
[14:29:23] when we set up all the etcd pooling for the cps, inactive didn't exist and/or we weren't sure about how to use it with pybal etc
[14:29:57] pooled=no
[14:30:02] I can confirm :)
[14:30:05] cdanis: is there per-service metadata settable in etcd?
[14:30:13] there _used_ to be
[14:30:28] could set a non-default threshold there and have confctl pull that to make a decision
[14:30:31] some months ago it was removed as part of a code cleanup, since all it really did was let you set a default weight for new node additions
[14:30:39] but I am thinking maybe we need to re-add it
[14:31:05] on the other hand, we still have the human vs non-human thing
[14:31:12] and I imagine non-human depools will be common
[14:31:33] having confctl ignore the automated depool because of a threshold, with nobody at the keys to react, seems bad too
[14:32:07] yeah, that's why I think that 1) confctl should output a warning, but probably should still do what you tell it to and 2) there should be some flavor of alert for such a state
[14:32:20] yeah that sounds right to me
[14:32:52] maybe for bonus points, if confctl could detect there's a real tty attached, it could do a warning + are you sure re-prompt
[14:33:17] yeah, that's fairly straightforward
[14:34:15] but while we're on the subject, IIRC (but my memory may be hazy), pybal's depool limits had some quirks that probably aren't yet solved in the branch we're running.
[14:34:48] I see
[14:35:01] IIRC, pybal's depool thresholding is on the final runtime pool state (the result of both healthchecks and etcd)
[14:35:13] that sounds as it should be
[14:35:52] but it doesn't keep a separate state table of desired pooling based on etcd and/or healthcheck, vs real pooling that was held back by threshold
[14:35:59] or something like that
[14:36:40] like I said, my memory is hazy, but I don't think it keeps enough state to deal with it very well
[14:36:43] sure
[14:37:05] and there's plenty of hard questions about what exactly to do when you're in such a state, even if you are tracking all of that
[14:37:08] say you had 10 servers and a threshold of 5. If healthchecks have knocked down 4 already, then etcd knocks down two more, etc...
[14:37:14] and then which ones recover in which order.
[14:37:26] I think pybal only has one item of state per server entry, in one place, IIRC
[14:38:04] but I remember we've run into a past incident where pybal more or less lost track of what to sanely do, depending on the combination and timing of inputs there
[14:40:52] mmm. seeing as how I'd like to leave "rewrite Pybal" out of scope for this, maybe it's best just to work on the confctl and monitoring side of things
[14:51:35] yes, probably best for now :)
[14:59:19] can someone tell me why this always boots the VM from disk instead of from the network:
[14:59:27] https://www.irccloud.com/pastebin/RT47RyrV/
[14:59:35] I did it 3 times...
[15:01:56] weird I've never had that; my experience has been more 'you typoed the MAC address in DHCP, and now it silently won't boot and won't give you console either'
[15:04:01] it's a re-image so didn't change the mac
[15:04:07] yeah
[15:04:22] who are the ganeti masters? :)
[15:05:24] akosiaris
[15:05:39] also I thought the reimage cookbook was modified to understand Ganeti at some point?? by chaomodus I thought
[15:06:32] lol
[15:07:13] opened https://phabricator.wikimedia.org/T245158
[15:07:19] XioNoX: you are trying to reimage it?
[15:07:31] akosiaris: following https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_%2F_Reimage_a_VM
[15:07:38] it worked for 3 VMs but not that 4th one
[15:07:58] hmm, smells like DHCP issues. lemme verify
[15:08:54] akosiaris: that's the DHCP change I did https://gerrit.wikimedia.org/r/c/operations/puppet/+/571962
[15:09:30] wait.. where are the DHCP servers running now?
[15:09:40] cdanis: nope
[15:09:43] hmmm
[15:09:49] akosiaris: installX
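
[Editor's note: when a Ganeti VM refuses to netboot like this, one useful early check — and where the culprit eventually shows up below — is the instance's hypervisor parameters on the cluster master. A sketch; the VM name is a placeholder and the flags should be checked against the local Ganeti version.]

    # on the Ganeti master of the relevant cluster
    sudo gnt-instance info example1001.eqiad.wmnet | grep -iE 'boot_order|kvm_extra'
    # clear a stray kvm_extra override (e.g. "-bios OVMF.fd") if one is set
    sudo gnt-instance modify -H kvm_extra="" example1001.eqiad.wmnet
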
[15:10:26] unrelated but someone broke https://my.juniper.net/
[15:10:30] lol volans literally just sent https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/571998 ? 👀
[15:11:31] that's a noop though, it's just moving it from the makevm cookbook to spicerack ;)
[15:24:10] kvm_extra: -bios OVMF.fd
[15:24:16] sigh, how did that make it in there?
[15:24:48] hm, I remember someone asking me if they could use my VMs for testing something before they were prod
[15:24:58] that someone is paravoid :P
[15:25:46] akosiaris: can you check if rpki1001 is the same?
[15:25:49] sure
[15:25:53] thx!
[15:29:37] XioNoX: yes it was
[15:29:40] fixed there as well
[15:29:43] * akosiaris running a test now
[15:30:13] cool
[15:30:36] Loading debian-installer/amd64/linux... ok
[15:30:38] that was it
[15:31:08] what's the issue with OVMF?
[15:31:11] it should work
[15:31:18] paravoid: nope, VMs don't get reimaged
[15:31:31] in fact, they don't do network boots
[15:31:51] yes they do, how do you think this was installed? :)
[15:32:03] well, maybe once?
[15:32:16] it's possible that the boot order can't be persistently modified though, yes
[15:33:05] I've removed it btw, so that the reimage can proceed
[15:33:17] so you're reimaging it in BIOS mode now?
[15:33:27] yup, same as all VMs
[15:33:27] ohnoes,
[15:33:52] haven't you filed a couple of bugs upstream in ganeti and OVMF about support for it?
[15:33:52] fwiw, nuking the bootloader && reboot would have done it
[15:34:09] that's what I remember
[15:34:28] just ganeti, and yeah that's to do it in a better way
[15:35:18] yeah it's the mess of having somewhere to store the boot order that is accessible by ganeti
[15:35:41] with OVMF having a RO part and a RW part and ganeti still not being able to use the RW part
[15:41:01] yes, gory details @ https://github.com/ganeti/ganeti/issues/1374
[15:41:38] there are two distinct/separate problems
[15:42:17] one is -boot order=N vs. ,bootindex=N, which would fix the issue above
[15:42:37] the other one is persistent state for OVMF settings, incl. bootloader entries
[15:43:39] the latter shouldn't affect us, because we're configuring d-i to copy GRUB to bootx64.efi, so we don't need to use the "debian" menu entry
[15:43:57] cf. modules/install_server/files/autoinstall/virtual.cfg:8, "d-i grub-installer/force-efi-extra-removable boolean true"
[15:48:49] akosiaris: I saw that you closed the task, am I good to go with those VMs?
[15:53:46] XioNoX: yup. Both should be ok
[15:56:00] paravoid: ah yes, we met https://github.com/ganeti/ganeti/issues/1374#issuecomment-557524579 again today
[15:56:12] I audited the clusters btw, no other OVMF-enabled VMs
[15:56:24] akosiaris: thanks!
[15:56:39] until ganeti at least goes with the -device,bootindex=N solution we probably want to avoid it?
[16:50:32] akosiaris: when do you think it will be time to remove package_updates from facter? :-)
[17:01:38] who's in charge of the webproxies?
[17:04:08] is there a squid/webproxy dashboard somewhere?
[17:06:03] I think they tend to get ignored a lot because they mostly "just work"
[17:06:13] (webproxy)
[17:06:32] it wouldn't be a bad idea to keep better tabs on them, esp to notice anomalous outbound stuff going on...
[17:06:40] https://phabricator.wikimedia.org/T245121#5881546
[17:06:57] I'm wondering if the eqiad one is overloaded or misbehaving
[17:07:12] it's possible!
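
[Editor's note: absent a proper dashboard, a quick way to eyeball whether a squid webproxy is overloaded is its cache manager interface, queried from the proxy host itself. A sketch — it assumes squidclient is installed, the cache manager is reachable from localhost, and the port (8080) is a guess; adjust to the local config.]

    squidclient -p 8080 mgr:info | grep -iE 'file desc|clients|average HTTP'
    squidclient -p 8080 mgr:5min | head -n 20
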
[17:08:02] there must be a squid prometheus exporter somewhere :)
[17:09:14] https://github.com/boynux/squid-exporter not full-featured but last update 4 days ago
[17:09:33] godog :)
[17:13:19] opened https://phabricator.wikimedia.org/T245176
[17:29:30] hah!
[17:29:36] can we trash squid for ats? :P
[17:31:38] only half kidding really AIUI ats can act as a forward proxy too
[17:42:09] <_joe_> XioNoX: have you looked at the logs from the eqiad webproxy?
[17:42:34] _joe_: not yet
[17:42:55] <_joe_> what's the URL?
[17:43:38] XioNoX: around?
[17:43:51] volans: yep?
[17:43:57] seems that the whole mgmt network is down
[17:44:04] _joe_: https://rrdp.ripe.net/notification.xml
[17:44:12] volans: that's not good
[17:44:25] not sure if related to some pdu work or something but seemed too many
[17:44:47] volans: pinged dcops?
[17:44:53] already did
[17:45:09] cool, so A6 only?
[17:45:17] either a switch dead or a cable bumped
[17:45:28] no
[17:45:31] 333 hosts
[17:45:36] so clearly not a single rack
[17:49:24] <_joe_> XioNoX: the proxy works perfectly to that url
[17:49:31] <_joe_> I just ran curl from boron
[17:52:47] _joe_: it looks intermittent though
[17:53:38] yeah things are flappy, this is really odd
[17:53:57] icinga flaps between 300 and ~390 down mgmt
[17:54:09] any volunteer for IC?
[17:54:13] afaics there's no production impact
[17:55:14] agree
[17:55:40] and it's only eqiad, right?
[17:55:41] now ~484 unreachable (not down)
[17:55:48] yes AFAICT
[17:56:01] random thought, did someone loop the mgmt network?
[17:56:11] could be a broadcast storm
[17:56:14] cmjohnson1: ^
[17:56:22] it's a possibility
[17:56:35] yeah https://librenms.wikimedia.org/graphs/lazy_w=652/to=1581616500/device=22/type=device_bits/from=1581530100/legend=no/
[17:56:56] ok we're saturating it
[17:57:01] there's some odd syslog entries on librenms too
[17:57:02] we need to undo whatever was changed
[17:57:28] mr1-eqiad: %-: /usr/sbin/sshd[80684]: exited, status 255
[17:57:34] I guess that's normal on session termination
[17:57:43] cmjohnson1: ^ likely related to the mw row D racking
[17:58:01] trying to find which port
[17:58:03] I am not racking anything....everything that is being done was already connected
[17:58:10] just fixing idrac settings
[17:58:35] ah my bad, misunderstood what was going on
[17:58:46] I'm trying to downtime eqiad mgmt
[17:59:08] I'll stop ircecho, it is just spam at this point
[17:59:24] godog: wait
[17:59:27] I'm downtiming them
[17:59:37] ok
[18:00:00] looks like D3 https://librenms.wikimedia.org/device/device=22/tab=port/port=3002/
[18:00:00] D3: test - ignore - https://phabricator.wikimedia.org/D3
[18:00:10] cmjohnson1: ^
[18:00:26] downtimed all for 1h
[18:00:49] thanks volans
[18:01:15] !log msw1-eqiad# set interfaces ge-0/0/34 disable
[18:01:22] XioNoX: Not expecting to hear !log here
[18:01:23] librenms syslog/eventlog is a PITA when trying to skip past the recent spam
[18:01:23] and commit confirmed 5
[18:01:51] let me know if things improve, only D3 should be unreachable for now
[18:01:54] for reference I've used the downtime from the 'mgmt' hostgroup, it's a bit wider than I would like but fitted the purpose:
[18:01:57] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=5&hostgroup=mgmt
[18:02:16] if you need to extend ^^^ click on downtime services and then the checkbox for 'hosts too'
[18:02:22] could be a loop, could be a buggy iDRAC
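
[Editor's note: consolidating the change from 18:01 above for reference; the "commit confirmed" pattern rolls the change back automatically if it turns out to cut you off before you can confirm it. Lines after # are annotations, not part of the commands.]

    msw1-eqiad# set interfaces ge-0/0/34 disable
    msw1-eqiad# commit confirmed 5
    # wait, verify that mgmt reachability recovers, then make it permanent:
    msw1-eqiad# commit
    # and to undo once the loop is fixed:
    msw1-eqiad# delete interfaces ge-0/0/34 disable
    msw1-eqiad# commit
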
[18:02:43] a single one?
[18:02:58] it's a single rack disabled at present
[18:03:10] XioNoX: looks much better now from icinga POV
[18:03:22] was 418 now is 273...
[18:03:44] making my commit permanent
[18:03:57] 182 and going down
[18:03:58] the D3 switch is unmanaged so I can't do anything more from there
[18:04:17] cmjohnson1: can you check cabling of D3 mgmt?
[18:04:34] down to 80
[18:04:37] of course I have a bus ride in 45min :)
[18:04:37] there are 3 mw servers in there that were updated...checking them now
[18:04:55] (side note maybe for later, but don't we even have spanning-tree or some other kind of loop protection?)
[18:05:00] (for mgmt)
[18:05:32] not really no, it's mostly unmanaged switches
[18:06:01] even if the per-rack switches are unmanaged, should spanning-tree on the ms1 itself catch anyone looping one dumb switch to another?
[18:06:09] *msw1
[18:06:26] 14 now
[18:06:35] bblack: we can add some flood protection yes, but not spanning tree
[18:06:54] XioNoX: how did you figure out it was D3?
[18:06:54] D3: test - ignore - https://phabricator.wikimedia.org/D3
[18:06:57] we're moving from host down to ssh failing socket timeout
[18:07:10] XioNoX: how many should we expect down with D.3 disabled?
[18:07:28] volans: the servers in rack D.3, check netbox
[18:07:38] we've got many more failing SSH
[18:07:45] paravoid: looked at all the msw1-eqiad switch ports in librenms and flagged the one which had only inbound flood, instead of outbound for all the others
[18:07:53] all 3 of the hosts I worked on are booting into their idrac/bios
[18:08:00] 30-40 or so
[18:08:06] (servers in D3 in netbox)
[18:08:14] it says devices: 40
[18:08:31] we've ~150 failing SSH
[18:08:42] volans: ssh is very slow to recover
[18:08:44] but some might be planned or decom or whatever
[18:08:50] could be true
[18:09:07] not for later, have !log in that channel too
[18:09:11] note*
[18:09:23] hosts down are down to 7
[18:09:28] 29 devices in D3 are "Active"
[18:10:27] https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&rack_id=37&status=1 for reference
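
[Editor's note: the rack listing above can also be pulled from the NetBox API, which is handy mid-incident for a quick count of active devices. A sketch — the token is a placeholder, and the status filter (slug vs numeric, as in the UI URL above) differs between NetBox versions.]

    curl -s -H "Authorization: Token ${NETBOX_TOKEN}" \
      'https://netbox.wikimedia.org/api/dcim/devices/?rack_id=37&status=active&limit=0' \
      | jq -r '.count, .results[].name'
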
[18:11:22] the remaining failing ones are in D6
[18:11:22] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6
[18:11:27] the network settings on the 3 servers all look normal
[18:11:40] failing as HOST DOWN I meant
[18:11:58] but are not the whole rack
[18:12:24] d6 is another rack where servers were being worked on, not sure if it's related but all of these hosts had to have their ipmi enabled
[18:12:26] ok
[18:12:29] that was the only network change
[18:12:51] no cable changes today?
[18:13:01] (for this I mean, in row D)
[18:13:02] no cable changes
[18:13:08] hmmmm
[18:13:17] maybe the switch is faulty
[18:13:22] maybe one of the IPMIs grabbed a critical in-use IP of something else?
[18:13:46] bblack: nah that would just be an IP conflict and only impact those 2 hosts
[18:13:47] so SSH ones we check every hour, so it's normal they are not recovering
[18:13:54] but we would not see that spike of traffic
[18:14:09] XioNoX: no I mean like, could it have duplicated a mr/msw-level IP address somewhere and just broke monitoring of mgmt
[18:14:25] bblack: doesn't explain the traffic spike
[18:14:29] yeah :/
[18:14:42] XioNoX many of them were hitting the installer
[18:14:59] cmjohnson1: but installed on their primary link, not mgmt
[18:15:04] installer*
[18:15:14] true
[18:15:26] I picked a random failed SSH check and forced a recheck, it's now green, they will recover within 1h (check interval)
[18:15:33] ok
[18:15:49] we still have 7 failing in D6
[18:15:53] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=1
[18:16:47] gotta go
[18:16:59] and I think they are the *only* ones active in D.6
[18:17:02] that explains why only those
[18:17:11] the others are not checked
[18:17:20] so kinda assume D.6 mgmt is also down
[18:18:03] # run show interfaces descriptions | match d6
[18:18:03] ge-0/0/28 up up Core: msw-d6-eqiad {#3493} [1Gbps Cu]
[18:18:10] the switch port to d6 is good
[18:18:26] d6 link to msw1 is dark
[18:18:48] ?
[18:18:55] yeah so msw1 port traffic graphs, D3 was the only one with a bidirectional spike, the rest were out-only spikes, I'm assuming that's how X found it earlier.
[18:18:55] D3: test - ignore - https://phabricator.wikimedia.org/D3
[18:19:10] bblack: yep
[18:19:13] the link light XioNoX
[18:19:21] uh
[18:19:30] cmjohnson1: ge-0/0/28 is D6?
[18:19:31] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6
[18:19:41] it might be mislabeled
[18:20:04] oh you already said that too, I just missed it :)
[18:20:17] I'd bet it's labelled msw-d3-eqiad while it actually goes to D6
[18:20:24] and I shut down D6 and not D3
[18:20:34] either way, it was the misbehaving port
[18:20:38] yeah
[18:20:44] TODO: blacklist in stashbot links to differential, at least here
[18:21:01] or at least to single digit differential
[18:21:19] so we should be looking for a cause in D6's dumb switch, or one of the hosts attached to it
[18:21:22] not in D3
[18:21:54] to whatever is connected to msw1-eqiad:ge-0/0/34
[18:22:02] cmjohnson1: if you can trace this port/cable ^
[18:23:10] d6
[18:23:16] XioNoX
[18:23:20] ok
[18:23:50] TODO audit msw1-eqiad ports/cables
[18:24:02] XioNoX figured out the cause
[18:24:04] cmjohnson1: so can you check if there is any loop on d6 mgmt switch?
[18:24:06] loop
[18:24:07] ahhh!
[18:24:09] eh
[18:24:26] cmjohnson1: let me know when I can re-enable the port
[18:24:28] operator error
[18:24:29] go ahead
[18:24:59] !log ROLLBACK: msw1-eqiad# set interfaces ge-0/0/34 disable
[18:24:59] XioNoX: Not expecting to hear !log here
[18:25:04] yeah unfortunately nothing on the msw1 could reasonably detect a loop within one dumb subswitch
[18:25:54] we could put some kind of flood limit in place though, down a port if it spikes traffic in over a generous threshold?
[18:25:56] Input rate : 0 bps (0 pps)
[18:25:57] Output rate : 4416 bps (3 pps)
[18:26:03] so the port is good now
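
[Editor's note: the Input/Output rate lines above come from the Junos interface status; for the record, a couple of read-only commands that show the same information live from msw1-eqiad — effectively what the LibreNMS port graphs showed in aggregate.]

    msw1-eqiad> show interfaces ge-0/0/34 | match "rate|error"
    msw1-eqiad> monitor interface ge-0/0/34
    msw1-eqiad> show interfaces descriptions | match msw
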
[18:26:31] bblack: yeah juniper has some features for that, especially for BUM traffic spikes
[18:27:39] * volans forcing a re-check for the 7 remaining
[18:27:50] thx
[18:28:05] recovering
[18:28:09] great
[18:28:31] so what happened?
[18:29:05] paravoid: looped msw-d6
[18:29:23] yes, how?
[18:29:50] paravoid: plugged both sides of a cable on the switch?
[18:30:14] (just guessing)
[18:30:57] paravoid yes, a cable was not removed from the switch and mistaken for one being used by a mw server....turns out it was already plugged in
[18:31:02] forcing some rechecks on icinga to clear it up
[18:31:11] ah
[18:31:39] but I thought there were no cable changes today?
[18:31:43] I guess we don't need an IC anymore :)
[18:33:38] alright going to catch a bus soon, will be working from there
[18:33:48] all cleared on icinga
[18:33:51] I've forced all the SSH checks
[18:33:55] and they recovered
[18:34:07] I removed and re-inserted a green cable; during the process it slipped out of my fingers and I must've grabbed the one that wasn't removed...I didn't really classify that as a cable change. It wasn't until we moved to d6 from d3 that it occurred to me
[18:34:09] thanks!
[19:03:59] bblack: if you're curious - https://www.juniper.net/documentation/en_US/junos/topics/concept/rate-limiting-storm-control-understanding.html
[19:04:18] should prevent future similar events
[19:10:41] opened https://phabricator.wikimedia.org/T245192
[20:48:30] volans: JDI ?
[20:49:51] nothing is using it anymore right?
[21:03:46] <_joe_> volans: we'll know soon enough :D
[22:09:15] akosiaris: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572101 for you :)
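
[Editor's note: tying off the storm-control idea from 19:03 and T245192 — a rough sketch of what the Juniper config could look like. The profile name and bandwidth figure are invented, and the stanza differs between ELS and older non-ELS EX switches, so treat this as illustrative only; lines after # are annotations.]

    # ELS-style platforms (newer EX/QFX):
    set forwarding-options storm-control-profiles mgmt-flood-limit all bandwidth-level 10000
    set interfaces ge-0/0/34 unit 0 family ethernet-switching storm-control mgmt-flood-limit
    # older non-ELS EX syntax:
    set ethernet-switching-options storm-control interface all bandwidth 10000
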