[06:56:44] arturo: o/ - there is an alert for cloudvirt hosts in icinga; it seems the class backy2 tries to install python3-fusepy on stretch, where it is not available (only in stretch-backports)
[08:32:44] thanks elukey, will look into it soon
[08:39:20] <_joe_> I've disabled puppet on a bunch of hosts, I'm doing some changes that are relatively risky
[12:14:33] scap sync-file throws "snapshot1010.eqiad.wmnet returned [255]: Host key verification failed." as a warning - I don't know if that's expected or an issue - can someone weigh in in -operations please?
[12:18:35] I can ssh to snapshot1010.eqiad.wmnet just fine
[12:19:00] I can ssh there as well, but scap has hardcoded host keys iirc - so it doesn't work for scap
[12:19:41] <_joe_> Urbanecm: lemme run puppet on deploy1001
[12:19:47] thanks _joe_
[13:49:28] elukey, arturo: what host is that? Many of those cloudvirts have a buster-only puppet role applied but are still awaiting reimage to buster
[13:49:33] But in theory they're all silenced
[13:51:32] andrewbogott: puppet keeps making changes every run, because the package is not present in stretch
[13:51:43] (there is an aggregate alarm about it)
[13:52:36] ah, so even if a host is in downtime it shows up?
[13:53:33] that seems like a problem with the check - downtiming alerts is definitely a thing we should support
[13:55:54] I think the "downtime" procedure would be to disable puppet, assuming the reimage will happen soon
[13:56:16] otherwise maybe add a conditional installation, depending on the OS version?
[13:57:13] It's not that I can't think of ways to work around it, it's just that I marked the host as in downtime in icinga and that should be the end of it. Otherwise 'downtime' doesn't actually mean downtime?
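The conditional-installation idea suggested above could look roughly like this in Puppet. This is a minimal sketch only, assuming the standard `os` release facts; the real backy2 class may be structured quite differently:

```puppet
# Sketch: only install the package where it exists in the main archive.
# python3-fusepy is in Debian buster (10) but only in backports on stretch (9).
if Integer($facts['os']['release']['major']) >= 10 {
    package { 'python3-fusepy':
        ensure => present,
    }
}
```

This stops the flapping "changes every run" reports on stretch hosts awaiting reimage, at the cost of silently skipping the package there.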
[13:57:31] no, downtiming the host is working
[13:57:46] but it is the aggregate non-host check that is failing
[14:09:36] andrewbogott: you can disable puppet and it should be fine; the check is unrelated to downtimed hosts, it checks which ones have puppet runs doing something (probably) weird because they are misconfigured
[14:09:43] yep
[14:10:44] the idea behind the aggregation was to avoid multiple-alert spam, so only one is received instead 0:-)
[14:11:01] I understand. My point is that "unrelated to downtimed hosts" == the test is reporting incorrectly. It should not include downtimed hosts in the aggregate.
[14:15:51] akosiaris: good email re: alerting for maps.
[14:16:25] kormat: ♥
[14:41:41] <_joe_> there is just an error: there is a runbook for maps :)
[14:46:27] <_joe_> https://wikitech.wikimedia.org/wiki/Maps/Runbook
[14:48:20] huh. db2077 has paused on boot, with just `[ OK ` showing on the serial console. useful.
[14:50:59] kormat: I think there is a ticket about that
[14:51:57] https://phabricator.wikimedia.org/T216240
[14:51:59] kormat: https://phabricator.wikimedia.org/T216240
[14:52:03] the firmware from hell
[14:52:05] you found it first
[14:52:27] mm. would this also cause a CPU error in the management logs?
[14:52:48] I don't think so, it was only a boot failure (non-deterministic)
[14:53:00] ok. then this seems different.
[14:53:01] it "ended up booting"
[14:53:51] but you know what is the support thing we will get from the vendor, right?
[14:54:03] "upgrade firmware"
[14:54:13] :-(
[14:54:20] actually, I think it's the same error; there's nothing in the SEL and the symptom of failing boots is the same as I remember it?
[14:54:41] and TTBOMK we didn't run into this again once the firmwares were updated
[14:54:46] it could be - in any case it could be related, based on the range
[14:54:55] https://phabricator.wikimedia.org/T267220 filed.
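Back on the aggregate-check debate above: the fix being asked for amounts to subtracting downtimed hosts from the failing set before counting toward the aggregate. A minimal sketch in Python, with illustrative host names and a made-up threshold; this is not the real check's code:

```python
# Sketch: exclude downtimed hosts from an aggregate "puppet doing weird
# things" check, so that downtiming a host in Icinga silences it here too.
# Host names and the threshold are illustrative assumptions.

def aggregate_alert(failing_hosts, downtimed_hosts, threshold=1):
    """Return (firing, counted): whether the aggregate fires, and which
    non-downtimed failing hosts were counted toward it."""
    counted = sorted(set(failing_hosts) - set(downtimed_hosts))
    firing = len(counted) >= threshold
    return firing, counted

failing = ["cloudvirt1001", "cloudvirt1002", "db2077"]
downtimed = ["cloudvirt1001", "cloudvirt1002"]

firing, counted = aggregate_alert(failing, downtimed)
print(firing, counted)  # → True ['db2077']
```

With every failing host downtimed, `aggregate_alert` returns `(False, [])` and the aggregate stays quiet, which is the behavior andrewbogott is arguing for.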
[14:55:54] so either "update firmware" or "drain power"
[14:56:13] add dc-ops anyway
[14:57:06] yeah, and if that fixes it, let's also schedule updates for the remainders of https://phabricator.wikimedia.org/T216240; I'm pretty sure there are similar time bombs in the other servers not yet updated
[14:57:26] it would suck to run into this when we have to firefight something urgently
[14:57:41] moritzm: I don't remember the details, but unlike what you said, I think it kept happening sometimes after upgrade
[14:57:56] that is why we stopped doing it systematically
[14:58:15] jynus, moritzm: the SEL shows a ton of errors
[14:58:22] see https://phabricator.wikimedia.org/P13187
[14:59:08] moritzm: however, Jaime says it worked on the ticket
[14:59:19] so maybe it wouldn't hurt, IF it is related
[15:00:02] it's been a long-standing ticket and my memory misses some details, but at least for manuel's test it was working reliably: https://phabricator.wikimedia.org/T216240#5134952
[15:00:24] and before the firmware was updated, it was a bit of a head scratcher and fairly reproducible
[15:00:30] moritzm: yeah, see also: https://phabricator.wikimedia.org/T216240#4965126
[15:01:03] rebooting again while attached to the console
[15:01:19] kormat: the problem with CPU errors is that it can be anything
[15:01:24] kormat: the CPU error might be a red herring here; the issue was something in early boot (hence the lack of details on the console), that's why it was only showing up during reboots
[15:01:28] CPU, motherboard, memory...
[15:01:41] or what moritzm says
[15:01:46] I'd say: let's schedule a fw update for 2077 first and if that doesn't fix it, we can still escalate to Dell
[15:01:50] +1
[15:02:10] kormat: make sure to extend the downtime of the host if needed
[15:05:55] alright, _this_ time it hung at "Loading ramdisk..."
for a minute or so, and then it booted
[15:06:06] which is different behaviour than last time
[15:06:19] yeah, smells like that ticket from last time
[16:00:29] herron: can you invite me to the o11y office hours on Monday? i'm trying to find it in gcal but failing
[16:01:11] Oh, is it part of your 'weekly' meeting?
[16:01:22] ottomata: yes, second half of the weekly
[16:01:25] you see it?
[16:02:02] i see the weekly
[16:05:23] ok great, made a note in the meeting agenda
[16:38:58] there was a very weird blip for 3 mw14xx nodes in C3, leading to a ton of mcrouter TKOs (and mediawiki exceptions)
[16:39:01] https://grafana.wikimedia.org/d/000000549/mcrouter?viewPanel=9&orgId=1&from=1604507036219&to=1604507797129&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All
[16:45:44] mmm this looks strange
[16:45:46] Nov 4 16:24:59 asw2-c-eqiad vccpd[1845]: Member 3, interface vcp-255/1/0 went down
[16:46:15] and then
[16:46:16] Nov 4 16:25:00 asw2-c-eqiad vccpd[1845]: Member 3, interface vcp-255/1/0 came up
[16:47:11] Nov 4 16:25:00 asw2-c-eqiad vccpd[1845]: VCCPD_PROTOCOL_ADJUP: New adjacency to c042.d00e.9f20 on vcp-255/0/48
[16:47:37] effie: ^ fyi
[16:48:12] elukey: we're in the process of reimaging mc1036 to buster, I'm not sure if that's what you're seeing? but we're keeping an eye out for weird stuff
[16:48:34] rzl: hi!
nono, I think it is not related; the reimage hasn't started yet
[16:49:14] I jumped into another meeting and didn't proceed with it
[16:49:23] ahh okay
[16:49:33] sorry elukey, I should have known you'd already have a closer eye on that :)
[16:49:35] I will do so in a bit
[16:49:51] it seems as if one virtual-chassis link on asw2-c (rack 3) went down
[16:50:04] agreed, in that case it's probably network, yeah
[16:50:13] rzl: ahah yes, so excited about finally seeing 1.5 on mc10xx :D
[16:50:15] just checked and that's all the app servers in C3
[16:51:47] my understanding of vccpd is below zero :D
[16:53:36] so for fpc3 I see
[16:53:36] interface-name: vcp-255/1/0, State: Up, Expires in 59 secs
[16:53:37] Priority: 0, Up/Down transitions: 5, Last transition: 00:27:40 ago
[16:53:56] and fpc2
[16:53:56] interface-name: vcp-255/0/48, State: Up, Expires in 59 secs
[16:53:57] Priority: 0, Up/Down transitions: 5, Last transition: 00:27:47 ago
[16:54:08] the others have weeks as their "Last transition"
[16:54:49] XioNoX: --^
[16:55:04] (not sure if he is around or on holidays)
[16:55:46] elukey: might mean a faulty cable or optic
[16:55:55] but it shouldn't flap
[16:56:05] John is working on C4, could it be related?
[16:56:15] elukey: could be a bumped cable/optic
[16:56:22] elukey: can you open a task about it? going into a meeting
[16:56:41] XioNoX: sure I can, in exchange you'll tell me how to debug this :D
[16:58:37] no pb, in 1h
[17:10:19] https://phabricator.wikimedia.org/T267242
[17:16:40] elukey: commented. cmjohnson1, jclark-ctr: could you have a look at https://phabricator.wikimedia.org/T267242 today please?
[17:16:58] i am walking back to the cage now
[17:19:16] thanks!
[17:19:28] I'm in a meeting so might not be able to pay full attention here
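For vcp flaps like the vcp-255/1/0 one above, these Junos operational commands (run on the switch, here asw2-c-eqiad) are the usual starting points; command names are standard Junos for EX virtual chassis, but this is a debugging sketch rather than a full procedure:

```
show virtual-chassis vc-port             # per-member VC port status and flap counters
show virtual-chassis protocol adjacency  # VCCP adjacencies (the kind of output quoted above)
show log messages | match vccpd          # flap history, e.g. the 16:24:59 down/up pair
```

A low "Last transition" value with a non-zero "Up/Down transitions" count on a single vcp interface, as seen above for fpc2/fpc3, points at that one link (cable, optic, or a bump during on-site work) rather than the whole member.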