[06:56:44] <elukey>	 arturo: o/ - there is an alert for cloudvirt hosts in puppet, it seems that the class backy2 tries to install python3-fusepy on stretch that is not available (only in stretch-backports)
[06:56:58] <elukey>	 s/in puppet/in icinga/
[08:32:44] <arturo>	 thanks elukey  will look into it soon
[08:39:20] <_joe_>	 I've disabled puppet on a bunch of hosts, I'm doing some changes that are relatively risky
[12:14:33] <Urbanecm>	 scap sync-file throws "snapshot1010.eqiad.wmnet returned [255]: Host key verification failed." as  warning - dunno if that's an expected or an issue - can someone weigh in in -operations please?
[12:18:35] <sobanski>	 I can ssh to snapshot1010.eqiad.wmnet just fine
[12:19:00] <Urbanecm>	 I can ssh there as well, but scap has hardcoded host keys iirc - so it doesn't work for scap
[12:19:41] <_joe_>	 Urbanecm: lemme run puppet on deploy1001
[12:19:47] <Urbanecm>	 thanks _joe_
[13:49:28] <andrewbogott>	 elukey, arturo, what host is that?  Many of those cloudvirts have a buster-only puppet role applied but are still awaiting reimage to buster
[13:49:33] <andrewbogott>	 But in theory they're all silenced
[13:51:32] <elukey>	 andrewbogott: puppet keeps doing changes every run, due to the package not being present in streth
[13:51:35] <elukey>	 *stretch
[13:51:43] <elukey>	 (there is an aggregate alarm about it)
[13:52:36] <andrewbogott>	 ah, so even if a host is in downtime it shows up?
[13:53:33] <andrewbogott>	 that seems like a problem with the check — downtiming alerts is definitely a thing we should support
[13:55:54] <jynus>	 I think the "downtime" procedure would be to disable puppet, assuming the reimage will happen soon
[13:56:16] <jynus>	 otherwise maybe add a conditional instalation, depending on the os version?
[13:57:13] <andrewbogott>	 It's not that I can't think of ways to work around it, it's just that I marked the host as in downtime in icinga and that should be the end of it.  Otherwise 'downtime' doesn't actually mean downtime?
[13:57:31] <jynus>	 no, downtime the host is working
[13:57:46] <jynus>	 but it is the aggregate non-host check that is failing
[14:09:36] <elukey>	 andrewbogott: you can disable puppet and it should be fine, the check is unrelated to downtimed hosts, it check which ones have puppet runs doing something (probably) weird because misconfigured
[14:09:43] <jynus>	 yep
[14:10:44] <jynus>	 the idea behind the aggregation was to avoid multiple alerts spam, so only 1 is received instead 0:-)
[14:11:01] <andrewbogott>	 I understand.  My point is that "unrelated to downtimed hosts" == the test is reporting incorrectly.  It should not include downtimed hosts in the aggregate.
[14:15:51] <kormat>	 akosiaris: good email re: alerting for maps.
[14:16:25] <akosiaris>	 kormat: ♥
[14:41:41] <_joe_>	 there is just an error: there is a runbook for maps :)
[14:46:27] <_joe_>	 https://wikitech.wikimedia.org/wiki/Maps/Runbook
[14:48:20] <kormat>	 huh. db2077 has paused on boot, with just `[  OK  ` showing on the serial console. useful.
[14:50:59] <jynus>	 kormat: I think there is a ticket about that
[14:51:57] <moritzm>	 https://phabricator.wikimedia.org/T216240
[14:51:59] <jynus>	 kormat: https://phabricator.wikimedia.org/T216240
[14:52:03] <moritzm>	 the firmware from hell
[14:52:05] <jynus>	 you find it first
[14:52:09] <jynus>	 *found
[14:52:27] <kormat>	 mm. would this also cause a CPU error in the management logs?
[14:52:48] <jynus>	 I don't think so, it was only boot failure (non-deterministic)
[14:53:00] <kormat>	 ok. then this seems different.
[14:53:01] <jynus>	 it "ended up booting"
[14:53:51] <jynus>	 but you know what is the support think we will get from vendor, right?
[14:53:54] <jynus>	 *thing
[14:54:03] <jynus>	 "upgrade firmware"
[14:54:13] <jynus>	 :-(
[14:54:20] <moritzm>	 actually, I think it's the same error, there's nothing in SEL and the symptom of failing boots is the same as I remember it?
[14:54:41] <moritzm>	 and TTBOMK we didn't run into this again once the firmwares were updated
[14:54:46] <jynus>	 it could be- in any case it could be related, based on the range
[14:54:55] <kormat>	 https://phabricator.wikimedia.org/T267220 filed.
[14:55:54] <jynus>	 so either "update firmware" or "drain power"
[14:56:13] <jynus>	 add dc-ops anyway
[14:57:06] <moritzm>	 yeah, and it that fixes it, let's also schedule updates for the remainders of https://phabricator.wikimedia.org/T216240, I'm pretty sure there are similar time bombs in the other servers not yet updated
[14:57:26] <moritzm>	 it would suck to run into this when we have to firefight something urgently
[14:57:41] <jynus>	 moritzm: I don't remember the details, but unlike what you said, I think it kept happening sometimes after upgade
[14:57:56] <jynus>	 that is why we stopped doing it systematically
[14:58:15] <kormat>	 jynus, moritzm: the SEL shows a ton of errors
[14:58:22] <kormat>	 see https://phabricator.wikimedia.org/P13187
[14:59:08] <jynus>	 moritzm: however, Jaime says it worked on the ticket
[14:59:19] <jynus>	 so maybe it wouldn't hurt, IF it is related
[15:00:02] <moritzm>	 it's been a long-standing ticket my memory misses some details, but at least for manuel's test it was working reliably: https://phabricator.wikimedia.org/T216240#5134952
[15:00:24] <moritzm>	 and before the firmware was updated, it was a bit of a head scratcher and fairly reproducible
[15:00:30] <jynus>	 moritzm: yeah, see also : https://phabricator.wikimedia.org/T216240#4965126
[15:01:03] <kormat>	 rebooting again while attached to the console
[15:01:19] <jynus>	 kormat: the problem with CPU errors is that it can be anything
[15:01:24] <moritzm>	 kormat: the CPU error might be a red herring here, the issue was something in early boot (hence the lack of details on the console), that's why it was only showing up during reboots
[15:01:28] <jynus>	 CPU, motherboard, memory...
[15:01:41] <jynus>	 or what moritzm says
[15:01:46] <moritzm>	 I'd say: let's schedule a fw update for 2077 first and if that doesn't fix it, we can still escalate to Dell
[15:01:50] <jynus>	 +1
[15:02:10] <jynus>	 kormat: make sure to extend downtime of the host if needed
[15:05:55] <kormat>	 alright, _this_ time it hung at "Loading ramdisk..." for a minute or so, and then it booted
[15:06:06] <kormat>	 which is different behaviour than last time
[15:06:19] <jynus>	 yeah, smels like the ticket that last time
[16:00:29] <ottomata>	 herron: can you invite me to 011y office hours on Monday?  i'm trying to find in gcal but failling
[16:01:11] <ottomata>	 Oh is it part of your 'weekly' meeting?
[16:01:22] <herron>	 ottomata: yes second half of the weekly
[16:01:25] <herron>	 you see it?
[16:02:02] <ottomata>	 i see the weeklyy
[16:05:23] <herron>	 ok great, made a note in the meeting agenda
[16:38:58] <elukey>	 there was a very weird blip for 3 mw14xx nodes in C3, yielding to a ton of mcrouter tkos (and mediawiki excp)
[16:39:01] <elukey>	 https://grafana.wikimedia.org/d/000000549/mcrouter?viewPanel=9&orgId=1&from=1604507036219&to=1604507797129&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All
[16:45:44] <elukey>	 mmm this looks strange
[16:45:46] <elukey>	 Nov  4 16:24:59  asw2-c-eqiad vccpd[1845]: Member 3, interface vcp-255/1/0 went down
[16:46:15] <elukey>	 and then
[16:46:16] <elukey>	 Nov  4 16:25:00  asw2-c-eqiad vccpd[1845]: Member 3, interface vcp-255/1/0 came up
[16:47:11] <elukey>	 Nov  4 16:25:00  asw2-c-eqiad vccpd[1845]: VCCPD_PROTOCOL_ADJUP: New adjacency to c042.d00e.9f20 on vcp-255/0/48
[16:47:37] <rzl>	 effie: ^ fyi
[16:48:12] <rzl>	 elukey: we're in the process of reimaging mc1036 to buster, I'm not sure if that's what you're seeing? but we're keeping an eye out for weird stuff
[16:48:34] <elukey>	 rzl: hi! nono I think it is not related, the reimage hasn't started yet
[16:49:14] <effie>	 I jumped into another meeting and didn't proceed with it
[16:49:23] <rzl>	 ahh okay
[16:49:33] <rzl>	 sorry elukey I should have known you'd already have a closer eye on that :)
[16:49:35] <effie>	 I will do so in a bit
[16:49:51] <elukey>	 it seems as if one link between asw2-c in rack 3 went down
[16:50:04] <rzl>	 agree in that case it's probably network yeah
[16:50:13] <elukey>	 rzl: aahah yes so excited about finally seeing 1.5 on mc10xx :D
[16:50:15] <rzl>	 just checked and that's all the app servers in C3
[16:51:47] <elukey>	 my understanding of vccpd is below zero :D
[16:53:36] <elukey>	 so for fpc3 I see
[16:53:36] <elukey>	   interface-name: vcp-255/1/0, State: Up, Expires in 59 secs
[16:53:37] <elukey>	   Priority: 0, Up/Down transitions: 5, Last transition: 00:27:40 ago
[16:53:56] <elukey>	 and fpc2
[16:53:56] <elukey>	   interface-name: vcp-255/0/48, State: Up, Expires in 59 secs
[16:53:57] <elukey>	   Priority: 0, Up/Down transitions: 5, Last transition: 00:27:47 ago
[16:54:08] <elukey>	 the others have as "Last transition" weeks
[16:54:49] <elukey>	 XioNoX: --^
[16:55:04] <elukey>	 (not sure if he is around or on holidays)
[16:55:46] <XioNoX>	 elukey: might need a faulty cable or optic
[16:55:49] <XioNoX>	 but it shouldn't flap
[16:55:55] <XioNoX>	 s/need/mean/
[16:56:05] <elukey>	 John is working on C4, could it be related?
[16:56:15] <XioNoX>	 elukey: could be a bumped cable/optic
[16:56:22] <XioNoX>	 elukey: can you open a task about to go in a meeting
[16:56:41] <elukey>	 XioNoX: sure I can, in exchange you'll tell me how to debug this :D
[16:58:37] <XioNoX>	 no pb, in 1h
[17:10:19] <elukey>	 https://phabricator.wikimedia.org/T267242
[17:16:40] <XioNoX>	 elukey: commented. cmjohnson1, jclark-ctr could you have a look at https://phabricator.wikimedia.org/T267242 today please?
[17:16:58] <jclark-ctr>	 i am walking back to cage now
[17:19:16] <XioNoX>	 thanks!
[17:19:28] <XioNoX>	 I'm in a meeting so might not be able to pay full attention here