[09:59:28] <elukey>	 XioNoX: o/ anything happening in eqsin? I just saw varnishkafkas on cp50xx hosts getting into trouble temporarily
[09:59:40] <elukey>	 (while pushing data to kafka in eqiad)
[09:59:41] <elukey>	 https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=All&var-instance=All&from=now-3h&to=now
[10:00:07] <elukey>	 the most affected one is a single node, going to restart vk in there
[10:00:15] <elukey>	 but there was a jump on all metrics
[10:01:46] <ema>	 elukey: purged on cp5001 also had a temporary drop in kafka bytes sent https://grafana.wikimedia.org/d/RvscY1CZk/purged?viewPanel=38&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-instance=cp5001&from=now-6h&to=now
[10:03:10] <ema>	 at 09:42, which aligns with the varnishkafka troubles you mentioned
[10:03:56] <elukey>	 on cr3-eqsin I see a nice jump in traffic https://librenms.wikimedia.org/graphs/lazy_w=804/to=1610100000/device=159/type=device_bits/from=1610013600/legend=no/
[10:05:20] <elukey>	 anyway, restarted vk on 5001, librdkafka was kinda stuck
[10:07:07] <ema>	 elukey: purged seems to have recovered well on its own, no restart needed
[10:17:39] <_joe_>	 no maintenance expected right now
[10:17:57] <_joe_>	 ema: purged >> vk
[14:28:25] <klausman>	 So say due to a BIOS flub, a machine got reimaged twice, and now the puppet cert and SSH host keys are wrong. How does one wipe the relevant state to make the first puppet run work again?
[14:38:43] <kormat>	 ^ this might be resolved now
[14:41:05] <klausman>	 Yep. Figured out the "remove certs on client, make new ones and sign them" sequence
[15:22:53] <_joe_>	 klausman: yeah that's horrible :D
[15:33:29] <volans|off>	 klausman: what chain of events led to that happening? the reimage nukes the old cert before reimaging
[15:34:54] <volans|off>	 we also have an sre.puppet.renew-cert cookbook but I think it doesn't support (yet?) a not-yet-installed host (it doesn't have the cumin key yet, so it should use the install key)
[15:38:22] <volans|off>	 btw there are 3 unsigned CSRs on puppetmaster1001 (10.3.0.1, d-i-test, webproxy)
[15:39:00] <volans|off>	 maybe we should add an icinga check for unsigned CSRs older than X
[15:39:19] * volans|off back being off, will read any eventual reply later
[15:57:23] <klausman>	 volans, the problem was that after being installed and everything, the machine rebooted, but PXE booted again and then got stuck.
[15:57:39] <klausman>	 It didn't help that the SSH host keys were also wrong by that time.
[16:15:28] <elukey>	 andrewbogott: o/
[16:16:07] <andrewbogott>	 elukey: what's up?
[16:16:11] <elukey>	 in https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ cloudcephmon200x-dev nodes are in red, afaics from phabricator they are already active right? If so I can change their status
[16:16:26] <andrewbogott>	 yep, they're active
[16:16:27] <andrewbogott>	 thx
[16:16:29] <elukey>	 super
[16:21:39] <elukey>	 andrewbogott: sorry to ping again, same thing for cloudcephostd20xx ?
[16:22:05] <andrewbogott>	 yes, same.  Is that something I needed to clean up in netbox by hand or does the image script do that in theory?
[16:23:52] <elukey>	 from the dcops point of view I think that the host goes from planned to staged, then active needs to be set by the service owner when there is the hand-off
[16:23:55] <elukey>	 IIUC
[16:24:08] <elukey>	 in this case there might have been leftovers, I was checking alerts
[16:25:12] <elukey>	 hnowlan: o/ question for you about maps nodes :)
[16:25:22] <elukey>	 (please don't kill me if I am the 10th person asking)
[16:25:24] <hnowlan>	 elukey: go for it! :)
[16:25:31] <elukey>	 :)
[16:25:33] <elukey>	 https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
[16:25:51] <elukey>	 so maps1009 and others are in mixed states in netbox
[16:26:21] <elukey>	 I see that some of them have had puppet disabled for a while, we can try to adjust states so we clear the alert
[16:26:27] <elukey>	 there is also restbase2009 since we are here :)
[16:28:14] <andrewbogott>	 elukey: makes sense.  thanks for cleaning up my mess!
[16:29:10] <hnowlan>	 elukey: oh, didn't realise these were causing issues - all but restbase2009 have puppet disabled. I think I can just resolve maps1009 by starting puppet again. What should I do for the others?
[16:29:19] <hnowlan>	 restbase2009 has hardware issues, I will try to get that looked at
[16:30:57] <elukey>	 hnowlan: ah okok! It is not a big issue, it is just netbox records :) So for restbase2009 we can set it as failed, lemme do it
[16:31:40] <hnowlan>	 thanks!
[16:32:03] <elukey>	 remaining ones are maps2002 and 2007, are those not active?
[16:32:12] <elukey>	 or possibly parked waiting for something
[16:32:46] <hnowlan>	 maps2007 is a testing node so I'd like to keep puppet off. maps2002 is an unhealthy node
[16:32:57] <hnowlan>	 so I need to keep puppet disabled until I figure out what to do I guess heh
[16:33:58] <elukey>	 ah okok so let's possibly set maps2002 as failed? For maps2007 not sure, it is active in theory
[16:35:30] <elukey>	 all right https://netbox.wikimedia.org/dcim/devices/156/ is set as failed, and I left a comment at the bottom
[16:35:36] <elukey>	 so people can blame me in case :D
[16:36:25] <elukey>	 so maps2007 is missing from puppetdb due to puppet disabled, and netbox is complaining since the status is "Active"
[16:36:30] <hnowlan>	 nice! Thanks
[16:37:49] <elukey>	 hnowlan: what do we do for maps2007? In theory the less puppet stays disabled the better.. I have a similar use case for hadoop test nodes, I solved it by boxing tests in $days and re-enabling puppet as much as possible
[16:40:20] <hnowlan>	 I'll try to check in with the team and see if we can reenable it for now
[16:40:42] <elukey>	 super thanks
[16:50:18] <jbond42>	 for the python geeks: should one use `map` or a list comprehension? i.e. `' '.join([str(i) for i in eyes])` vs `' '.join(map(str, eyes))`
[16:51:37] <rzl>	 neither, just strip off the brackets and use the generator: `' '.join(str(i) for i in eyes)`
[16:52:19] <elukey>	 map is also nice to see compared to the list comprehension
[16:52:52] <jbond42>	 rzl: i thought that was just a shorthand for `' '.join([str(i) for i in eyes])`, are they functionally different?
[16:52:55] <rzl>	 yeah the map is fine too
[16:53:23] <cdanis>	 in this context I don't think they're different because join will consume all of it regardless?
[16:53:26] <rzl>	 jbond42: not usually in a significant way -- the list is eager and the generator is lazy, so it's more performant in cases where that matters
[16:53:50] <jbond42>	 ahh ok thanks
[16:53:52] <rzl>	 (it'll consume all of it but the eager evaluation still consumes more memory, again in cases where that matters)
[16:54:18] <rzl>	 but even in the common case where it doesn't make a difference, the brackets are just an extra unnecessary complication, leave em out
[16:54:25] <rzl>	 you used to need them but you don't anymore
[16:54:55] <jbond42>	 cool thanks will make a note to drop them
[16:55:31] <sukhe>	 I almost assume comprehensions to be more Pythonic now given that I have come across very few cases of map being used in Python code, but confirmation bias is possible
[16:58:13] <rzl>	 yeah -- if you're writing a bunch of functional-style python then absolutely use map, filter, reduce and friends -- but it's pretty rare to do that nowadays
[16:58:13] <cdanis>	 I think that's canon, sukhe
[16:58:40] <rzl>	 and if you're not writing a bunch of functional-style code, throwing in a single map() is kind of an aberration, even though it isn't wrong
[16:59:22] <rzl>	 as it is said, there should be one -- and preferably not more than six -- obvious ways to do it
[16:59:45] <cdanis>	 rzl: https://twitter.com/mcclure111/status/1278517867445706758
[17:00:02] <rzl>	 a tweet that became classic milliseconds after it was posted
[17:00:50] <sukhe>	 ha
[17:05:17] <jbond42>	 :D
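For reference, the three spellings compared above can be put side by side in a minimal sketch. `eyes` is a hypothetical stand-in for whatever iterable was being joined, and `traced_str` is a helper made up here only to show when each item gets converted. All three joins give the same string; the second half illustrates the eager vs lazy distinction rzl describes.

```python
# `eyes` is a hypothetical stand-in for the iterable from the discussion.
eyes = range(5)

# The three spellings under discussion; all produce the same string.
as_list = ' '.join([str(i) for i in eyes])  # list comprehension (eager)
as_map = ' '.join(map(str, eyes))           # map (lazy iterator in Python 3)
as_gen = ' '.join(str(i) for i in eyes)     # generator expression (lazy)


def traced_str(value, log):
    """Hypothetical helper: record each conversion so evaluation order is visible."""
    log.append(value)
    return str(value)


# Eager: the list comprehension converts every item immediately.
eager_calls = []
eager = [traced_str(i, eager_calls) for i in eyes]

# Lazy: building the generator converts nothing; items are produced on demand.
lazy_calls = []
lazy = (traced_str(i, lazy_calls) for i in eyes)
first = next(lazy)  # only the first item has been converted at this point
```

As far as I know, CPython's `str.join` first materializes a non-list, non-tuple argument into a sequence internally, so for `join` specifically the memory saving from the generator is likely negligible, which matches cdanis's point. The lazy form pays off more with consumers that can stop early or stream, such as `any()` or `sum()`.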
[17:39:08] <cdanis>	 if someone has a minute, I'd appreciate looks at https://gerrit.wikimedia.org/r/655100 and https://gerrit.wikimedia.org/r/655109
[17:41:13] <rzl>	 lookin
[18:02:46] <cdanis>	 rzl: confirmed, nothing sensitive there that isn't something the user has access to anyway
[18:03:53] <rzl>	 👍