[09:59:28] XioNoX: o/ anything happening in eqsin? I've just seen varnishkafkas on cp50xx hosts getting into trouble temporarily
[09:59:40] (while pushing data to kafka in eqiad)
[09:59:41] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-source=webrequest&var-cp_cluster=All&var-instance=All&from=now-3h&to=now
[10:00:07] the most affected one is a single node, going to restart vk in there
[10:00:15] but there was a jump on all metrics
[10:01:46] elukey: purged on cp5001 also had a temporary drop in kafka bytes sent https://grafana.wikimedia.org/d/RvscY1CZk/purged?viewPanel=38&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-instance=cp5001&from=now-6h&to=now
[10:03:10] at 09:42, which aligns with the varnishkafka troubles you mentioned
[10:03:56] on cr3-eqsin I see a nice jump in traffic https://librenms.wikimedia.org/graphs/lazy_w=804/to=1610100000/device=159/type=device_bits/from=1610013600/legend=no/
[10:05:20] anyway, restarted vk on 5001, librdkafka was kinda stuck
[10:07:07] elukey: purged seems to have recovered well on its own, no restart needed
[10:17:39] <_joe_> no maintenance expected right now
[10:17:57] <_joe_> ema: purged >> vk
[14:28:25] So say due to a BIOS flub, a machine got reimaged twice, and now the puppet cert and SSH host keys are wrong. How does one wipe the relevant state to make the first puppet run work again?
[14:38:43] ^ this might be resolved now
[14:41:05] Yep. Figured out the "remove certs on client, make new ones and sign them" sequence
[15:22:53] <_joe_> klausman: yeah that's horrible :D
[15:33:29] klausman: what chain of events led to that happening? the reimage nukes the old cert before reimaging
[15:34:54] we also have an sre.puppet.renew-cert cookbook but I think it doesn't support (yet?)
a not-yet-installed host (it doesn't have the cumin key yet, so it should use the install key)
[15:38:22] btw there are 3 unsigned CSRs on puppetmaster1001 (10.3.0.1, d-i-test, webproxy)
[15:39:00] maybe we should add an icinga check for unsigned CSRs older than X
[15:39:19] * volans|off back being off, will read any eventual reply later
[15:57:23] volans, the problem was that after being installed and everything, the machine rebooted, but PXE booted again and then got stuck.
[15:57:39] It didn't help that the SSH host keys were also wrong by that time.
[16:15:28] andrewbogott: o/
[16:16:07] elukey: what's up?
[16:16:11] in https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ cloudcephmon200x-dev nodes are in red, afaics from phabricator they are already active right? If so I can change their status
[16:16:26] yep, they're active
[16:16:27] thx
[16:16:29] super
[16:21:39] andrewbogott: sorry to ping again, same thing for cloudcephostd20xx ?
[16:22:05] yes, same. Is that something I needed to clean up in netbox by hand or does the image script do that in theory?
[16:23:52] from the dcops point of view I think that the host goes from planned to staged, then active needs to be set by the service owner when there is the hand-off
[16:23:55] IIUC
[16:24:08] in this case there might have been leftovers, I was checking alerts
[16:25:12] hnowlan: o/ question for you about maps nodes :)
[16:25:22] (please don't kill me if I am the 10th person asking)
[16:25:24] elukey: go for it! :)
[16:25:31] :)
[16:25:33] https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
[16:25:51] so maps1009 and others are in mixed states in netbox
[16:26:21] I see that some of them have had puppet disabled for a while, we can try to adjust states so we clear the alert
[16:26:27] there is also restbase2009 since we are here :)
[16:28:14] elukey: makes sense. thanks for cleaning up my mess!
[16:29:10] elukey: oh, didn't realise these were causing issues - all but restbase2009 have puppet disabled. I think I can just resolve maps1009 by starting puppet again. What should I do for the others?
[16:29:19] restbase2009 has hardware issues, I will try to get that looked at
[16:30:57] hnowlan: ah okok! It is not a big issue, it is just netbox records :) So for restbase2009 we can set it as failed, lemme do it
[16:31:40] thanks!
[16:32:03] remaining ones are maps2002 and 2007, are those not active?
[16:32:12] or possibly parked waiting for something
[16:32:46] maps2007 is a testing node so I'd like to keep puppet off. maps2002 is an unhealthy node
[16:32:57] so I need to keep puppet disabled until I figure out what to do I guess heh
[16:33:58] ah okok so let's possibly set maps2002 as failed? For maps2007 not sure, it is active in theory
[16:35:30] all right https://netbox.wikimedia.org/dcim/devices/156/ is set as failed, and I left a comment at the bottom
[16:35:36] so people can blame me in case :D
[16:36:25] so maps2007 is missing from puppetdb due to puppet disabled, and netbox is complaining since the status is "Active"
[16:36:30] nice! Thanks
[16:37:49] hnowlan: what do we do for maps2007 ? In theory the less puppet stays disabled the better.. I have a similar use case for hadoop test nodes, I solved boxing tests in $days and re-enabling puppet as much as possible
[16:40:20] I'll try to check in with the team and see if we can reenable it for now
[16:40:42] super thanks
[16:50:18] for the python geeks: should one use `map` or a list comprehension? i.e. `' '.join([str(i) for i in eyes])` vs `' '.join(map(str, eyes))`
[16:51:37] neither, just strip off the brackets and use the generator: `' '.join(str(i) for i in eyes)`
[16:52:19] map is also nice to see compared to the list comprehension
[16:52:52] rzl: i thought that was just a shorthand of `' '.join([str(i) for i in eyes])` are they functionally different?
[16:52:55] yeah the map is fine too
[16:53:23] in this context I don't think they're different because join will consume all of it regardless?
[16:53:26] jbond42: not usually in a significant way -- the list is eager and the generator is lazy, so it's more performant in cases where that matters
[16:53:50] ahh ok thanks
[16:53:52] (it'll consume all of it but the eager evaluation still consumes more memory, again in cases where that matters)
[16:54:18] but even in the common case where it doesn't make a difference, the brackets are just an extra unnecessary complication, leave em out
[16:54:25] you used to need them but you don't anymore
[16:54:55] cool thanks will make a note to drop them
[16:55:31] I almost assume comprehensions to be more Python-ic now given that I have come across very few cases of map being used in Python code, but confirmation bias is possible
[16:58:13] yeah -- if you're writing a bunch of functional-style python then absolutely use map, filter, reduce and friends -- but it's pretty rare to do that nowadays
[16:58:13] I think that's canon, sukhe
[16:58:40] and if you're not writing a bunch of functional-style code, throwing in a single map() is kind of an aberration, even though it isn't wrong
[16:59:22] as it is said, there should be one -- and preferably not more than six -- obvious ways to do it
[16:59:45] rzl: https://twitter.com/mcclure111/status/1278517867445706758
[17:00:02] a tweet that became classic milliseconds after it was posted
[17:00:50] ha
[17:05:17] :D
[17:39:08] if someone has a minute, I'd appreciate looks at https://gerrit.wikimedia.org/r/655100 and https://gerrit.wikimedia.org/r/655109
[17:41:13] lookin
[18:02:46] rzl: confirmed, nothing sensitive there that isn't something the user has access to anyway
[18:03:53] 👍
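
Editor's note on the 16:50 to 16:59 Python exchange: a minimal sketch of the three spellings that were compared. The contents of `eyes` are an assumption (the log never shows them), and the `total` line is an added illustration, not something from the chat, of a consumer that pulls items one at a time. One caveat: in CPython, `str.join` converts a non-list, non-tuple argument to a list internally before joining, so for `join` specifically the memory saving from the generator is likely negligible and the argument for dropping the brackets is mainly readability.

```python
# Hypothetical sample data; the log does not say what `eyes` contains.
eyes = [1, 2, 3]

# List comprehension: builds the whole list before join() sees it.
with_list = ' '.join([str(i) for i in eyes])

# map(): a lazy iterator in Python 3, items are produced on demand.
with_map = ' '.join(map(str, eyes))

# Generator expression: also lazy; the extra brackets are not needed
# when it is the sole argument to a call.
with_gen = ' '.join(str(i) for i in eyes)

print(with_list)
print(with_map)
print(with_gen)

# Added illustration: sum() consumes one item at a time, so a generator
# here never holds all the intermediate values in memory at once.
total = sum(len(str(i)) for i in eyes)
print(total)
```

All three `join` calls produce the same string; the choice between them is mostly stylistic, as the discussion above concludes.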