[00:15:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:25:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[02:05:34] FIRING: DiskSpace: Disk space build2001:9100:/ 5.911% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:00:34] RESOLVED: DiskSpace: Disk space build2001:9100:/ 5.907% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:40:54] 10netops, 06Infrastructure-Foundations, 06SRE: Manage VRRP priority from Netbox - https://phabricator.wikimedia.org/T381873 (10cmooney) 03NEW p:05Triage→03Low
[15:11:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:16:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:21:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:34:14] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:41:59] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:34:16] o/, I need a cookbook rubber duck, so I'll just type in this here text field and see what happens
[16:34:36] (is "cookbook" a v.olans-summoning spell? :D)
[16:35:09] lol, my guess is it will work :)
[16:35:24] if not I can possibly assist, though I am not really an expert on cookbooks
[16:35:32] thanks <3
[16:36:00] rotfl
[16:36:02] what's up?
[16:36:06] haha see??
[16:36:10] :D
[16:36:22] but I will not reveal which words summoned me, if any
[16:36:30] I am writing a batch-reimage cookbook that takes a list of hosts, so it would be great if I could create spicerack locks such that a given host can be in only one invocation of the cookbook at a time
[16:36:30] I watch this channel's updates anyway :D
[16:36:51] I was chatting about this with Janis the other day
[16:36:54] so it seems like what I want to do is create N locks, one per host, at the beginning of the cookbook?
[16:36:58] oh, yeah, it's that CR :-)
[16:36:59] not sure if he shared our chat with you or not
[16:37:04] nope
[16:37:16] but we're looking at the same problem, I assume
[16:37:36] yes, it was a chat born from that CR
[16:38:14] so right now there are 2 possible ways; going forward we might want to add another way in spicerack itself
[16:38:24] the problem with N locks in __init__ is that I don't get to use `with`, so it can get messy
[16:38:39] ok, I'm listening :D
[16:39:54] quick and dirty: set the lock to be per-cluster/DC and allow your parallel one to run only once per DC per cluster
[16:40:11] is that enough, or does it limit the way you were planning to use it?
[16:40:13] sure, I already did that one :D
[16:40:25] like multiple runs with smaller sets of hosts
[16:40:26] it's good enough for now, I suppose
[16:40:55] so whether I want to do multiple smaller runs kinda depends on other things
[16:41:11] the second way is to get a lock for each single host in your batch; a bit heavier on etcd if you go with larger batches, very OK for smaller batches of, say, 10 hosts
[16:41:14] (for now what I actually did was per cluster with concurrency 5, because that's what I pulled out of my you know where)
[16:41:34] I would have made it exclusive per cluster/DC
[16:41:40] given it already affects N hosts
[16:41:46] ok
[16:42:13] for the second way, from https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking you can get "Cookbooks custom locks" in addition to the one automatically created
[16:42:28] yes, I've been looking at that
[16:42:40] so just do that in __init__ and be careful about lock cleanup?
[16:43:04] in run(), not __init__, is better; you can create your own context manager that loops over them and then ensures cleanup
[16:43:17] and in run() have a single with statement
[16:43:19] oh, okay, yes, that's better
[16:43:20] thanks!
[16:43:22] that helps
[16:43:46] the 3rd way we don't have now
[16:44:25] FIRING: SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:40] I was thinking of adding a weight field to the locks, so we could get a cluster lock that, say, for a cluster of 50 hosts has a concurrency of 10, meaning 10 hosts. And each time you run the cookbook you say that your lock weighs N, where N is the number of affected hosts
[16:44:54] it doesn't prevent running on the same host twice, but it prevents running on too many hosts in the cluster
[16:45:03] mmm, yeah, that could be neat
[16:45:13] oh I know, I should also lock on taint group
[16:45:18] thanks, good 🦆
[16:45:19] :D
[16:45:34] (and one day maybe make a weight thing on taint group!)
[16:45:42] eheheh
[16:45:47] sorry I did not forward our chat to you kamila_ :/
[16:46:36] np jayme, seems like we arrived at the same ideas?
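A minimal sketch of the per-host locking pattern volans describes above: acquire one custom lock per host inside run(), behind a single with statement, with contextlib.ExitStack guaranteeing that already-acquired locks are released even if a later acquisition (or the work itself) fails. The lock.acquired() call follows the spicerack distributed-locking docs linked above, but its exact signature, the key name, and the TTL here are assumptions for illustration, not the final cookbook code.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def host_locks(lock, hosts, ttl=7200):
    """Acquire one custom spicerack lock per host; release them all on exit.

    ``lock`` is assumed to be the object returned by spicerack's custom-lock
    API (see the distributed-locking docs above); the key prefix and TTL
    are hypothetical.
    """
    with ExitStack() as stack:
        for host in hosts:
            # concurrency=1: a given host may be in at most one run at a time.
            stack.enter_context(
                lock.acquired(f"sre.k8s.batch-reimage:{host}", concurrency=1, ttl=ttl)
            )
        yield  # all locks held; released (even on error) when the block exits


# In the cookbook's run(), a single with statement as suggested above:
#     with host_locks(self.spicerack.lock(), self.hosts):
#         self._reimage_batch()
```

The quick-and-dirty per-cluster/DC variant from earlier in the conversation is the same call with one cluster-wide key and a higher concurrency, instead of one key per host.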
[16:47:20] yeah, absolutely. I'd say go with a per-host lock in the first place, to prevent parallel runs for the same node.
[16:47:22] kamila_: lmk when you have a more or less final version of what you have in mind and I'll be happy to review it
[16:47:49] thanks to both of you <3
[16:47:51] will do
[16:47:53] as for the name of the custom lock, you can pick anything, so if this is meant to prevent any kind of action on those hosts from multiple K8S cookbooks
[16:48:00] we need to think about a general (k8s) strategy on how to ensure we're not taking too much capacity out of the cluster in a more general way, I suppose
[16:48:01] yeah
[16:48:03] think about a good naming that might make sense to add to other cookbooks too
[16:48:30] the custom ones are not attached to the cookbook name like the automatic one
[16:48:51] the other approach is to add a spicerack-specific one in the spicerack module
[16:49:02] if you have a common entry point
[16:49:25] FIRING: [5x] SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:26] like cordon/uncordon
[16:50:39] that's not a bad idea
[16:51:03] I'll think about whether it makes more sense to put it in cookbooks or spicerack itself, quite likely spicerack
[16:51:21] thanks a ton volans, super useful input
[16:51:54] jayme: re capacity, that sounds like an excellent candidate for the weight thing :D
[16:52:17] (I assume that's where the idea came from)
[16:52:34] happy to help with implementing that once the time comes
[16:53:00] (correct)
[16:53:02] otoh we have live data in the k8s API that we could query ...
[16:53:09] also true
[16:54:00] and check if it's drainable or not, and similar
[16:54:07] it will also cover the case of hosts being down or similar
[16:54:17] or manual interventions that the locks don't prevent
[16:54:25] RESOLVED: [5x] SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:30] so we already have the "will refuse to do things when the host is cordoned" check in the cookbook
[16:54:32] so +1 for API checks
[16:54:44] sounds like somebody should stop writing code and think for a bit :D
[16:55:03] just make sure to leave a nuclear option in case you WANT to run something even if it's a bad idea in general
[16:55:11] so some force option or similar
[16:55:32] yeah, but for now that's not necessary
[16:55:55] I meant if you put the limit in the spicerack module itself based on API info
[16:56:00] right now I am only doing batch reimages, which you can always run by hand
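A sketch of the "refuse to do things when the host is cordoned" check discussed above, using the official kubernetes Python client; the actual cookbooks may use a different wrapper, and the node name and --force wording are illustrative only.

```python
from kubernetes import client, config


def is_cordoned(node_name: str) -> bool:
    """Return True if the node is marked unschedulable (cordoned)."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    # spec.unschedulable is None when the node is schedulable.
    return bool(node.spec.unschedulable)


# hypothetical usage in a cookbook, with the "nuclear option" escape hatch:
if is_cordoned("kubernetes1001.example") and not args.force:
    raise RuntimeError("refusing to act: node is cordoned (pass --force to override)")
```

The same read_node() response carries taints and conditions, so the drainability and "host down" checks mentioned above can be layered onto this without extra API calls.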
[16:56:07] I was planning on spending some time on a bigger style refactor of the k8s cookbooks next year ... there is a lot of duplication in there already
[16:56:08] yeah, there it's a different thing
[16:56:08] yes
[16:56:09] thanks
[16:56:16] yep jayme
[16:57:15] volans: that's why I'm not really sure about adding the API checks into spicerack and would be more inclined to stick them in the cookbook abstract class for now
[16:57:26] smaller blast radius :D
[16:58:41] we can move them to spicerack later if they're deemed helpful
[16:58:48] exactly
[16:59:09] I would also like to get this merged in finite time
[16:59:19] so "can improve later" is a strong argument IMO
[17:00:02] * kamila_ 's inner perfectionist is dying
[17:05:28] sure sure, cookbook first, spicerack later is totally ok
[17:50:33] 10netops, 06Data-Platform-SRE, 06Infrastructure-Foundations, 06SRE: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10394722 (10cmooney) >>! In T381389#10389706, @BTullis wrote: > This change looks fine to me, but would it be OK to wait until the New Ye...
[18:18:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[18:23:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[18:28:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[19:35:48] Heyo. https://phabricator.wikimedia.org/T375569 has us removing RSA certs from various services in Puppet. Mail services are one of the remaining ones with RSA. My guess would be that we'll still want to use RSA certs for SMTP but remove it for everything else? What does IF want?
[19:57:53] I would love to remove them on the mail servers as well to keep consistency; what is the best way to monitor usage or need?
[21:28:06] jhathaway: FWIW TLS 1.0 seems to still be in use as well. I figured that we wanted max compat with SMTP.
[21:28:17] I grepped some logs for some metrics at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075604/comments/24ffbfac_ceb941b6
[21:32:12] nod, thanks brett; looks like we are not logging any TLS handshake info on the new postfix servers, I'll cut a card to add that
[21:34:11] 10Mail, 06Infrastructure-Foundations, 06SRE: Log tls cipher information - https://phabricator.wikimedia.org/T381927 (10jhathaway) 03NEW
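For the TLS-cipher logging task filed above (T381927), the usual Postfix knob is the TLS log level; a sketch of what the main.cf change might look like, assuming a stock Postfix setup rather than whatever Wikimedia-specific configuration the Puppet module actually manages:

```
# /etc/postfix/main.cf
# Level 1 logs a one-line summary per TLS handshake, including the
# protocol version and cipher, which is what's needed to measure
# RSA and TLS 1.0 usage before removing them.
smtpd_tls_loglevel = 1   # inbound (server-side) connections
smtp_tls_loglevel  = 1   # outbound (client-side) connections
```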