[10:05:05] <_joe_> Majavah: thanks for spotting the error in the apache config patch!
[10:05:12] <_joe_> really appreciated <3
[10:52:35] you will see some warnings on bacula monitoring, you can ignore those, it is a temporary artifact of refactoring
[11:04:20] hnowlan: o/ qq - do we have plans to make cassandra packages available for Buster, or should I open a task?
[11:08:35] some of the restbase hosts are already running buster
[11:09:06] same package is used as on stretch
[11:12:15] moritzm: I tried reprepro lsbycomponent cassandra on apt1001 and found only stretch/jessie packages, this is why I am asking
[11:16:16] ah yes, the cassandra puppet classes simply install packages based on the target version configured in Hiera, and install the cassandra311 component from stretch even on buster
[11:16:21] elukey: open a task please! there are restbase buster hosts but it looks like they're just installing the stretch cassandra311 component
[11:16:45] should be imported to buster as well, but 3.11 works fine on buster together with OpenJDK 8
[11:23:18] hnowlan: ack thanks!
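A hedged sketch of the component import described above, for reference only: the reprepro `copy` invocation, the distribution names (stretch-wikimedia, buster-wikimedia), the component name, and the package list are all assumptions based on the conversation, not a verified runbook.

    # On the apt host: copy the cassandra311 component's packages from the
    # stretch distribution to buster. All names here are assumptions --
    # verify against the actual repo layout before running anything.
    sudo reprepro -C component/cassandra311 copy buster-wikimedia stretch-wikimedia cassandra
    # then re-check, as in the command quoted above:
    reprepro lsbycomponent cassandra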
[11:32:05] is someone (an SRE) around for a backport in a few minutes, related to a train blocker? sadly it won't unblock the train but it will get us more logging, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/661456
[11:38:04] XioNoX, running the ES backups now from backup2001 -> backup1002 and backup1002 -> backup2002, in case it consumes too much cross-dc bandwidth
[11:39:27] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&refresh=5m&var-server=backup1002&var-datasource=thanos&var-cluster=misc&from=1612780762076&to=1612784362076
[11:48:18] well nm, we will just wait for the backport window in another ten minutes anyhow
[11:48:33] :w
[11:48:42] * jayme saved
[11:50:30] jayme: ⚽
[11:50:56] how is there a football emoji??
[11:51:07] and what does it mean?
[11:53:04] jayme -> 🥅. ⚽ -> ❌ -> 🥅
[11:53:06] see?
[11:54:04] barely :)
[14:09:02] jbond42: re the numa_networking stuff: while the observations in those patches are correct about its current utility etc., we had recently realized that we probably should have applied something similar to the nginx hack to our ats-tls, although we may just wait for ats-tls's replacement to bother with setting it up again.
[14:09:45] and yeah, if we keep it, it could stand to be generalized, but that's the point of it being this global flag: various modules can use per-daemon specializations (but yeah, maybe one systemd fragment could be done generically for most, not sure)
[14:11:16] luckily the numa_networking stuff doesn't have to be aware of avoid_cpu0 anyway
[14:13:21] bblack: completely agree that this is useful. in general I really don't like globals configured in realm. need to think about the various use cases, but I would probably move the flag to either a fact or a profile-level parameter, e.g. profile::base::numa_networking
[14:14:30] as far as I can tell, in the current case the numa flag and the numa_networking global are not practically used in our puppet code base, so we are at a point where we can pretty much start afresh
[14:14:59] jbond42: we can _always_ make a fresh start. just need to nuke the gerrit repo. ;)
[14:15:50] yeah, maybe a profile param
[14:15:52] another consideration is that if systemd is the only thing that needs to be aware of numa, then I would just add the functionality there; however, I have a feeling that more than systemd may need to be aware of this
[14:16:22] it's not really fact-like, as I don't think we'd ever apply something like numa_networking broadly across random $clusters and such. it's always going to be a special case for cases that are well-tested and seem to want it
[14:16:25] (it would still need to pass through some profile param, but the implementation could be in the systemd module instead of a dedicated numa module)
[14:16:39] well, interface-rps also needs to be aware of it
[14:16:58] did you see my CR for interface-rps?
[14:17:00] I think interface-rps is numa_networking by default
[14:17:17] jbond42: I did, but I haven't had a chance to dive into it deeply yet. Thanks for working on it though!
[14:17:49] ack, no probs, the intention was to make that script standalone (i.e. it doesn't need to care about the current numa puppet config)
[14:18:00] right
[14:18:20] the trick is, while they have opposite defaults and everything looks disconnected
[14:18:52] technically there's a connection between interface-rps's default-off "numa_filter" setting and the $numa_networking global (if it were doing what it's supposed to be doing, for some daemon on some interface-rps host)
[14:19:51] err sorry, I guess interface-rps's numa_filter is default-on
[14:20:01] while numa_networking (what's left of it!) is or was default-off
[14:20:55] but we can always fix all that up later when re-implementing it
[14:24:15] ack, I'll try to get some basic scaffolding for a more generalised implementation we can bash on
[14:26:06] I'm reading the rps patch more now, review commentary incoming (and sorry, any confusion is my fault, it's not a very well-factored script in some ways!)
[14:26:45] ack thanks :)
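A minimal sketch of the profile-level parameter floated above; every name here (the profile class, the lookup key, the numa class interface) is hypothetical, not the real modules:

    # Hypothetical sketch only: numa_networking exposed as a profile-level
    # parameter instead of a realm-wide global. All names are made up.
    class profile::base::numa (
        # default off: hosts/clusters opt in explicitly via Hiera, matching
        # the "special case for well-tested cases" intent described above
        Boolean $numa_networking = lookup('profile::base::numa::networking', { 'default_value' => false }),
    ) {
        # the mechanics (systemd fragments, interface-rps wiring) stay in the
        # module; the profile only exposes the knob
        class { 'numa':
            networking => $numa_networking,
        }
    }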
[17:59:28] cdanis: really great pres, thank you so much!
[17:59:38] any idea why these two graphs are different?
[18:00:09] https://w.wiki/yQW
[18:00:19] the left one is just a larger time range than the right
[18:00:25] but otherwise it is the same query
[18:01:03] also klausman ^ :)
[18:01:04] ottomata: you mean aside from the Y axes?
[18:01:37] yes, the right one I just selected from the left by dragging over one of the spikes
[18:01:42] so it shows about a day
[18:03:58] ottomata: so, the huge spike in the right-hand graph lasts only two data points
[18:04:01] it looks like the underlying data is pretty spiky, so computing a derivative on it is going to be inherently pretty wild -- it'll vary a lot depending on which data points you happen to pull
[18:04:06] I think it's simply not being evaluated at those instances on the left
[18:04:17] s/instances/instants/
[18:04:38] oh, because the range is too zoomed out?
[18:04:39] https://i.imgur.com/7plQszj.png
[18:04:47] yeah, the scrape interval is 30 seconds
[18:04:52] ahhhh ok
[18:04:55] so it's 1 or 2 data points that are responsible for that spike
[18:05:01] context:
[18:05:02] https://phabricator.wikimedia.org/T273702
[18:05:12] I only want to alert if lag is increasing
[18:05:29] e.g.
[18:05:29] https://grafana-rw.wikimedia.org/d/000000027/kafka?orgId=1&from=1612232643622&to=1612319043622&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All
[18:05:31] this is normal
[18:05:34] big spike then drained
[18:14:02] all these graphing systems eventually run into aliasing issues as well
[18:15:30] I suspect you want more than 2 minutes in your deriv() as well -- that's a window of only 3 or 4 data points, depending on evaluation time
[18:15:43] but yes, aliasing issues depending on zoom level are common :/
[18:29:09] cdanis: yeah, but the data wasn't quite right for what I'm doing, which is really alerting, not graphing
[18:29:26] I think as an alert it is good, because I can retry a few times before actually alerting
[18:29:30] yeah, for sure
[18:29:41] I would still use a longer deriv in the alert -- maybe 5 minutes? and look for it remaining true over 10 minutes?
[18:29:46] if I make the window wider, a jump from 0 to a billion looks like a gradual increase
[18:29:51] right
[18:30:12] you might need a composite thing that checks for an absolute value of at least X and a positive deriv, or something
[18:30:17] it's tricky :)
[18:30:34] yeah, but the abs value threshold is hard, as it depends on what is happening
[18:30:45] yeah :)
[18:30:49] depends on the throughput of the topic that is rebalancing
[18:31:10] really, when rebalancing like this, it will immediately jump to the total number of messages in a partition
[18:31:16] most partitions have retention of 7 days
[18:31:19] some of 31 days
[18:31:33] some partitions have ~1000 messages per second
[18:32:02] anyway, I think that the deriv with the 2m for an alert with a bunch of retries is what I want
[18:32:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/662005/1/modules/profile/manifests/kafka/broker/monitoring.pp
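To make the shape of that alert concrete, a sketch written as a Prometheus alerting rule; the metric name (kafka_burrow_partition_lag) and the thresholds are illustrative guesses, and the linked monitoring.pp patch is the actual implementation, which may well differ (for example, by using Icinga check retries rather than a for: clause).

    groups:
      - name: kafka_consumer_lag
        rules:
          - alert: KafkaConsumerLagIncreasing
            # short deriv() window, so a jump from 0 to a huge value still
            # shows up as a steep slope rather than a gradual increase;
            # metric name is illustrative only
            expr: deriv(kafka_burrow_partition_lag[2m]) > 0
            # "retry a few times before actually alerting": only fire if the
            # lag has kept increasing across several evaluations
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Consumer lag on {{ $labels.topic }} has been increasing for 10m"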