[06:32:59] In 30 minutes we'll failover s4 (commons) master
[06:34:51] ooooh, exciting! or hopefully not :-)
[06:44:06] Can we downtime registry2002's systemd alert? It's been flapping for a few days
[07:11:49] <_joe_> or well, trying to resolve it
[07:17:22] you are resolved to resolve it
[11:05:14] volans: heyp :), I got a first poc with spicerack here https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/658357, though the lint errors seem to be a known pylint issue, can you give the patch a look when you have time?
[11:06:32] dcaro: hey, I had already started yesterday to do a first pass, I'll finish it today for sure
[11:06:49] thanks!
[11:10:35] I've seen `lookup('cluster')` a few places in puppet (profile::cassandra::single_instance in this case https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/cassandra/single_instance.pp#3) - is using such a simple key value a historical artefact?
[11:10:44] or more importantly should it be gotten rid of?
[11:19:46] hnowlan: see https://phabricator.wikimedia.org/T179395 for some context
[11:22:00] moritzm: ah, thanks!
[11:22:21] in the context of cassandra it's probably better to move it to some profile::cassandra::foo variable instead,
[11:22:41] from a quick glance it only seems to designate the cluster name in the cassandra context
[11:22:49] I also see it's checked in profile::base against wikimedia_clusters
[11:23:31] yeah, other use cases have crept in over time :-)
[14:58:48] hashar: can jenkins do nightly builds?
[15:00:37] kormat: git review | at 00:00 :-P
[15:00:50] ha ;)
[15:01:57] kormat: are you wanting to just check that HEAD builds, or are you wanting to produce artifacts in some automated way?
[15:02:04] cdanis: the former
[15:02:10] in theory the post-merge build does that already
[15:02:30] today i sent a CR to a repo that hadn't been touched in a while, and it failed because external dependencies had been updated
[15:02:59] which turned into a couple of hours of lost time trying to track down what was wrong
[15:03:12] i don't care about artifacts
[15:04:21] (https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/658592 for more details if you're morbidly curious)
[15:04:47] godog: o/
[15:05:40] ottomata: yo!
[15:08:38] thanks for your reply! maybe you can give me specific
[15:08:40] advice
[15:09:12] i have a metric in prom and i want to trigger alerts based on dynamic label values
[15:09:14] e.g.
[15:09:41] i want individual alerts for each distinct label value, when the metric for that label goes too high
[15:09:43] specifically
[15:09:52] i want alerts for validation errors per stream
[15:10:14] the label values aren't known ahead of time, or rather, they are, but they are in mw-config / mw stream config api
[15:10:26] can I do that?
[15:12:12] ottomata: yes, you "can" even with check_prometheus / icinga ATM, you'll get the labels in the alert description, but you wouldn't be able to e.g. silence only one label
[15:12:30] (if we were using alertmanager, this would all work nicely. :)
[15:12:42] hm
[15:13:17] yeah the other option is to try the alertmanager stack, namely write alerting rules like I mentioned in the task
[15:13:18] godog: part of the trouble is that different streams probably should alert on different thresholds
[15:13:20] kormat: we are :)
[15:13:26] godog: !!!
[15:13:36] godog: i'm all for that, but should I try with grafana or prometheus?
[15:13:53] it looks like maybe prometheus is more powerful? also then they are declared in code (puppet) rather than in a UI?
[15:14:03] (also my dashboard has template vars)
[15:14:32] ottomata: yeah it'd be an alerting rule, easier for sure
[15:14:45] godog: is it ready for 'production' use? did i miss an announcement? :excite:
[15:15:24] kormat: haha! no you didn't miss an announcement, the stack is up and we're moving some use cases this quarter, e.g. performance team alerts
[15:15:40] 🎉 😅
[15:15:47] kormat: there's still some missing bits though before e.g. starting to move over icinga alerts, but getting there
[15:17:03] ottomata: I'm ok to assist with setting that up, depending on how urgent that is on your side to have the alerts
[15:17:36] godog: not urgent at all
[15:17:40] godog: ok, cool :)
[15:18:50] ottomata: then yeah alerting rules are definitely the way to go
[15:19:18] ok cool, will look in a few mins, struggling with a grafana template variable substitution atm
[15:20:16] godog: on a related note, https://github.com/prometheus/alertmanager/issues/1682 finally got fixed
[15:22:03] godog: is it possible to transform label values with something like a regex replace in a prom query?
[15:22:05] kormat: nice! thank you that's useful
[15:22:08] i'm looking at label_replace
[15:22:11] https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace
[15:22:15] but i don't think it will do what I need
[15:22:18] i want to go from
[15:22:31] mediawiki.client.session_tick -> mediawiki_client_session_tick
[15:22:43] (kafka normalizes the topics and removes .)
[15:22:55] so I can't use a template var that has the . in it to match the kafka topics
[15:24:39] ottomata: you _could_ do this at scrape time
[15:24:53] oh in statsd exporter config?
[15:25:14] i mean, the proper values have .
[15:25:16] the streams have .
[15:25:18] the topics have .
[15:25:36] rdkafka (and kafka too) itself is removing the . as far as i can tell
[15:25:43] in the metrics
[15:25:55] yeah I don't think label_replace will do global subst, which seems to be what you are asking
[15:25:55] i can't correctly map from the metrics topics with _
[15:25:59] yea
[15:26:02] ottomata: you can use `relabel_configs` in the prometheus config to modify label names and/or values
[15:26:24] right, but i don't have the proper mapping at that point to map back from normalized topics with _ to the original topic name
[15:26:31] since topic names might also have _
[15:26:33] so e.g.
[15:26:41] mediawiki.session_tick -> mediawiki_session_tick
[15:27:23] alternatively, if i could keep grafana from escaping the '.' in the prometheus queries it sends...it would just work
[15:27:30] because . would end up regex matching the _ :p
[15:29:24] ottomata: can you give an example prometheus metric?
[15:30:09] working on this dash
[15:30:10] https://grafana-rw.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m
[15:30:15] I've got a template var, $stream
[15:30:34] the dash is currently saved with the problematic example
[15:31:01] this panel
[15:31:02] https://grafana-rw.wikimedia.org/d/ePFPOkqiz/eventgate?editPanel=43&orgId=1&refresh=1m
[15:31:11] is the one i want populated with the topics for the selected streams
[15:31:38] i've currently got another template var, $kafka_topic, which I'm trying to populate based on the value of streams
[15:32:09] but, it isn't working because, when the prom query is issued for the $kafka_topic:
[15:32:10] eventgate_rdkafka_producer_topic_partition_txbytes{service='$service',topic=~'$stream'}
[15:32:11] $stream
[15:32:13] is sent like
[15:32:40] %3D~%22eventlogging_SearchSatisfaction%7Cmediawiki%5C%5C.client%5C%5C.session_tick%22%7D%5B5m%5D
[15:33:03] well
[15:33:03] %22eventlogging_SearchSatisfaction%7Cmediawiki%5C%5C.client%5C%5C.session_tick%22
[15:33:07] which url decodes to
[15:33:10] "eventlogging_SearchSatisfaction|mediawiki\\.client\\.session_tick"
[15:33:19] mediawiki\\.client\\.session_tick
[15:33:44] does not match the value of the topic label in eventgate_rdkafka_producer_topic_partition_txmsgs
[15:33:54] there, the topic values look like
[15:34:00] eqiad_mediawiki_client_session_tick
[15:35:20] in summary:
[15:35:33] if $stream is mediawiki.client.session_tick
[15:35:36] the corresponding topics are
[15:35:49] codfw.mediawiki.client.session_tick and eqiad.mediawiki.client.session_tick
[15:35:56] but kafka metrics replace '.
[15:36:01] '.' with '_'
[15:36:26] i'm about to give up, and just not templatize the topic panels
[15:36:47] ottomata: the way you're doing label extraction is a bit weird. normally you'd use `label_values()`
[15:37:02] would be happy to change, you mean in the template vars?
[15:37:07] ah ya
[15:37:11] instead of doing a regex to pull them out?
[15:37:26] yeah
[15:40:38] ottomata: there are some examples here: https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?editview=templating&orgId=1
[15:41:27] got it, changed :)
[15:41:47] it might not fix anything, but it will at least make it simpler :)
[15:42:25] yeah
[15:43:26] this is also my first time modifying a dashboard since thanos
[15:43:27] VERY NICE!
[15:43:33] much nicer to be able to see global metrics!
[15:43:54] thanos++
[15:45:42] ok, another advanced feature q: is it possible to add links to a panel somehow?
[15:45:44] e.g.
[15:45:46] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=1611672255836&to=1611675855836&viewPanel=75
[15:46:04] it'd be cool if I could link from each value shown there to a kibana dash
[15:46:43] ottomata: you can put a markdown block
[15:46:51] dynamically populated? :)
[15:46:59] you can write a grafana plugin ;)
[15:47:01] haha
[15:47:10] actually then i could use mw stream config api
[15:49:01] godog: is there an example of a prometheus alert in puppet? am looking, haven't found yet
[15:52:34] ottomata: not yet no
[15:52:42] ah
[15:53:11] godog: would this then be deployed using the prometheus::rule define?
[15:53:13] ottomata: actually I stand corrected, for wmcs there are modules/profile/files/wmcs/prometheus/metricsinfra/alerts_projects.yml
[15:53:29] ottomata: try using `${templateVar:raw}` to stop grafana escaping things
[15:53:38] OH!~
[15:53:39] hm
[15:55:20] ottomata: but yes tl;dr now the way would be via prometheus::rule, in the future there will be a separate repo as well for self-service rules, since you have puppet privileges then prometheus::rule with your alerts in it would do the trick
[15:56:56] godog: declared in...profile::prometheus::alerts?
[15:57:15] and...can I make the rule use thanos?
[16:00:41] ottomata: no thanos for alerts no, probably the easiest is to add the rule to e.g. ::profile::prometheus::ops
[16:01:08] or maybe another profile entirely, to be added to the prometheus role, not sure
[16:05:29] hm ok, metrics coming from prometheus::k8s
[16:15:12] rats kormat :raw ALMOST works
[16:15:17] it fixes the . escaping problem
[16:15:23] but when selecting multiple streams
[16:15:32] it won't make a correct regex
[16:15:37] e.g. with | instead of,
[16:15:39] ,
[16:16:38] ottomata: there's also https://grafana.com/docs/grafana/latest/variables/advanced-variable-format-options/#regex - maybe that makes it better (or worse)
[16:16:56] i have some vague memory of finding the Right option at $LASTJOB for something like this
[16:23:32] hm looks like i'd have to apply both of those things!
[16:23:34] like
[16:23:36] :regex:raw
[16:23:38] or something
[21:23:48] What's the correct way to run homer after running the decom script?
[21:24:25] ryankemper: the decom script should run it for you
[21:24:59] volans: I did notice the decom script running homer, but I'm a bit confused why the decommission checklist includes the homer as a separate step
[21:25:06] See https://phabricator.wikimedia.org/T272444 for the checklist
[21:25:11] > - run homer on cumin host to update switch stack
[21:25:15] that was added recently
[21:25:26] by arzhel, so I guess that the template has not yet been updated
[21:25:36] is a copy/paste from wikitech or a phab template?
[21:25:51] volans: from the phab template
[21:26:07] I did just notice in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen it mentions homer as something the decom script does
[21:26:11] then you should probably ping Rob in dcops to update it
[21:26:13] So the phab template is just a bit out of date presumably?
[21:26:20] Cool, thanks
[21:26:30] I think I don't have the power to change it
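
Note on the Grafana escaping thread (15:22 to 16:23): a minimal Python sketch of why the escaped `$stream` value matched no topic labels, and why an unescaped, `|`-joined value would. It only models the regex behaviour discussed in the log; it does not call Grafana or Prometheus. The helper names and the `.*` prefix for the datacenter part of the topic are assumptions made for illustration, and `re.fullmatch` stands in for Prometheus `=~`, which is anchored at both ends.

```python
import re

# Topic label values as they appear in the metrics in the log
# (dots already replaced by underscores, datacenter prefix included).
TOPIC_LABELS = [
    "eqiad_mediawiki_client_session_tick",
    "codfw_mediawiki_client_session_tick",
    "eqiad_eventlogging_SearchSatisfaction",
]


def matching_topics(pattern, labels=TOPIC_LABELS):
    """Return labels fully matched by pattern (=~ is anchored at both ends)."""
    return [label for label in labels if re.fullmatch(pattern, label)]


def escaped_pattern(streams):
    """Hypothetical helper: dots escaped, values joined with '|' (what was sent)."""
    return "|".join(s.replace(".", r"\.") for s in streams)


def raw_pattern(streams, prefix=".*"):
    """Hypothetical helper: dots left alone so '.' can match '_'.

    The '.*' prefix for the datacenter part is an assumption, not from the log.
    """
    return prefix + "(" + "|".join(streams) + ")"


streams = ["eventlogging_SearchSatisfaction", "mediawiki.client.session_tick"]

# Escaped dots require a literal '.', which never occurs in the topic labels.
print(matching_topics(escaped_pattern(streams)))

# Unescaped dots act as single-character wildcards and match the '_'.
print(matching_topics(raw_pattern(streams)))

# Joining with ',' instead of '|' yields one long pattern that matches nothing.
print(matching_topics(".*(" + ",".join(streams) + ")"))
```

The trade-off of the unescaped form is that `.` matches any character, not only `_`, so it could also select unrelated topics whose names differ from the stream name in just those positions.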