[11:38:58] hey folks, for the temporary accounts initiative, TSP is looking to expose certain metrics from MW that would need to be calculated periodically, rather than on the fly during web requests (T375506 Temp accounts Grafana Dashboard: Number of IP Reveal users, T375508 Temp accounts Grafana Dashboard: Active IP Reveal users).
[11:38:58] T375506: Temp accounts Grafana Dashboard: Number of IP Reveal users - https://phabricator.wikimedia.org/T375506
[11:38:58] T375508: Temp accounts Grafana Dashboard: Active IP Reveal users - https://phabricator.wikimedia.org/T375508
[11:39:00] Historically, such use cases were solved by using maintenance scripts triggered via a cron (e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MediaModeration/+/refs/heads/master/maintenance/updateMetrics.php), which works fine with statsd's push model. With a sunset of statsd/graphite looming, I was wondering what our preferred approach would be.
[11:39:09] I see two options:
[11:39:16] ingest such metrics via pushgateway - I don't know if we have PG set up though, and we may need to adapt the MW StatsLib interface to support that
[11:39:23] or create a "MediaWiki exporter" endpoint that'd then allow these metrics to be scraped by prometheus in a regular fashion
[11:39:32] what would y'all prefer?
[12:53:37] mszabo: we do have PG set up. See https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway)
[12:54:24] However, there is a statsd-exporter deployed across all major deployments
[12:55:05] right, but wouldn't an ephemeral maintenance script be unable to push metrics to an exporter?
[12:55:16] MediaWiki deployments that is. So that is a well-trodden path already. I'd defer to o11y (for when they are around) for the final say.
[12:55:44] ephemeral maint script being mwscript-k8s?
[12:55:55] cause I think that can be fixed in that case
[12:56:20] it's now that we are working on the finishing touches on mwscript-k8s, so couldn't have better timing
[12:56:23] AIUI yeah - or whatever we will switch our current cron jobs to, whose schedules are currently in puppet
[12:57:05] ah, it's the systemd-timer ones. It's one of the last things to iron out this Q
[12:57:19] AFAIK since mwscript-k8s creates ephemeral kubernetes objects, we'd still need some kind of push model to ingest metrics that the scripts may want to emit
[12:57:23] yeah, I am pretty sure we'll need to run a statsd-exporter for those. Too many to switch to PG I fear
[12:57:30] unless we make an exporter available as a DS or sth
[12:58:53] let me file a task. I am not sure if we'll split one off from timer based ones at the namespace level
[12:59:03] but otherwise the approach is going to be pretty similar
[12:59:25] thanks!
[12:59:47] so just to confirm - it's fine to keep using statslib in such scripts for now and assume there will be an exporter ready once these scripts move to k8s?
[13:00:45] I'd rather o11y gave you the final confirmation on that one.
[13:00:57] but as far as I am concerned up to now, yes
[13:01:10] and I reserve the right to be wrong :P
[13:03:51] OK thanks, I'll wait for them to weigh in :)
[13:08:34] https://phabricator.wikimedia.org/T376714 fwiw
[13:41:48] FIRING: PuppetFailure: Puppet has failed on alert2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:42:49] ^ i'll take a look.
[13:46:03] ehi denisse
[13:46:28] I think it's a problem related to a patchset i merged
[13:48:24] I still can't figure it out
[13:48:33] tappof: ACK, let me know if I can help with that.
[13:49:08] ok denisse, thank you
[13:49:09] Do you have the link to your patch?
[13:49:17] I can take a look.
[13:49:32] sure, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1069117
[13:50:12] the problem seems to be related to check_ripe_atlas.cfg which I want to remove
[13:51:48] FIRING: [2x] PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:58:24] tappof: I wonder if the config_dir needs to be deleted from Puppet as well in order for the state absent to work.
[14:00:43] denisse: uh ... I'll give it a try! thank you
[14:19:37] denisse: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078682
[14:26:35] But I think I have to remove the resource and manually add another one to clean up the file... that's because the nagios_common::check_command injects defaults for many parameters...
[14:34:29] tappof: I'm running PCC on your change.
[14:35:13] yes denisse I've just added the hosts to the commit
[14:36:58] These are the results: https://puppet-compiler.wmflabs.org/output/1078682/4248/
[14:49:45] thank you denisse, it worked
[14:51:32] tappof: nice!
[14:51:48] FIRING: [2x] PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:38:41] FIRING: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:19:07] FIRING: [2x] ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:19:08] RESOLVED: [2x] ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:34:08] FIRING: [2x] ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:49:07] RESOLVED: [2x] ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn