[08:41:41] indeed I'll adjust the threshold
[09:17:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:27:35] RESOLVED: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:31:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:36:35] RESOLVED: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:36:39] should be fixed shortly by https://gerrit.wikimedia.org/r/c/operations/alerts/+/1164129
[13:23:53] thanks go.dog!
[14:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:05:47] sure np
[14:06:35] Within mediawiki, have you run into maintenance scripts that need to flush their stats output mid-run to not blow out memory?
I'm looking into a maintenance script that visits most articles on a wiki, and it OOM's due to, iiuc, BaseMetric caching samples (repro: https://phabricator.wikimedia.org/P78712)
[14:07:07] i found a related ticket from statsd, T181385, but it doesn't really apply to prometheus
[14:07:10] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385
[14:07:44] we might not have noticed these problems before because we used to run maint scripts on bare metal with 60GB+ memory available, now they run in k8s with 2G
[14:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:17:10] ebernhardson: I think jobrunners have some logic in them to flush out stats in the background?
[14:17:47] I'm going to innocently ping mszabo since we talked about similar concerns for tracing, and he has forgotten more about MW than I've ever learned
[14:18:07] thanks! I'll poke over the job runner bits, i wasn't finding a good way to flush
[14:19:41] ebernhardson: some Maintenance methods call emitBufferedStats() as a side-effect
[14:19:59] output(), commitTransactionRound(), waitForReplication()
[14:20:05] does your script call any of these?
[14:20:35] mszabo: hmm, we don't do any sql writes so no commit/wait for replication. It does output though
[14:20:46] can you point me to your script?
[14:20:55] sure, sec
[14:21:38] mszabo: this is the main loop of the script: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/maintenance/UpdateSuggesterIndex.php#L515
[14:25:25] ebernhardson: can you clarify how you used your reproducer?
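[Editor's note: the memory-growth mechanism discussed above can be sketched with a small, self-contained Python model. The class and method names below are hypothetical stand-ins, not the actual MediaWiki StatsFactory/BaseMetric API; the point is only that a metric which buffers one sample per increment grows without bound unless something periodically emits and drains the buffer.]

```python
# Hypothetical model: each increment buffers a sample; only an explicit
# flush drains the buffer, so a long-running loop must flush periodically
# to bound memory.


class Counter:
    def __init__(self, name):
        self.name = name
        self.samples = []          # grows by one entry per increment

    def increment(self):
        self.samples.append((self.name, 1))


class StatsFactory:
    def __init__(self):
        self.cache = {}

    def get_counter(self, name):
        return self.cache.setdefault(name, Counter(name))

    def flush(self):
        # Render and discard all buffered samples.
        for metric in self.cache.values():
            metric.samples.clear()


factory = StatsFactory()
counter = factory.get_counter("pages_processed")

# A loop over millions of pages would buffer millions of samples;
# flushing every 30k iterations keeps the buffer bounded.
for i in range(100_000):
    counter.increment()
    if i % 30_000 == 0:
        factory.flush()

print(len(counter.samples))
```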
[14:25:44] mszabo: mwscript-k8s --attach -- shell.php --wiki=hewikisource
[14:25:53] mszabo: you probably don't have to use hewikisource, that's just where i first saw the problem
[14:26:09] mszabo: once that shell opens you can paste the code into it
[14:26:26] maybe there was a copy-paste problem? the paste has a mix of php code and shell commands like "sudo $counter"
[14:26:48] mszabo: sudo is a special thing the shell provides, it allows you to access protected/private things
[14:27:04] ah neat, TIL
[14:27:33] right, so that's valid but it doesn't necessarily indicate that the emitBufferedStats() calls in the maint script aren't cleaning those
[14:29:20] mszabo: hmm, so calling StatsFactory::flush (which emitBufferedStats does) doesn't change the count when it OOM's :S
[14:30:05] mszabo: part of my theory is that in BaseMetric there is no code that ever takes anything out of $this->samples
[14:31:15] but if i `sudo $counter->baseMetric->samples = []` every 30k metrics then it doesn't OOM
[14:31:56] ebernhardson: hmm I think this can happen if something holds a reference to the metric object in e.g. a field of some class
[14:32:13] flush() clears StatsCache but that only clears samples if that was the only ref to the object
[14:32:51] hmm, my repro certainly doesn't allow that to happen, since it reuses a counter in the global scope. lemme try and adjust
[14:34:41] https://www.irccloud.com/pastebin/79CYnpVD/
[14:34:59] mszabo: indeed, that avoids the OOM!
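[Editor's note: the reference-retention hypothesis at 14:32:13 can be illustrated with another hypothetical Python sketch (again, not the real MediaWiki API). If flush only drops the factory's cache entries, a metric that is also held in a long-lived field survives the flush along with all of its buffered samples.]

```python
# Hypothetical sketch: flush() drops the factory cache, so a metric is
# actually freed only if the cache held its last reference.


class Counter:
    def __init__(self):
        self.samples = []

    def increment(self):
        self.samples.append(1)


class StatsFactory:
    def __init__(self):
        self.cache = {}

    def get_counter(self, name):
        return self.cache.setdefault(name, Counter())

    def flush(self):
        # Drop cached metrics; anything still referenced elsewhere
        # survives, samples and all.
        self.cache = {}


class PageStore:
    """A long-lived service that caches its metric in a member field."""

    def __init__(self, factory):
        self.counter = factory.get_counter("page_loads")


factory = StatsFactory()
store = PageStore(factory)

for _ in range(1000):
    store.counter.increment()

factory.flush()

# The cache is empty, but the samples are still alive via store.counter.
print(len(store.counter.samples))
print(len(factory.cache))
```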
I'll have to work this backwards to see if this is still reproducing what the full script does
[14:35:02] thanks
[14:35:02] ebernhardson: ^ seems to work as described at least - commenting out $m->flush() leaves $c2 with 1k samples
[14:35:16] for the specific example of the pagestore metric, it doesn't seem to cache it in a member var
[14:36:33] indeed, it looks like that shouldn't be it, so i may be chasing the wrong end of the OOM, will find out :)
[14:51:43] ebernhardson: an insidious way for this to happen would be for a closure or arrow fn to capture a metric variable
[15:41:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:30] hi folks. if I updated the name of a check in nagios_command, do I have to bump Icinga? reload clearly didn't work.
[15:49:35] sukhe@alert1002:~$ sudo /usr/sbin/icinga -v /etc/icinga/icinga.cfg
[15:49:44] gives me 166 errors, related to the change:
[15:49:48] Error: Service check command 'check_ssl_ats' specified in service 'HAProxy HTTPS wikiworkshop.org ECDSA' for host 'cp6016' (file '/etc/icinga/objects/puppet_services.cfg', line 131927) not defined anywhere!
[15:50:00] this has now been renamed to check_ssl_cdn
[15:53:55] ok nvm, I see what's happening here. running agent on the affected hosts as well
[15:55:43] this is basically saying that this
[15:57:24] ok so the resolution was ^: enabling agent on the affected host. so what this was saying was that the check is not defined _there_
[16:13:45] ==> ok fwiw, the warning has now cleared up so there is nothing pending from my deploy.
[16:13:49] Total Warnings: 0
[16:13:49] Total Errors: 0
[16:27:32] thanks for the heads up !
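[Editor's note: the "insidious" case at 14:51:43 — a closure capturing a metric variable — can be demonstrated in a few lines of self-contained Python (hypothetical `Counter` class). The closure keeps the metric object, and therefore its buffered samples, alive even after every other reference is gone; only dropping the closure itself frees it.]

```python
import gc
import weakref


class Counter:
    def __init__(self):
        self.samples = []

    def increment(self):
        self.samples.append(1)


def make_handler():
    # Local metric; once this function returns, the closure below is
    # the only thing keeping it alive.
    counter = Counter()
    handler = lambda: counter.increment()
    return handler, weakref.ref(counter)


handler, ref = make_handler()
handler()
gc.collect()
print(ref() is not None)   # the closure still pins the counter

del handler
gc.collect()
print(ref() is None)       # last reference gone; samples freed too
```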
[16:36:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed