[08:41:41] indeed I'll adjust the threshold
[09:17:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:27:35] RESOLVED: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:31:35] FIRING: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:36:35] RESOLVED: [2x] ThanosSidecarDropQueries: Thanos Sidecar is dropping large queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarDropQueries
[09:36:39] should be fixed shortly by https://gerrit.wikimedia.org/r/c/operations/alerts/+/1164129
[13:23:53] thanks go.dog!
[14:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:05:47] sure np
[14:06:35] Within mediawiki, have you run into maintenance scripts that need to flush their stats output mid-run to not blow out memory?
I'm looking into a maintenance script that visits most articles on a wiki, and it OOM's due to, iiuc, BaseMetric caching samples (repro: https://phabricator.wikimedia.org/P78712)
[14:07:07] i found a related ticket from statsd, T181385, but it doesn't really apply to prometheus
[14:07:10] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385
[14:07:44] we might not have noticed these problems before because we used to run maint scripts on bare metal with 60GB+ memory available, now they run in k8s with 2G
[14:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:17:10] ebernhardson: I think jobrunners have some logic in them to flush out stats in the background?
[14:17:47] I'm going to innocently ping mszabo since we talked about similar concerns for tracing, and he has forgotten more about MW than I've ever learned
[14:18:07] thanks! I'll poke over the job runner bits, i wasn't finding a good way to flush
[14:19:41] ebernhardson: some Maintenance methods call emitBufferedStats() as a side-effect
[14:19:59] output(), commitTransactionRound(), waitForReplication()
[14:20:05] does your script call any of these?
[14:20:35] mszabo: hmm, we don't do any sql writes so no commit/wait for replication. It does output though
[14:20:46] can you point me to your script?
[14:20:55] sure, sec
[14:21:38] mszabo: this is the main loop of the script: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/maintenance/UpdateSuggesterIndex.php#L515
[14:25:25] ebernhardson: can you clarify how you used your reproducer?
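[Editor's note: the memory-growth mechanism discussed above can be sketched with a small, self-contained Python model. The class and method names below are hypothetical stand-ins, not the actual MediaWiki StatsFactory/BaseMetric API; the point is only that a metric which buffers one sample per increment grows without bound unless something periodically emits and drains the buffer.]

```python
# Hypothetical model: each increment buffers a sample; only an explicit
# flush drains the buffer, so a long-running loop must flush periodically
# to bound memory.


class Counter:
    def __init__(self, name):
        self.name = name
        self.samples = []          # grows by one entry per increment

    def increment(self):
        self.samples.append((self.name, 1))


class StatsFactory:
    def __init__(self):
        self.cache = {}

    def get_counter(self, name):
        return self.cache.setdefault(name, Counter(name))

    def flush(self):
        # Render and discard all buffered samples.
        for metric in self.cache.values():
            metric.samples.clear()


factory = StatsFactory()
counter = factory.get_counter("pages_processed")

# A loop over millions of pages would buffer millions of samples;
# flushing every 30k iterations keeps the buffer bounded.
for i in range(100_000):
    counter.increment()
    if i % 30_000 == 0:
        factory.flush()

print(len(counter.samples))
```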
[14:25:44] mszabo: mwscript-k8s --attach -- shell.php --wiki=hewikisource
[14:25:53] mszabo: you probably don't have to use hewikisource, that's just where i first saw the problem
[14:26:09] mszabo: once that shell opens you can paste the code into it
[14:26:26] maybe there was a copy-paste problem? the paste has a mix of php code and shell commands like "sudo $counter"
[14:26:48] mszabo: sudo is a special thing the shell provides, it allows you to access protected/private things
[14:27:04] ah neat, TIL
[14:27:33] right, so that's valid but it doesn't necessarily indicate that the emitBufferedStats() calls in the maint script aren't cleaning those
[14:29:20] mszabo: hmm, so calling StatsFactory::flush (which emitBufferedStats does) doesn't change the count when it OOM's :S
[14:30:05] mszabo: part of my theory is that in BaseMetric there is no code that ever takes anything out of $this->samples
[14:31:15] but if i `sudo $counter->baseMetric->samples = []` every 30k metrics then it doesn't OOM
[14:31:56] ebernhardson: hmm I think this can happen if something holds a reference to the metric object in e.g. a field of some class
[14:32:13] flush() clears StatsCache but that only clears samples if that was the only ref to the object
[14:32:51] hmm, my repro certainly doesn't allow that to happen, since it reuses a counter in the global scope. lemme try and adjust
[14:34:41] https://www.irccloud.com/pastebin/79CYnpVD/
[14:34:59] mszabo: indeed, that avoids the OOM!
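[Editor's note: the reference-retention hypothesis at 14:32:13 can be illustrated with another hypothetical Python sketch (again, not the real MediaWiki API). If flush only drops the factory's cache entries, a metric that is also held in a long-lived field survives the flush along with all of its buffered samples.]

```python
# Hypothetical sketch: flush() drops the factory cache, so a metric is
# actually freed only if the cache held its last reference.


class Counter:
    def __init__(self):
        self.samples = []

    def increment(self):
        self.samples.append(1)


class StatsFactory:
    def __init__(self):
        self.cache = {}

    def get_counter(self, name):
        return self.cache.setdefault(name, Counter())

    def flush(self):
        # Drop cached metrics; anything still referenced elsewhere
        # survives, samples and all.
        self.cache = {}


class PageStore:
    """A long-lived service that caches its metric in a member field."""

    def __init__(self, factory):
        self.counter = factory.get_counter("page_loads")


factory = StatsFactory()
store = PageStore(factory)

for _ in range(1000):
    store.counter.increment()

factory.flush()

# The cache is empty, but the samples are still alive via store.counter.
print(len(store.counter.samples))
print(len(factory.cache))
```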
I'll have to work this backwards to see if this is still reproducing what the full script does
[14:35:02] thanks
[14:35:02] ebernhardson: ^ seems to work as described at least - commenting out $m->flush() leaves $c2 with 1k samples
[14:35:16] for the specific example of the pagestore metric, it doesn't seem to cache it in a member var
[14:36:33] indeed, it looks like that shouldn't be it, so i may be chasing the wrong end of the OOM, will find out :)
[14:51:43] ebernhardson: an insidious way for this to happen would be for a closure or arrow fn to capture a metric variable
[15:41:25] FIRING: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:30] hi folks. if I updated the name of a check in nagios_command, do I have to bump Icinga? reload clearly didn't work.
[15:49:35] sukhe@alert1002:~$ sudo /usr/sbin/icinga -v /etc/icinga/icinga.cfg
[15:49:44] gives me 166 errors, related to the change:
[15:49:48] Error: Service check command 'check_ssl_ats' specified in service 'HAProxy HTTPS wikiworkshop.org ECDSA' for host 'cp6016' (file '/etc/icinga/objects/puppet_services.cfg', line 131927) not defined anywhere!
[15:50:00] this has now been renamed to check_ssl_cdn
[15:53:55] ok nvm, I see what's happening here. running agent on the affected hosts as well
[15:55:43] this is basically saying that this
[15:57:24] ok so the resolution was ^: enabling agent on the affected host. so what this was saying was that the check is not defined _there_
[16:13:45] ==> ok fwiw, the warning has now cleared up so there is nothing pending from my deploy.
[16:13:49] Total Warnings: 0
[16:13:49] Total Errors: 0
[16:27:32] thanks for the heads up !
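[Editor's note: the "insidious" case at 14:51:43 — a closure capturing a metric variable — can be demonstrated in a few lines of self-contained Python (hypothetical `Counter` class). The closure keeps the metric object, and therefore its buffered samples, alive even after every other reference is gone; only dropping the closure itself frees it.]

```python
import gc
import weakref


class Counter:
    def __init__(self):
        self.samples = []

    def increment(self):
        self.samples.append(1)


def make_handler():
    # Local metric; once this function returns, the closure below is
    # the only thing keeping it alive.
    counter = Counter()
    handler = lambda: counter.increment()
    return handler, weakref.ref(counter)


handler, ref = make_handler()
handler()
gc.collect()
print(ref() is not None)   # the closure still pins the counter

del handler
gc.collect()
print(ref() is None)       # last reference gone; samples freed too
```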
[16:36:25] RESOLVED: SystemdUnitFailed: sync-icinga-state.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed