[12:43:43] !log uploaded purged 0.23 to bullseye-wikimedia (apt.wm.o) - T334078
[12:43:44] vgutierrez: Not expecting to hear !log here
[12:43:44] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078
[12:43:52] and you're right :)
[13:37:31] i've got a statslib question, hoping this is the right place to ask it.
[13:37:47] q is: how do i test a new metric locally? i'd like something which just dumped the metric to a log somewhere so I could verify it was being generated correctly.
[13:37:58] i think i'd set up a statsd server locally at one point, but my new metrics don't have "backward-compatible" statsd names.
[13:38:09] $wgStatsdServer is documented, but no mention of prometheus in MainConfigSchema.php ?
[13:38:29] Lucas said he'd "used `nc -ukl 8125` before (listen on the statsd port, dump to stdout)"
[13:38:39] but i'm not calling ::copyToStatsdAt() for these, so I don't think they are going to show up on statsd
[13:46:32] cscott: Cole has the most context on this being the author of statslib and he's back next week, however in case it is helpful I did adjust tests with mocks for stats names https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/994185
[13:46:50] I'm sure there are other examples e.g. in core
[13:47:09] not sure if that answers your question
[13:50:44] reading UDPEmitter it does seem like maybe I could get them emitted on UDP and watch them that way
[13:51:41] but maybe a StatsFactory::newLogger() (to go with ::newNull()) or ::newTestFactory() would be helpful.
[13:53:15] eg your test could just be `$testFactory = StatsFactory::newTestFactory(); $this->setService('StatsFactory', $testFactory); ...; $this->assertSame([ 'metric1:foo:bar', 'metric2:foo:bar' ], $testFactory->dumpMetrics());`
[13:53:58] to your point cscott, the metrics will be emitted in dogstatsd format anyways on udp, so yeah you could also watch them live that way
[13:58:43] ah, that's suggested at https://www.mediawiki.org/wiki/Manual:Stats#Developers
[15:09:10] I'm getting this error on alerts.w.o '[GET /status] getStatus (status 503): {}',
[15:09:21] some alerts still show up, but I think that some others do not
[15:09:45] dcaro: expected, denisse is switching over alert host now
[15:09:53] ack, thanks :)
[15:10:13] sure np
[15:13:58] FIRING: [2x] ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[15:15:45] dcaro: It must be fixed now.
[15:17:27] denisse: thanks!
[15:17:41] yep, back to many alerts xd
[15:17:46] oh, silences got dropped?
[15:18:16] Ah, I'm not sure if they're stored by Karma on the other host, let me see.
[15:18:38] ah, they just expired as they did not see any alerts for that brief period
[15:18:40] might be good to keep the discussion in one channel
[15:18:52] RESOLVED: [2x] ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager.
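[editor's note] The "watch them live on udp" suggestion above can be sketched as a rough Python equivalent of the `nc -ukl 8125` trick. This is an illustration only: the parser handles the common dogstatsd `name:value|type|#tags` line shape and is not statslib's own code, and the self-send at the end exists just to make the example runnable standalone.

```python
import socket

def parse_dogstatsd(datagram: str) -> dict:
    """Split a dogstatsd line like 'name:1|c|#k1:v1,k2:v2' into its parts."""
    name_value, metric_type, *rest = datagram.split("|")
    name, value = name_value.split(":", 1)
    tags = rest[0].lstrip("#").split(",") if rest else []
    return {"name": name, "value": value, "type": metric_type, "tags": tags}

# Listen on an ephemeral port for the demo; to watch real MediaWiki traffic,
# bind ("127.0.0.1", 8125) instead (the conventional statsd port).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
port = sock.getsockname()[1]

# Self-send one sample metric (hypothetical name) so the example runs standalone.
sock.sendto(b"mediawiki.example_total:1|c|#wiki:enwiki", ("127.0.0.1", port))
data, _ = sock.recvfrom(65535)
print(parse_dogstatsd(data.decode()))
sock.close()
```

Dropping the self-send and binding port 8125 turns this into a live metric watcher, which is what the `nc -ukl 8125` one-liner does without the parsing.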
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[15:18:55] FIRING: [3x] SystemdUnitFailed: alertmanager-irc-relay.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:21:49] yeah I think the acks didn't get renewed in time by kthxbye
[15:22:37] That seems reasonable, I'll take note of that to see if there's a way to back them up before the next failover.
[15:23:12] sounds good
[15:48:11] denisse: just to confirm things went ok with your move to alert1002?
[15:48:26] we have some network maintenance starting in a few minutes just making sure there is nothing outstanding
[15:49:39] topranks: Everything went fine with the failover. Feel free to proceed with the maintenance. :)
[15:50:11] denisse: well done!!
[15:50:18] here's hoping our work goes as well :)
[15:50:57] topranks: Thank you! Good luck with it. 🍀
[15:51:35] congrats denisse :)
[15:52:18] one little thing. sirenbot just quit
[15:52:54] but there it is again :) and icinga looks good.
[15:54:12] mutante: Thank you! :)
[16:12:47] hmm.. sirenbot keeps quitting and then rejoining right away
[16:13:14] Let me take a look at it.
[16:13:28] thank you
[16:21:34] I think sirenbot needs to be voiced in the -operations channel, but I'm not sure if that's the reason why it quits and rejoins.
[16:26:56] does it run on alert* ?
[16:27:00] or in cloud?
[16:27:04] I always mix the bots up
[16:27:52] not getting voice does not seem like a reason to quit, imho
[16:29:29] and it does have the cloak.. so it also doesn't seem like an issue with nickserv.. hmm
[16:30:48] It's hosted on the alert hosts.
[16:33:55] FIRING: [2x] SystemdUnitFailed: corto.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:10] ^ Known, looking at it.
[16:37:33] took a look and realized it's the "vopsbot" service. I found this: "lvl=eror msg="Could not save the topic to the database""
[16:38:38] digging some more to find that database
[16:39:04] it's /srv/vopsbot/vopsbot.db
[16:39:33] the size of that db is 0 bytes
[16:39:44] but it is owned by vopsbot:vopsbot with 644
[16:40:05] comparing with alert1001
[16:40:27] same permissions there but it's 20K and not 0.. hmmm
[16:40:32] same
[16:40:46] Good find! That's odd. 🤔
[16:40:58] I wonder if we should copy that file over
[16:41:34] That may work, but we didn't have that issue when failing over to alert2002.
[16:41:35] but I don't see why it can't write to it.. maybe it has to initialize it somehow
[16:41:47] So I'm wondering what's the cause of the issue. 🤔
[16:42:37] On alert2002: -rw-r--r-- 1 vopsbot vopsbot 20480 Sep 18 15:03 /srv/vopsbot/vopsbot.db
[16:43:18] you think it was copied with rsync before?
[16:43:25] or just started from scratch on a new host
[16:43:34] It wasn't, we did the failover last week and we didn't copy it.
[16:43:40] It started from scratch on the new host.
[16:43:43] odd
[16:43:57] also checked it runs as the vopsbot user
[16:44:44] maybe we should start with the "IT crowd" debugging
[16:44:53] What's that?
[16:45:03] "turn it off and turn it on again" :p
[16:45:15] I restarted the bot to no avail.
[16:45:22] ah, gotcha
[16:46:30] also has the same sqlite versions..
[16:46:39] I deleted the file, and running Puppet again.
[16:48:31] that's an idea, yea
[16:49:00] Tho that wouldn't explain why the DB is empty, that's odd...
[16:49:05] though hard to imagine how an empty file would be corrupted
[16:49:19] I'm also looking at the Puppet manifests regarding the db...
[16:49:51] I also tested like "sudo -u vopsbot touch foo.db"
[16:49:58] to create a file in that dir as the user.. no problem
[16:50:17] Yeah, the permissions seem to be correct.
[16:51:57] denisse: see -operations
[16:52:01] and comment from volans
[16:52:16] maybe we should try just copying the file after all
[16:52:19] Yes, I've answered.
[16:52:57] I'll copy it to get the service working but it's something we didn't do last time, so I'm not sure what caused the issue.
[16:53:23] yea, I agree it's still weird, but:
[16:53:23] 16:51 error="no such table: topics"
[16:53:23] 16:52 yeah, the schema's not there
[16:53:23] 16:52 rsync should fix
[16:53:43] should I scp it ?
[16:57:29] I tried with rsync but I keep getting SSH permission errors.
[16:57:36] mutante: Yeah, let's try scp.
[16:57:46] that's expected unless you use rsync::quickdatacopy or so
[16:57:52] ok, doing it
[16:58:18] (it's because you can't forward ssh agents for security and firewalling too)
[16:58:30] Ah, that makes sense, but to use rsync::quickdatacopy I'd need to do that from Puppet, right?
[16:58:51] yea, you would need this in puppet to have a permanent way to just rsync files between them
[16:59:08] it does all the things.. setup rsyncd, open firewall, add to allowed_hosts etc
[16:59:26] then it would be rsync:// vs rsync over ssh
[17:00:39] a theory: puppet has a resource that should create the db from the schema [0] but there's a race, where the service can start before that happens, creating a new / empty file?
[17:00:39] [0] https://gerrit.wikimedia.org/g/operations/puppet/+/ac9c3ed546aa57822bc91aa52be468037c055ac6/modules/vopsbot/manifests/init.pp#96
[17:01:01] A race condition seems plausible.
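[editor's note] The race theory fits the symptoms, and it's easy to reproduce the failure mode outside Puppet: SQLite happily opens a zero-byte file, and queries then fail with exactly the "no such table: topics" error quoted above. A minimal sketch, with an invented column layout (not vopsbot's real schema):

```python
import os
import sqlite3
import tempfile

# Simulate the suspected race: the service opens the db path before the
# schema-creation step has run, leaving a zero-byte file behind.
path = os.path.join(tempfile.mkdtemp(), "vopsbot.db")
open(path, "wb").close()            # 0 bytes, like /srv/vopsbot/vopsbot.db
print(os.path.getsize(path))        # 0

conn = sqlite3.connect(path)        # connecting succeeds -- no error yet
try:
    conn.execute("SELECT * FROM topics")
except sqlite3.OperationalError as e:
    print(e)                        # no such table: topics

# The fix discussed in the channel: guarantee the schema exists before the
# service queries it (in Puppet terms, have the service require the
# schema-creation resource).
conn.execute("CREATE TABLE IF NOT EXISTS topics (channel TEXT, topic TEXT)")
conn.commit()
print(conn.execute("SELECT count(*) FROM topics").fetchone()[0])  # 0
conn.close()
```

This also matches the observation that permissions were fine: nothing fails until the first query, so the empty file sits there looking healthy.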
[17:01:13] you should be able to validate that by looking at puppet.log, or in puppetboard
[17:01:20] but that seems quite plausible yes
[17:01:30] the service should definitely require the db
[17:01:35] copied the db from alert1001 to alert1002
[17:01:52] Nice, I'll look at those logs. I think it can be fixed by adding the require clause in Puppet.
[17:01:55] adjusted permissions and restarted service
[17:02:43] mutante: Thanks!
[17:02:57] I'm looking at the logs to see if a race condition was the cause.
[17:09:59] I may be missing something but I can't find the step where the DB is created: https://puppetboard.wikimedia.org/report/alert1002.wikimedia.org/b3a4b92de399543428841c8ed362d9168913cc6f
[17:11:21] And the logs for when the failover to alert2002 was done are gone, they must've been rotated already as it's been about a week since it happened.
[17:15:14] I hear there is a TODO to add rsync in that class that sets up the bot.
[17:15:25] that sounds like the intention was to copy data regardless
[17:15:41] even if sometimes we could get away without it
[17:15:52] so we could just do that and not worry about the race I guess
[18:22:37] cwhite: we are thinking of using [2024-09-16 logstash unavailability](https://docs.google.com/document/d/1Otic_JQbqHG0BNkMyU2dQ7K6yMSHUPxjgBnCsDIPdrk/edit#heading=h.95p2g5d67t9q) for Monday's Incident Review Ritual, and word has it that you are the best person to speak to that issue? Would you be present/willing if we scheduled it?
[18:24:13] urandom: Cole is OOO this week.
[18:26:11] denisse: thanks!
[19:44:37] hi, dumb question, what version of opensearch do we run for the instance backing logstash?
[19:49:14] ...
ah, looks like I can do GET _cluster/stats on https://logstash.wikimedia.org/app/dev_tools#/console
[19:49:22] 2.7.0 apparently :)
[20:34:10] FIRING: SystemdUnitFailed: opensearch-dashboards.service on logstash1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
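[editor's note] The `_cluster/stats` lookup can also be scripted. The sketch below parses a canned response shaped like `_cluster/stats` output (the `nodes.versions` field carries the running version(s)); the cluster name in the sample and the fetch URL in the comment are illustrative, since the real endpoint sits behind the authenticated dev-tools console.

```python
import json

def cluster_versions(stats: dict) -> list:
    """Pull the OpenSearch version list out of a _cluster/stats response body."""
    return stats.get("nodes", {}).get("versions", [])

# Against a live cluster you would fetch the JSON first, e.g. with urllib:
#   from urllib.request import urlopen
#   stats = json.load(urlopen("https://opensearch.example:9200/_cluster/stats"))
# (URL is illustrative; logstash.wikimedia.org requires authentication,
# hence the dev_tools console used above.)
sample = json.loads('{"cluster_name": "example-cluster", "nodes": {"versions": ["2.7.0"]}}')
print(cluster_versions(sample))  # ['2.7.0']
```

`nodes.versions` is a list because a cluster mid-upgrade can run more than one version at once, which is worth checking before assuming a single number.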