[07:07:36] greetings
[09:03:01] morning, this should be the last bit needed to enable infra-tracing-loki as backend into grafana
[09:03:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211610
[09:03:29] if anyone has time. The logs are already being collected, it seems without problems so far.
[09:20:44] morning
[09:36:58] the opentofu-infra-diff.service alert was my fault, I upgraded the tofu version yesterday. now it's fixed.
[09:37:51] ack thx
[09:37:59] this made me notice we should have a corresponding alert in codfw where the same unit has been failing for a while
[09:38:22] can you think of a reason why it's not firing? the host is cloudcontrol2006-dev
[09:50:51] the metric is there in thanos if I check 'node_systemd_unit_state{name="opentofu-infra-diff.service"}'
[09:51:24] might be disabled?
[09:53:20] the alert definition is in https://thanos.wikimedia.org/alerts
[09:53:44] if I copy-paste the expr, it returns 1, but it says "0 active"
[09:54:38] mmh I don't know enough to know by heart what's happening and have a meeting in 5
[09:54:41] sorry
[09:54:55] np it's not urgent but I'd like to understand what's going on
[10:09:51] dhinus: can you review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211083 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203042?
[10:32:09] taavi: on it
[10:58:13] thanks all for the reviews on the haproxy patch! what's the safe way of deploying it without breaking everything?
[10:58:41] (there is still a chance that the blackbox exporter check fails for infra-metrics-loki, but wouldn't be a big deal)
[10:59:00] I just want to make sure it doesn't break the rest of haproxy :)
[10:59:25] disable puppet in tools, merge and run puppet in toolsbeta, check it works, re-enable puppet
[11:02:22] ok as usual then :) thanks, will probably do after lunch to make sure I keep the disabled window small
[11:27:11] taavi: I'm not quite sure I understand why https://puppet-compiler.wmflabs.org/output/1211666/7768/ shows diffs that don't seem related to the change ?
[11:27:23] I was expecting noop
[11:29:25] godog: those are from the parent patch :/
[11:31:48] hah lol that explains
[11:38:18] * godog lunch
[12:37:42] deployed on toolsbeta, can curl locally on 30004. where can I check that there are no alerts?
[12:37:58] I've also tried https://admin.beta.toolforge.org/healthz that should hit haproxy AFAICT
[12:38:01] and works fine
[12:39:57] once puppet has run on the prometheus node, it'd be visible on https://prometheus.svc.beta.toolforge.org/tools/alerts
[12:42:40] ack thx
[12:46:28] ok to switch cloudcumin1001 to nftables (along with a reboot) now? I see a few idle tmuxes/screens, but no running cookbooks or similar
[12:46:33] mmmh it added the checks to probes-custom_puppet-http.yaml
[12:46:56] but I don't see them in the UI (not even other http stuff, maybe I'm not looking correctly)
[12:46:56] moritzm: ok on my end
[12:48:17] * volans logged out
[12:48:24] moritzm: no problem on my end
[12:48:25] volans: I see it https://prod-misc-upload.public.object.majava.org/taavi/6eRZ9OeHWy0hv.png
[12:49:16] now I see it, the search is a bit weird
[12:49:34] doesn't search on expr or filenames I guess
[12:50:57] ok, switching cloudcumin1001, will leave a note when it's rebooted and rearmed
[12:51:49] ack thanks
[13:05:45] cloudcumin1001 is back with nftables, keyholder has been rearmed and it's good to be used again
[13:06:16] great, thanks!
[13:08:29] taavi: quick question, clicking on the expression shouldn't give me some data? but I see also the other probe down don't so I guess it's not the right way
[13:08:31] taavi: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211651 firewall changes in that case will become effective after the second puppet run, is that right ?
[13:09:44] godog: sigh, the commit message is wrong, should be s/exported/virtual/
[13:10:08] taavi: I see, got it thank you that answers my question
[13:12:16] so no, all changes get applied with a single run
[13:23:57] ok, so curl-ing infra-tracing-loki.svc.toolsbeta... from toolsbeta works, curl-ing infra-tracing-loki.svc.tools... from tools works. curl-ing either from metricsinfra-grafana-2 doesn't. So I guess the internal endpoint is limited to its own project. Is there a way to expose it to metricsinfra without making it accessible from the internet?
[13:24:51] seems a firewall and/or routing issue AFAICT right now
[13:25:36] yep, it'll need security group rules defined in https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/modules/shared/kubernetes.tf?ref_type=heads#L13 to allow traffic from outside the project
[13:26:24] if opening it up to all of cloud vps is enough, then that's trivial (we can do something like https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/modules/shared/prometheus.tf?ref_type=heads#L16), but limiting to metricsinfra only without duplicating IPs in tofu is not easily doable
[13:31:48] I guess with basic auth it's probably fine opening to cloudvps, at least for now, what do you think?
[13:31:57] sounds ok to me
[13:33:07] ack, thanks for the pointers
[13:33:40] dhinus: I need to run wikireplica-dns either way so I'll run the cookbook for T404570 (even though you've claimed it some time ago)
[13:33:40] T404570: [wikireplicas] Create views for new wiki tokwiki - https://phabricator.wikimedia.org/T404570
[13:34:48] taavi: ok thanks!
[13:40:04] done: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/104
[13:41:23] left a comment
[13:41:55] it goes to haproxy, I guess I mixed up the naming?
[13:42:24] better to put haproxy instead of web proxies?
[13:42:52] "from the web proxies" in prometheus meant that the cloud vps web proxy sends traffic to the prometheus hosts
[13:43:12] ahhh sorry, yeah I clearly misread that part
[13:43:56] "Ingress traffic from to infra-metrics-loki from CloudVPS" would be more clear?
[13:44:02] s/from to/from/
[13:44:36] that's fine, I would maybe also have specified it's from metricsinfra
[13:45:03] ok I'll add that too, even if we're not limiting it
[13:45:53] {done} added also a TODO comment
[13:47:13] ship it
[13:47:49] <3
[14:34:35] sigh T411192
[14:34:36] T411192: wmcs-wikireplica-dns is horribly inefficient - https://phabricator.wikimedia.org/T411192
[14:37:19] sigh
[14:37:35] sensible-chuckle.png
[14:39:15] thanks I wanted to create that task the last time I ran it :D
[14:39:51] side note, we should stop creating .wmflabs records for new wikis
[14:42:01] good point, although I hope we can just deprecate them all at once, is there any real blocker? apart from random tools/projects still using them?
[14:44:23] I'm afraid the number of tools doing that is rather high
[14:47:16] :/
[17:51:10] if you get an error when recreating lima-kilo, see T411208
[17:51:11] T411208: [lima-kilo] error mounting docker cache - https://phabricator.wikimedia.org/T411208
[17:56:44] * dhinus off
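
Editor's sketch, re the 09:53 question (expr returns 1 but the alerts page says "0 active") and the 12:49 remark that the alerts UI search doesn't match on expr: one way to look at this outside the UI might be to read the JSON returned by the Prometheus-compatible rules endpoint (/api/v1/rules) and filter alerting rules by name or by query text, printing each rule's state and number of active alerts. This is only a rough sketch and does not explain why the codfw alert was not firing. The SAMPLE payload, the group name and the rule names below are made up for illustration; only the field names (groups, rules, type, name, query, state, alerts) follow the documented response shape.

```python
from typing import Any


def alert_rule_states(rules_response: dict[str, Any], needle: str) -> list[tuple[str, str, str, int]]:
    """Return (group, rule name, state, active alert count) for every
    alerting rule whose name or query text contains `needle`."""
    found = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no state or alerts, so skip them.
            if rule.get("type") != "alerting":
                continue
            name = rule.get("name", "")
            query = rule.get("query", "")
            if needle in name or needle in query:
                found.append(
                    (
                        group.get("name", ""),
                        name,
                        rule.get("state", "unknown"),
                        len(rule.get("alerts", [])),
                    )
                )
    return found


# Hypothetical payload shaped like an /api/v1/rules response.
# Group and rule names are invented, not taken from the real config.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-systemd",
                "rules": [
                    {
                        "type": "alerting",
                        "name": "ExampleSystemdUnitFailed",
                        "query": 'node_systemd_unit_state{state="failed"} == 1',
                        "state": "inactive",
                        "alerts": [],
                    },
                    {
                        "type": "recording",
                        "name": "example:node_systemd_unit_state:sum",
                        "query": "sum(node_systemd_unit_state)",
                    },
                ],
            }
        ],
    },
}

if __name__ == "__main__":
    for group, name, state, active in alert_rule_states(SAMPLE, "node_systemd_unit_state"):
        print(f"{group}/{name}: state={state} active={active}")
```

Matching on the query text as well as the name sidesteps the UI search limitation mentioned at 12:49. A rule that shows up here as "inactive" while its bare metric selector returns data would at least suggest checking the full rule expression and its label matchers, and whether a "for" duration applies.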