[01:58:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[05:58:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[08:03:06] morning folks, I want to merge https://gerrit.wikimedia.org/r/q/topic:%22T387291%22 (already reviewed by de.ni.sse, thx <3), that will migrate thanos-web and thanos-query to IPIP encapsulation so it would be great if someone is around to spot potential issues
[08:06:37] vgutierrez: SGTM, I'll be around this morning starting in ~1h
[08:08:27] RESOLVED: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[08:09:01] godog: perfect, I remember you saying that the thanos-web hack using `mh` is no longer necessary, if that's the case we could add the scheduler_flag: mh-port there as well to use both realservers
[09:07:35] vgutierrez: yes that's right, mh no longer needed and +1 to mh-port
[09:07:38] vgutierrez: I'm ready when you are
[09:07:53] I'll ping you as soon as I finish migrating apus :D
[09:08:06] and I'll amend the patch
[09:08:18] vgutierrez: ack
[09:12:52] godog: updated :D
[09:13:04] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122995
[09:13:59] neat, +1
[09:16:19] ok, I'm done with apus
[09:16:40] ok standing by
[09:17:26] luckily the cookbook takes care of all the moving parts :D
[09:17:41] very cool
[09:18:30] merging codfw CR
[09:19:10] cookbook will run puppet on the realservers and LVS, check that the realservers are accepting IPIP encapsulated traffic and restart pybal afterwards
[09:20:05] (running puppet on thanos[2001-2002] as we speak)
[09:23:36] godog: pybal@lvs2013 got restarted so thanos-web and thanos-query in codfw should be getting all its traffic IPIP encapsulated
[09:24:51] RX packets 139 bytes 17122 (16.7 KiB)
[09:24:54] not a lot.. but it's flowing :)
[09:25:33] vgutierrez: sweet! yes codfw doesn't see much traffic in EU morning hours
[09:25:48] cool, I'll proceed with eqiad
[09:26:05] ack
[09:32:26] eqiad done
[09:32:47] I'm seeing queries flowing
[09:33:01] cool :D
[09:33:11] thanos.w.o replies as expected, I think we're good
[09:33:25] godog: if you have the chance, we could finish thanos this morning, thanos-swift is the only one missing, CRs are ready for your review on https://gerrit.wikimedia.org/r/q/topic:%22T387293%22
[09:33:36] * vgutierrez brews some coffee
[09:33:48] vgutierrez: thank you, confusingly enough thanos-swift is data persistence not o11y
[09:34:09] * vgutierrez cries in ownership
[09:34:13] thx :D
[09:34:14] for histerical raisins
[09:34:36] I'll ping o11y again when I have the prometheus CRs ready :D
[09:34:55] ok!
[09:35:45] love the "histerical raisins", are they good? :-P
[09:36:12] lol, only in small quantities
[09:37:50] the hysterical raisins are godog wanting nothing to do with swift ever again ;p
[09:38:06] A position I can endorse, FTR :)
[09:39:22] haha!
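For context, a minimal sketch of the flow described at [09:19:10] (run puppet on the realservers and the load balancer, check that the realservers accept IPIP-encapsulated traffic, then restart pybal). This is not the actual SRE cookbook: only the spicerack primitives (Remote.query, Spicerack.puppet, PuppetHosts.run, RemoteHosts.run_sync) are real, while the function name, the IPIP check command and the pybal restart step are illustrative assumptions.

```python
from spicerack import Spicerack


def migrate_service_to_ipip(spicerack: Spicerack) -> None:
    """Rough shape of the flow only; not the real cookbook."""
    realservers = spicerack.remote().query('thanos[2001-2002].codfw.wmnet')
    lvs = spicerack.remote().query('lvs2013.codfw.wmnet')

    # Run puppet on the realservers and on the load balancer so the new
    # IPIP configuration is applied on both ends.
    spicerack.puppet(realservers).run()
    spicerack.puppet(lvs).run()

    # Check that the realservers are set up to accept IPIP-encapsulated
    # traffic (placeholder check; the real cookbook validates this differently).
    realservers.run_sync('ip -d link show type ipip')

    # Finally restart pybal so the service is switched over to IPIP forwarding.
    lvs.run_sync('systemctl restart pybal.service')
```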
[11:28:30] godog: do you wanna take care of kibana7 or should I wait for Mr Cole? https://gerrit.wikimedia.org/r/q/topic:%22T387301%22
[11:29:58] vgutierrez: I'm available now for another 15/20 min if that's ok with you
[11:30:18] godog: yup :D
[11:31:06] vgutierrez: ok patches +1'd
[11:31:16] awesome, thanks
[11:32:45] uh..
[11:32:46] RuntimeError: No hosts found matching {self.role} in {self.dc}
[11:33:02] that's a nice bug on the format string.. but why it's not finding any nodes? :)
[11:33:33] oh.. the role is logging::opensearch::collector
[11:33:38] not opensearch::collector
[11:33:40] my bad
[11:33:47] that's right yeah
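The un-interpolated {self.role} and {self.dc} in the RuntimeError above are the tell-tale sign of a format string missing its f prefix; a minimal reproduction follows (the class and attribute names are made up for illustration, not the cookbook's actual code):

```python
class HostFinder:
    """Illustrative only: shows the bug pattern, not the cookbook's real class."""

    def __init__(self, role: str, dc: str) -> None:
        self.role = role
        self.dc = dc

    def fail_like_the_log(self) -> None:
        # Bug: a plain string, so the placeholders are never interpolated and
        # the error literally prints "{self.role}" and "{self.dc}".
        raise RuntimeError("No hosts found matching {self.role} in {self.dc}")

    def fail_with_details(self) -> None:
        # Fix: an f-string, so the actual role and datacenter show up in the message.
        raise RuntimeError(f"No hosts found matching {self.role} in {self.dc}")
```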
[11:42:22] codfw should be getting traffic via IPIP already
[11:42:29] checking
[11:43:22] I can see the healthchecks and probes indeed, logstash2023
[11:44:20] cool
[11:46:13] (proceeding with eqiad)
[11:47:17] ack
[11:50:18] godog: traffic should be already flowing using IPIP
[11:51:24] vgutierrez: confirmed, I can load logstash.w.o
[11:51:49] lovely :D, thanks godog
[11:53:36] vgutierrez: sure no problem! thank you too
[14:15:28] sorry to keep hammering you folks, but hopefully by the end of today I'll be finished migrating o11y services, it's time for prometheus && prometheus-https, CRs are available here for your review: https://gerrit.wikimedia.org/r/q/topic:%22T387302%22 🍻
[14:17:58] vgutierrez: no worries, I'd rather take the hit all at the same time, I'm 13 minutes away from a meeting and ok to proceed if that's ok with you
[14:18:14] perfect
[14:18:54] patches LGTM
[14:19:23] and I missed the chance of doing logs-api at the same time as kibana7 :)
[14:19:39] lol
[14:19:43] thx :D, proceeding with codfw
[14:20:02] those two are handled by the same realservers :)
[14:23:38] puppet is kinda slow on prometheus nodes
[14:25:30] yes in the order of > 1 minute IIRC
[14:25:57] before nerd sniping kicks in: it is the puppetdb queries
[14:27:44] sigh..
[14:27:55] according to the cookbook puppet failed in two nodes out of 4?
[14:28:09] 50.0% (2/4) of nodes timeout to execute command 'run-puppet-agent ': prometheus[2005-2006].codfw.wmnet
[14:28:09] 50.0% (2/4) success ratio (< 100.0% threshold) for command: 'run-puppet-agent '. Aborting.: prometheus[2007-2008].codfw.wmnet
[14:28:12] * vgutierrez checking
[14:28:47] https://puppetboard.wikimedia.org/report/prometheus2005.codfw.wmnet/b89375deed366a5c1f3fd830b6979b26bd4e7d96
[14:28:52] I don't see what's wrong here
[14:29:18] vgutierrez: did not fail, "nodes timeout to execute command"
[14:29:28] FFS :)
[14:29:38] cookbook timeout is too low for puppet on prometheus nodes?
[14:29:52] timeout: int = 300
[14:29:52] https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.run
[14:30:13] if a host takes more than 5m I think we have quite some problem
[14:30:21] duration: 0:05:51.885000
[14:30:23] I'm jumping into a meeting, will keep an eye here
[14:31:24] IIRC when we set that timeout long time ago the slowest servers were around 1m
[14:32:27] ack, I'm manually running puppet on 2005-2006 to clear errors
[14:32:33] and proceeding manually :D
[14:32:39] I'll patch the cookbook before hitting eqiad
[14:32:51] loading facts took ~3 minutes btw
[14:33:41] unless I'm misreading puppet logs, but I guess that's a red herring, "Caching catalog" took 3m
[14:35:20] confirmed
[14:35:20] Puppet Compiled catalog for prometheus2005.codfw.wmnet in environment production in 157.81 seconds
[14:35:26] from puppetserver logs
[14:36:32] then took more than 1m to apply
[14:36:35] sigh
[14:36:36] volans: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1123384
[14:37:20] +1ed, sorry you hit that
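For reference, spicerack's PuppetHosts.run() linked above defaults to timeout=300 seconds, which the ~5m51s runs on the prometheus nodes exceed. A minimal sketch of passing a larger timeout; the 600-second value and the host query are illustrative, not necessarily what https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1123384 actually does:

```python
from spicerack import Spicerack


def run_puppet_with_headroom(spicerack: Spicerack) -> None:
    hosts = spicerack.remote().query('prometheus[2005-2008].codfw.wmnet')
    # PuppetHosts.run() defaults to timeout=300 seconds, but a run on these
    # hosts took 5m51s (fact loading and catalog compilation alone take
    # minutes), so give the command more headroom. 600 is an illustrative value.
    spicerack.puppet(hosts).run(timeout=600)
```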
[14:37:25] * vgutierrez manually validating that prometheus realservers can handle incoming IPIP traffic
[14:38:00] vgutierrez: can't you re-run the cookbook/
[14:38:01] ?
[14:38:11] yeah
[14:38:15] and wait some minutes :D
[14:38:29] I can't merge the cookbook yet...
[14:38:30] another option is also to wrap the puppet run with confirm_on_failure()
[14:40:27] I'm back FWIW
[14:42:04] godog: waiting on some CI stuff to finish
[14:42:12] prometheus nodes in codfw should be OK and ready to handle IPIP traffic
[14:42:34] everything feels slower when you are in a rush
[14:42:46] jokes aside you could add a --check-only flag ;)
[14:43:12] volans: I could do so many things :)
[14:43:32] vgutierrez: can confirm, https://prometheus-codfw.wikimedia.org/ loads fine
[14:44:16] cookbook CR merged, running puppet on cumin1002..
[14:46:46] (rerunning the cookbook now)
[14:53:12] https://www.irccloud.com/pastebin/ZTbKcYiV/
[14:53:22] definitely this is messing with me :)
[14:53:53] oh.. it worked on the second time.. so it needed a retry :)
[14:54:32] oh crap
[14:54:49] the retry logic applies to the whole function
[14:54:57] so it re-runs pupept
[14:54:59] *puppet
[14:55:03] FFS
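A sketch of the retry-scope pitfall just described: when the retry decorator wraps the whole step, a failure in the final check re-runs the puppet step too. spicerack.decorators.retry exists, but the function names, the decorator arguments and the validation command below are illustrative assumptions, not the cookbook's real code:

```python
from spicerack.decorators import retry


# Anti-pattern (what bit the cookbook here): the retry wraps the whole step,
# so a failure in the validation at the end re-runs puppet as well.
@retry(tries=3, exceptions=(RuntimeError,))
def configure_and_validate(puppet_hosts, realservers):
    puppet_hosts.run(timeout=600)                      # re-executed on every retry
    realservers.run_sync('ip -d link show type ipip')  # the part that may need a retry


# One way out: run puppet once, and retry only the validation step.
def configure_then_validate(puppet_hosts, realservers):
    puppet_hosts.run(timeout=600)
    validate_ipip(realservers)


@retry(tries=3, exceptions=(RuntimeError,))
def validate_ipip(realservers):
    realservers.run_sync('ip -d link show type ipip')
```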
[14:57:18] godog: port 80 should be reachable?
[14:57:47] vgutierrez: yes
[14:58:09] https://www.irccloud.com/pastebin/NT9hWNkL/
[14:58:28] mmhh ok checking
[14:58:46] https://www.irccloud.com/pastebin/edslGBpQ/
[14:58:54] port 80 is down in prometheus2007
[15:00:33] ok apache is listening on *, then next would be firewall
[15:01:08] vgutierrez: doh, ok my bad 200[78] are not in service yet
[15:01:24] ditto 100[78]
[15:03:27] vgutierrez: does that invalidate the tests/cookbook i.e. we have to fix port 80 to be reachable ?
[15:07:46] uh... yes
[15:08:00] ok checking
[15:08:10] but I can proceed manually
[15:08:38] given it's a known issue :)
[15:09:39] heh yes not sure I can puppet the solution before you are done manually tbh
[15:09:50] no problem, I'll proceed
[15:09:55] I'm guessing those two are depooled, right?
[15:10:00] correct yes
[15:10:13] and eqiad will be ok because 100[78] are still insetup
[15:10:20] role(insetup) that is
[15:10:48] ack
[15:14:31] godog: traffic should be flowing now in codfw using IPIP
[15:15:26] vgutierrez: confirmed
[15:16:03] found the port 80 problem btw, will be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123391
[15:18:09] lovely, proceeding with eqiad
[15:19:30] ack
[15:27:28] godog: cookbook validated both servers in eqiad, restarting pybal
[15:28:27] vgutierrez: ok
[15:32:58] godog: it's done :D
[15:35:29] vgutierrez: can confirm, all good \o/
[15:37:12] cool.. I'll submit the logs-api one soon :D
[15:40:14] ok, might as well finish yeah
[15:59:14] yup
[15:59:31] and the logs-api should be fairly easy given that the instances are already handling IPIP for kibana7
[16:06:48] indeed
[16:12:24] CRs ready on https://gerrit.wikimedia.org/r/q/topic:%22T387304%22
[16:14:18] as expected, on the realservers we only need to add clamping for the additional VIP
[16:17:41] FIRING: PrometheusLowRetention: Prometheus k8s-aux is storing less than 20 days of data on prometheus2005:9911. - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fk8s-aux - https://alerts.wikimedia.org/?q=alertname%3DPrometheusLowRetention
[16:19:12] ^ side effect of bootstrapping aux-k8s codfw afaik I'll mute
[16:20:15] vgutierrez: sorry I got distracted
[16:20:39] vgutierrez: would you mind mentioning me here so I get the highlight ?
[16:20:47] godog: yes sir
[16:22:05] vgutierrez: patches LGTM
[16:22:13] nice :D
[16:22:41] RESOLVED: PrometheusLowRetention: Prometheus k8s-aux is storing less than 20 days of data on prometheus2005:9911. - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fk8s-aux - https://alerts.wikimedia.org/?q=alertname%3DPrometheusLowRetention
[16:23:31] proceeding with codfw
[16:23:54] ack
[16:29:56] godog: codfw done
[16:30:57] vgutierrez: looks good to me
[16:31:17] ok, hitting eqiad then
[16:32:39] ok
[16:38:59] eqiad looking good, restarting pybal
[16:39:41] traffic should be flowing now via IPIP
[16:40:43] LGTM vgutierrez
[16:40:52] and unless I've missed any o11y service on https://phabricator.wikimedia.org/T373020 we are done with o11y services
[16:42:14] vgutierrez: yep I think we are
[16:43:17] nice, thanks :D
[16:46:28] neat, thank you too!
[21:26:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[21:29:34] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:35] FIRING: [3x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[21:32:23] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:35] FIRING: [4x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[21:41:35] RESOLVED: [4x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[22:14:34] FIRING: SystemdUnitFailed: statograph_post.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[22:17:23] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:35] RESOLVED: [4x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[22:22:23] RESOLVED: [2x] SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:22:31] cwhite: I just deployed a couple of patches that stop a flood of 600,000 logs per hour to logstash, it should show some impact on the devices and the infra https://logstash.wikimedia.org/goto/b61ad77ebc13933c5339841c71b95928
[22:23:05] Amir1: Cole is OOO, thanks for the patches!
[22:23:39] ah okay. Let me know if it's making a visible impact