[07:18:36] GitLab needs a short maintenance break in 30 minutes
[07:57:36] GitLab maintenance finished
[10:30:50] FYI, that OpenSSH 10.1 regression mentioned by Jesse yesterday is now fixed in the latest 10.1p2-2 upload to Debian
[12:07:01] `summary: WARNING: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h).` <- some ongoing activity that could've caused this?
[12:07:05] gerrit switch?
[12:50:38] fabfur: not that I'm aware... was that on a specific lvs host?
[12:53:57] https://usercontent.irccloud-cdn.com/file/6SxBDMkr/image.png
[12:57:18] lvs1019 logs show this change in pybal.conf: https://phabricator.wikimedia.org/P83716
[12:57:36] wdqs changes.. they seem reasonable, though not sure why the restart wouldn't have happened?
[12:57:45] yep
[12:58:22] should it restart automatically?
[13:37:09] no, we don't restart pybal automatically
[13:37:22] that's why we added the alert, since we have in the past forgotten to restart it :P
[13:40:01] inflatador: ryankemper: ok to restart pybal for the wdqs changes?
[13:40:18] (I mean we kinda have to but checking with you first)
[13:49:08] cool thanks for confirming... yeah I could see why we might want to avoid auto-restart of the service
[13:53:17] topranks: thanks for checking the logs <3
[14:02:04] sukhe sure, sorry I just got this
[14:02:52] inflatador: no worries, topranks pasted the diff above, does that seem fine to you? I can take care of restarting pybal
[14:04:37] sukhe yes, feel free to restart, I'll keep an eye out. btullis ^^ just a heads-up, this is probably from the ticket where we disabled the plaintext ports for wdqs
[14:05:01] inflatador: ok thanks, I am going to do it now
[14:05:20] Ah, sorry. That's on me. I had forgotten about the pybal restart requirement.
[14:05:26] no worries
[14:05:37] but yeah, any pybal change requires a manual restart
[14:06:38] I'll be watching QPS for the internal WDQS endpoints https://grafana.wikimedia.org/goto/cP565S6Ng?orgId=1
[14:10:27] inflatador:
[14:10:27] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-internal-main_443
[14:10:27] wdqs1025.eqiad.wmnet: enabled/partially up/not pooled
[14:10:27] wdqs1026.eqiad.wmnet: enabled/partially up/pooled
[14:10:45] is wdqs1025 expected to be not pooled?
[14:11:14] wdqs1026 as well
[14:11:14] we did a data transfer yesterday, if that broke it might be the reason. Checking
[14:12:23] no, neither should be depooled
[14:12:29] they are all pooled in confctl
[14:13:06] sukhe does the above mean they're failing health checks? 'cause yeah, they seem to be pooled in confctl
[14:13:19] I suspect we are missing something else here
[14:13:29] yeah, wdqs-internal-main:
[14:13:33] port: 80
[14:13:35] ProxyFetch:
[14:13:35] url:
[14:13:35] - http://localhost/readiness-probe
[14:13:50] I am guessing you are not listening on 80 anymore, right?
[14:14:49] we are, but I wonder if it's firewalled off now. Checking
[14:15:12] ok. I have restarted only the backup so far, so we will wait for it to be resolved before moving on to the primary low-traffic
[14:15:22] no, it's wide open. and I get the expected output if I curl from wdqs1026 anyway
[14:15:44] arguably we shouldn't be doing plaintext traffic over our networks if we can help it, even dc-local.
[14:16:34] agreed, we're trying to get rid of it. But it sounds like we missed something
[14:16:49] I can get a patch up for changing that health check, I assume it is in Puppet?
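(A minimal reconstruction of the health-check stanza being discussed, pieced together from the fragments pasted at 14:13; the nesting and surrounding keys in hieradata/common/service.yaml are approximations, not a verbatim copy of the file.)

```yaml
# Sketch only -- reconstructed from the 14:13 paste, not the actual service.yaml entry.
wdqs-internal-main:
  # ...
  port: 80                                  # quoted verbatim in the chat above
  ProxyFetch:
    url:
      - http://localhost/readiness-probe    # plaintext probe pybal was still using,
                                            # while the pool itself is wdqs-internal-main_443
```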
[14:17:21] inflatador: yes in hieradata/common/service.yaml
[14:17:25] look for wdqs-internal
[14:17:34] but I am checking on why that's failing though
[14:17:53] sukhe will do, but I'm not sure it'll help since port 80 is still listening/open in fw
[14:18:35] hmm yeah healthcheck also seems happy
[14:18:40] well at least the manual one
[14:18:50] true
[14:18:58] and 200
[14:21:24] as currently configured ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#3138 ) , is the health check expected to come in directly on port 80 or on 443?
[14:22:14] we're looking at the `wdqs-internal-main` and `wdqs-internal-scholarly` services. If there is still a `wdqs-internal` service that's a mistake
[14:23:23] inflatador: you need https:// not http:// in the ProxyFetch url
[14:23:45] does that service having `encryption: false` affect anything about that?
[14:24:29] good catch, both of you! I'll get a patch up
[14:24:31] btullis: what was changed yesterday? wdqs-internal-main?
[14:24:40] taavi: yes
[14:24:49] 68 │ $scheme = $service['encryption'] ? {
[14:24:51] 69 │ true => 'https',
[14:24:53] 70 │ default => 'http'
[14:24:55] 71 │ }
[14:25:01] Does the TLS cert need to be valid for localhost? Because it's not ATM
[14:25:13] It was this patch but it was today: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187772
[14:25:14] inflatador: wdqs-internal-scholarly also needs an update, but did you change that yesterday?
[14:25:23] Sorry, in meeting right now.
[14:26:29] sukhe that would have been b-tullis, but yes both services need their health checks fixed
[14:29:09] OK, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194956 is ready for review
[14:29:41] I will look shortly
[14:32:01] hmm that's failing compile
[14:33:52] > Evaluation Error: Unknown function: 'group_by'. (file: /srv/jenkins/puppet-compiler/7623/change/src/modules/wmflib/functions/service/get_lvs_class_hosts.pp, l
[14:34:07] I remember this one because, well, I wrote that function. try with Puppet 7
[14:35:15] (meeting ending soon)
[14:38:25] looking now, back
[14:39:08] puppet 7 run is clean https://puppet-compiler.wmflabs.org/output/1194956/5107/
[14:41:38] so btullis' earlier patch
[14:41:45] changed both the services, so that at least adds up
[14:42:19] that's also in the paste which Cathal shared above but yeah, I didn't notice scholarly there
[14:42:32] so yeah, both are failing in theory
[14:43:30] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-internal-scholarly_443
[14:43:30] wdqs1027.eqiad.wmnet: enabled/partially up/pooled
[14:44:40] The CR updates internal-scholarly settings as well
[14:44:48] yep merging let's see
[14:48:08] looks good
[14:48:10] confirming again
[14:48:50] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-internal-main_443
[14:48:50] wdqs1025.eqiad.wmnet: enabled/up/pooled
[14:48:50] wdqs1026.eqiad.wmnet: enabled/up/pooled
[14:48:56] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-internal-scholarly_443
[14:48:56] wdqs1027.eqiad.wmnet: enabled/up/pooled
[14:50:49] {◕ ◡ ◕}
[14:51:54] Awesome, I don't see any changes in QPS...pretty sure all the clients changed to HTTPS a while back
[14:52:17] wait...I take that back. Internal scholarly DID drop to zero
[14:52:29] but that was ~20m ago?
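(The actual fix is the Gerrit change 1194956 linked above and is not reproduced in the log; a minimal sketch of the edit being described, assuming the same key names as the earlier paste, would be:)

```yaml
# Hypothetical before/after for the ProxyFetch URL; the real diff is in
# https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194956 and may differ in detail.
wdqs-internal-main:        # wdqs-internal-scholarly gets the same treatment
  ProxyFetch:
    url:
      # - http://localhost/readiness-probe     # old: plaintext probe
      - https://localhost/readiness-probe      # new: https://, per taavi's suggestion above
```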
[14:52:29] I haven't restarted 1019 yet
[14:52:43] https://grafana.wikimedia.org/goto/g4yaoIeNR?orgId=1
[14:54:02] not that this should affect this anyway since we restarted the backup LVS and not primary low-traffic
[14:54:17] but the first restart was at 14:08 UTC
[14:54:20] https://sal.toolforge.org/log/B8tNyZkB8tZ8Ohr0GN_4
[14:55:01] interesting, that does not match the time of the traffic drop (~1330 UTC)
[14:55:02] whereas the drop above happens at ~13:18 UTC
[14:55:04] yeah
[14:55:31] does the metric need to be updated or something? I have no idea about that
[14:55:37] but we should check that before we go on restarting the primary lvs
[14:55:38] doesn't match https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187772 either
[14:56:15] yeah, that was much earlier, and plus that doesn't change anything on the LVS side until pybal is restarted
[14:56:26] unless, someone restarted pybal around 13:20ish and did not log it
[14:56:45] yup, this might be unrelated. Don't let this stop you from completing the LVS stuff, I'll start looking
[14:59:33] ok. I am going to check a few things before moving ahead but yeah
[15:11:15] so the last non-monitoring-related query I see on internal-scholarly (1027) is at `09/Oct/2025:13:17:39` and it's from `10.67.233.115` which has no reverse record, guessing it's a k8s IP?
[15:12:04] inflatador: k8s endpoint IPs (pod IPs that match a service spec, basically) should get reverse-resolved nowadays
[15:12:19] I think we're gonna need to find our clients before they find us ;) . What's the best way to do that, Superset?
[15:12:40] 10.67/16 is k8s eqiad IP space though
[15:13:21] yeah, netbox says it's part of wikikube pod IPs
[15:15:21] pod no longer running :)
[15:16:59] kubectl get pod -A -o wide | fgrep 10.67.233.115
[15:20:50] as for your other question ... where do you keep wdqs access logs? if this traffic isn't going via the CDN it's not in superset
[15:21:30] OK, then it's probably not in superset. The access logs are all local to the host. I guess it's possible they're in logstash too. Checking
[15:22:37] btw, the thing which that `encryption` setting changes is the URL generated into the service mesh config
[15:23:29] I didn't quite parse that ;)
[15:23:42] inflatador: https://phabricator.wikimedia.org/P83721
[15:24:14] similar diff for internal_scholarly
[15:24:56] OK, so adding `encryption:true` to the service definition caused that config to be added
[15:24:58] ?
[15:26:52] probably yes
[15:27:09] you can do a pcc that includes mwdebug1001.eqiad.wmnet
[15:28:11] inflatador: restarted pybal on lvs1019 as well
[15:28:17] sukhe@lvs1019:~$ curl localhost:9090/pools/wdqs-internal-scholarly_443
[15:28:17] wdqs1027.eqiad.wmnet: enabled/up/pooled
[15:28:17] sukhe@lvs1019:~$ curl localhost:9090/pools/wdqs-internal-main_443
[15:28:17] wdqs1025.eqiad.wmnet: enabled/up/pooled
[15:28:17] wdqs1026.eqiad.wmnet: enabled/up/pooled
[15:28:47] (last restart was ~2 weeks ago, so yeah, that's probably one more confirmation)
[15:32:07] I can't `curl https://wdqs-internal-main.discovery.wmnet/readiness-probe` from cumin2002, that seems like it should work (and it does work for wdqs-main.discovery.wmnet)
[15:32:47] `wdqs-internal-main.discovery.wmnet` resolves to 10.2.1.93 which matches the DNS repo
[15:33:45] looks like we have at least one client in the wikikube cluster https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/kartotherian/values.yaml#77
[15:35:39] but it'd be hitting the mesh instead of LVS
[15:37:22] inflatador: I have to do another deploy now but can come back to this later
[15:37:37] we will need to go through the service definitions and the VIPs to check what is happening with the above
[15:37:39] hard to say
[15:37:40] inflatador: the mesh envoys do contact the internal LVS IP
[15:37:45] sukhe np, we may need your help at some point but don't feel like you have to actively work on it
[15:37:52] `encryption` changes the scheme with which they do so
[15:38:08] inflatador: na it's all good, LVS work is on us anyway to guide through if not implement the bits
[15:39:56] fabfur: jhathaway: starting the hcaptcha work, no alerts expected but if they happen, that's on me
[15:40:16] thanks sukhe
[15:41:32] cdanis ack, thanks for the correction. `wdqs-internal-main`, which is used by kartotherian, is still getting traffic. `wdqs-internal-scholarly` is getting no traffic, at least based on https://grafana.wikimedia.org/goto/vQ55xI6Hg?orgId=1 . No one is screaming, no alerts are firing, etc
[15:41:40] ack
[15:43:07] thanks sukhe
[15:43:40] For all we know, this could be non-user-impacting. I'm gonna try and dig up some tickets from the graph split migration and get a handle on our internal clients, don't feel like y'all have to activate on this
[15:51:50] fabfur, jhathaway: heads-up: we're about to shift 10% of enwiki's rest.php traffic to the rest-gateway. All of the other group migrations have been uneventful, but this is unsurprisingly going to be a fair whack of traffic
[15:52:12] 🤞
[15:52:17] thanks
[16:03:07] sukhe: hey how are things in relation to that lvs stuff?
[16:03:09] cc brett
[16:04:03] reason I ask is Valerie is on site and wanted to move the cable coming from lvs1020 from rack F1 to E1 (T404959)
[16:04:03] T404959: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959
[16:04:36] topranks: bret.t is out after the on-call shift
[16:04:46] let me take a look at this in a bit?
[16:04:53] rolling out a deploy right now and can comment after that
[16:04:54] that fine?
[16:04:55] but the move will mean the backup LVS will be disconnected from row E/F vlans for an hour or so.... obviously she is on site and I am online so if something goes wrong we can move it back in a hurry
[16:05:01] sukhe: yeah no worries
[16:05:20] topranks: how long is she around for and are we doing it today?
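(Pulling the `encryption` discussion together: a rough summary sketch, not the real service.yaml, of the single flag and the two generated configs it feeds according to the chat.)

```yaml
# Summary sketch only -- the two consumers are listed as described in the chat,
# not verified against the Puppet code itself.
wdqs-internal-main:
  encryption: true
  # 1. LVS side: the $scheme selector pasted at 14:24 maps this flag to
  #    https/http when URLs are generated for the lvs hosts.
  # 2. Mesh side: per cdanis, the same flag sets the scheme in the service-mesh
  #    (envoy) config that clients like kartotherian use to reach the internal
  #    LVS IP (diff in https://phabricator.wikimedia.org/P83721).
```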
[16:05:24] both work but just checking
[16:05:25] I guess the question is if things are in a "normal" situation now. Or if you anticipate swinging traffic to the backup lvs over the next while
[16:05:49] she is around now... I don't know, I gotta be online too so would rather we started sooner than later
[16:06:10] when you get a minute ping me back anyway, keep at your deploy for now thanks
[16:06:19] thanks
[16:22:41] topranks: ok sorry
[16:22:46] so the plan is to do this today?
[16:23:24] yeah no ongoing work from us anymore I guess so if you and Valerie are fine, let's do it?
[17:10:07] well I'm not sure what happened, but WDQS internal main and scholarly traffic recovered about 40m ago https://grafana.wikimedia.org/goto/i2j2sSeHg?orgId=1
[17:12:23] inflatador: can you make sure that the Wikidata Platform team is aware of the interruption?
[17:12:29] Thx!
[17:25:17] Yep, linked 'em to https://wikimedia.slack.com/archives/C055QGPTC69/p1760025201445329