[08:48:58] re: docker system prune on build2001
[08:48:59] elukey@build2001:~$ sudo systemctl list-timers | grep prune
[08:48:59] Wed 2024-12-11 03:38:16 UTC 18h left Tue 2024-12-10 03:57:05 UTC 4h 51min ago docker-system-prune-dangling.timer docker-system-prune-dangling.service
[08:49:03] Sun 2024-12-15 03:49:38 UTC 4 days left Sun 2024-12-08 03:44:25 UTC 2 days ago docker-system-prune-all.timer docker-system-prune-all.service
[08:49:25] we already do it periodically, so there is the option to just kick off the units manually
[08:50:21] kicked off the -all unit, hopefully it will clean up some space
[08:50:54] root is now ~37%
[08:56:26] mutante, klausman - re: home dirs never deleted - it is not ideal, but at the same time it doesn't really impact us that much, and when we do the reimages the cleanup happens automatically. Whoever syncs over the homedirs could review/drop old users if needed, but we can create a workflow. If you have ideas please open a task and propose :)
[09:07:20] we could however bump disk space for build2002 to e.g. 400G; it's still WIP anyway, and it's much easier to do now than retroactively
[09:08:10] we tend to build more images, and some of the images are also getting bigger (like the ML toolchains)
[09:09:03] build2001 (bullseye) and build2002 (bookworm) will co-exist in the mid term, as 2001 will still be needed to build Python 2 things
[10:10:07] thanks, both of you!
[10:10:33] I had looked for cronjobs for image cleaning, but totally forgot about systemd timers
[17:24:19] Not sure if there is a better place to send this, but WCQS has been down for a couple of hours at least. https://commons-query.wikimedia.org/
[17:24:42] Just getting an "upstream request timeout" error.
[17:24:57] DominicBM: thanks for the report
[17:25:34] inflatador, ryankemper, any idea why wcqs would be down ^
[17:27:00] Lucas_WMDE: itamarWMDE: because I know you are related to this service
[17:29:13] Well, it's now alternating between that 504 error above and also 502 Bad Gateway, if that's relevant.
[17:33:29] herron: You didn't get to depool, right?
[17:33:46] marostegui: no, I did commit the depool
[17:34:15] herron: Ok, I will take care of the repooling. Thank you
[17:34:22] kk thanks marostegui!
[17:34:40] marostegui: also https://phabricator.wikimedia.org/T381901 fwiw
[17:34:45] thank you
[17:36:05] looks like WCQS was affected by maintenance last week https://phabricator.wikimedia.org/T350793 ; a rollback was attempted, but that didn't fix the problem.
[17:36:27] https://phabricator.wikimedia.org/T350793#10378435
[17:50:23] AntiComposite dcausse :eyes
[17:55:59] none of that seems to have had any effect on the query rate though...
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wcqs&viewPanel=18
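(A minimal sketch of the manual cleanup flow discussed at 08:49-08:50 above; the unit names come from the `list-timers` output in the log, while the exact invocation and the follow-up check are assumptions.)

  # Confirm the periodic prune timers exist on the build host
  sudo systemctl list-timers | grep prune
  # Kick off the "-all" prune manually instead of waiting for the timer (assumed invocation)
  sudo systemctl start docker-system-prune-all.service
  # Check that root usage dropped afterwards (the log reports ~37% after the run)
  df -h /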
[18:02:20] The web frontend (commons-query.wikimedia.org) is being mapped to https://wcqs.discovery.wmnet , I would expect the redirect to go to miscweb a la query.wikidata.org , ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#257
[18:02:35] inflatador: it has to go through wcqs.discovery.wmnet due to the auth
[18:03:41] inflatador: iirc wcqs nginx then reflects non-backend requests to miscweb
[18:03:49] ebernhardson ACK, so much for that theory then
[18:05:10] inflatador: look at /var/log/query_service/wcqs-blazegraph.log on the instances, looks like some sort of problem connecting to sessionstore
[18:05:10] WCQS nodes are all pooled in discovery and passing health checks according to confctl/pybal
[18:05:31] essentially this service uses sessionstore to remember temporary state
[18:06:57] although those errors go back as far as the logs go, to Nov 10th. Maybe it's normal? Seems worth checking
[18:08:43] https://grafana.wikimedia.org/goto/P70DBXVNg?orgId=1 error rate from envoy seems normal
[18:09:07] maybe there's a way to check from the sessionstore side?
[18:11:26] hmm, not sure if that's it, they might just be intermittent errors :S I have the log for all 3 servers open with `tail -f` and running a public request doesn't add to the logs. There must be some other log though, because my public request gets a 500
[18:11:26] inflatador: from wcqs logs: I/O exception (java.net.SocketException) caught when processing request to {s}->https://sessionstore.discovery.wmnet:8081: Broken pipe (Write failed)
[18:11:56] yeah, I'm seeing that too...curling the endpoint gives a response, so probably not firewall
[18:13:04] Let me check the network probes dashboard, maybe we can get some idea when this started since the logs don't go back far enough
[18:16:40] no, I don't see any active monitors for commons-query
[18:17:56] oh, curious. If I visit wcqs-query from a logged-out browser I get redirected to commons, but visiting from my normal browser with an auth token it fails. Suggests oauth middleware, which probes probably bypass
[18:20:34] have to step out but will get back to this in ~40. For lurkers, WCQS is not considered a production service
[18:20:44] manually performing the auth, it looks like it validates the wcqsOauth token and issues a wcqsSession token, but then the request with a wcqsSession token just hangs. Which is especially odd because the wcqsSession is a stateless JWT
[18:24:56] ahh, I bet this is it. This request from a wcqs instance hangs: wget --header='Host: commons-query.wikimedia.org' https://webserver-misc-sites.discovery.wmnet/
[18:26:02] that would fail after auth, because with a passing auth it then proxies the request to webserver-misc for static assets
[18:37:15] plausibly: in https://gerrit.wikimedia.org/r/1071073 miscweb was changed to limit srange to the caches. Would suggest broken since ~Sept 6
[18:37:48] uhm, if that's the case I am here and can revert that
[18:38:19] I thought it has been a production service ever since it was moved off of the beta site in 2022? e.g. https://phabricator.wikimedia.org/T296470
[18:38:47] it's more like November 19!
[18:39:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092827
[18:39:06] would that match?
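(A sketch of the checks described between 18:05 and 18:24, run from one of the wcqs instances; the log path, hostnames, and Host header are taken from the log, while the curl flags and timeout are assumptions.)

  # Watch the Blazegraph log for the sessionstore errors quoted above
  tail -f /var/log/query_service/wcqs-blazegraph.log | grep -E 'sessionstore|Broken pipe'
  # Reproduce the hanging static-assets fetch: request miscweb directly with the public Host header.
  # If the srange/firewall change is the culprit, this times out instead of returning the page.
  curl -skv --max-time 10 -H 'Host: commons-query.wikimedia.org' https://webserver-misc-sites.discovery.wmnet/
  # For comparison, sessionstore itself still answers, suggesting the Broken pipe errors were a red herring
  curl -skv --max-time 10 https://sessionstore.discovery.wmnet:8081/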
[18:39:35] mutante: unfortunately I'm not sure when it started, I haven't been able to convince anything in the backend to generate logs
[18:39:36] if something needs to access that VM that isn't CACHING servers, that would be an issue
[18:40:03] hold on, I am ok just testing this
[18:40:04] generally what's happening is the request goes to the wcqs.discovery.wmnet servers, then those forward static asset requests to miscweb.
[18:40:17] miscweb legacy or miscweb kubernetes though?
[18:40:40] like, the ganeti VMs, right?
[18:40:47] hmm, config is `proxy_pass https://webserver-misc-sites.discovery.wmnet;`
[18:41:15] yea, that is "legacy"
[18:41:18] mutante: I think so, but haven't followed miscweb
[18:41:26] ok, give me a minute
[18:42:28] I'm back if I can do anything to help
[18:43:09] DominicBM Sorry, the word "production" is a bit overloaded. I mean it in the sense that we don't have a formal SLO, don't page, etc. We can (and should) do a better job of monitoring the service, however
[18:43:30] inflatador: I would guess the problem here is that we monitor the backend, but we don't monitor access to the frontend assets
[18:43:41] DominicBM: relatedly, there are discussions regarding the future of this service at https://phabricator.wikimedia.org/T376979 - please feel free to comment there
[18:44:40] the "old" production means ganeti VMs, the new production is kubernetes. Both are production but different; also what inflatador said
[18:45:04] things are moving over, but not all at the same time
[18:45:34] ebernhardson yeah, that's what it looks like...we do monitor the `query.wikidata.org` web UI, so we should probably do the same for `commons-query`
[18:45:39] please check if it's fixed now by any chance
[18:45:40] will get a patch up for that shortly
[18:45:50] mutante: yup, page loads
[18:45:55] +1, WFM
[18:45:57] heh, yea, so that was firewalling
[18:46:11] sorry about that, but I did not expect anything to access that besides caching servers
[18:46:18] like other misc sites
[18:46:24] it's a reasonable guess, too many snowflakes
[18:46:39] I reverted the change that limited it
[18:46:42] we keep hoping to move this behind a standardized API gateway, but haven't really investigated it much
[18:46:54] so either this moves fully over to k8s soon, and then this is irrelevant
[18:47:00] or we can make the rules better..either or
[18:47:17] like if you know what exactly needs to access it..
[18:47:53] but it's not urgent for today, happy it's fixed
[18:48:04] mutante: sure, I'll start up a ticket
[18:48:05] ACK, thanks mutante !
[18:48:12] :)
[18:49:09] so yea, this means the downtime was from Nov 19
[18:49:56] dcausse: Thanks, I guess I have been subscribed to that for a while, but haven't gotten around to saying anything...
[18:50:00] well, actually, that's the maximum, and I'm not sure because it might have been switched to k8s and then back later
[18:50:36] I should, though at the moment I'm not highly motivated to do so; while feedback has been offered in several other places over the years, it's been made pretty clear there will be no further improvements anyway, at least that's how it feels.
[18:50:54] But I also appreciate the attention being given to keep it up, such as it is! :D
[18:52:32] DominicBM: things often change though, and not all feedback is an official statement
[18:54:23] For sure, it's a good reminder! Thanks again, I am logging off now for other work, just didn't want to leave sounding mad, because your efforts are appreciated.
:)
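(A quick spot-check matching the "page loads" / "WFM" confirmations above; the URL is from the log, the expected response is an assumption - a logged-out request gets redirected to Commons for auth, so any non-5xx status is a good sign.)

  # Expect a 2xx/3xx from the public frontend after the revert, rather than the earlier 502/504
  curl -sI https://commons-query.wikimedia.org/ | head -n1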
[19:54:15] OK, created https://phabricator.wikimedia.org/T381918 for commons-query.wikimedia.org monitoring
[20:01:09] small CR to clean up some host hieradata if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101888
[20:05:43] +1ed
[22:41:31] jhathaway Thank you very much