[08:40:06] o/
[08:46:08] o/
[10:12:18] that's what I plan to run to salvage the data when we're close to merging gmodena's retention patch: P74203
[10:12:51] tested using a copy of the data in my user space, ran the insertInto twice to be sure, it seemed to have worked fine
[10:13:23] hm stashbot does not like paste
[10:13:25] https://phabricator.wikimedia.org/P74203
[10:23:23] dcausse ack!
[10:23:57] dcausse should I roll back the changes for filtering partitions, and let the training happen on the full dataset?
[10:24:03] so to unblock you
[10:24:30] I'm having some issues with mjolnir CI, that might take me a while to troubleshoot
[10:24:32] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/460011/viewer
[10:24:53] gmodena: if that's ok with you? unless you believe that's an easy enough change to add to mjolnir?
[10:25:01] dcausse the script LGTM!
[10:26:11] re CI, you tried twice already, sigh...
[10:26:20] the change _should_ have been easy but CI is not happy (not sure if related TBH) https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/11/diffs
[10:26:48] the main thing would be refactoring mjolnir to allow a different training window per wiki
[10:27:07] gmodena: let's try to fix CI and get this filtering merged then?
[10:27:19] sounds good!
[10:27:32] we can work on a more granular per-wiki filter after, I suppose
[10:28:28] last successful CI run was 3 months ago with https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/10
[10:28:30] yep. That'd require refactoring mjolnir
[10:28:56] nothing too hard I guess, but large enough to deserve its own phab task IMO
[10:29:03] +A
[10:29:08] 1 even
[10:29:14] :)
[10:29:27] locally mjolnir's main is failing for me
[10:29:36] with a bunch of Exception: Java gateway process exited before sending its port number
[10:29:46] but it could be a macos/java issue
[10:29:56] not sure I ever tried to run the build locally, let me see
[10:30:13] ack
[10:38:45] conda's so slow... wondering if my setup is broken
[10:42:54] still spinning in "Solving environment", 100% cpu usage, 4.2G mem
[10:47:35] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/460069 is green. There was a flake8 nit that got swallowed in the log output.
[10:48:33] dcausse conda is a pain. the meta for speeding up dep resolution is to change the resolver to conda-libmamba-solver
[10:48:34] https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community
[10:49:01] gmodena: ah thanks!
[10:50:14] i ditched conda for miniforge (FOSS fork); that one bundles the new resolver
[10:51:46] can't even remember what I installed, just remember it was painful :)
[10:57:28] :)
[10:59:21] lunch
[11:01:46] mmm... query_clicks_daily is stuck waiting on deps since march 6 https://airflow-search.wikimedia.org/dags/query_clicks_daily/grid?dag_run_id=scheduled__2025-03-06T00%3A00%3A00%2B00%3A00
[11:01:58] spotted it while testing mjolnir's patch
[11:10:05] we're missing two partitions for 2025-03-05 in query_clicks_hourly
[11:10:11] hours 4 and 5 are not there
[11:11:38] 11 and 12 are also missing
[11:12:24] https://airflow-search.wikimedia.org/dags/query_clicks_hourly/grid?execution_date=2025-03-06+12%3A00%3A00%2B00%3A00&dag_run_id=scheduled__2025-03-06T04%3A00%3A00%2B00%3A00&tab=graph&task_id=wait_for_cirrus_requests
[11:17:27] they failed waiting on wait_for_cirrus_requests partitions that are now available
[11:18:55] I cleared the query_clicks_hourly failed tasks, let's see if they'll get picked up.
[11:21:04] mmm... query_clicks_hourly is instantiated with max_active_runs=1. That might end up with the queued task starving :|
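For context, a minimal, hypothetical Airflow sketch of the pattern being discussed (not the actual query_clicks_hourly DAG; the dag_id, task ids, sensor callable and timeout are assumptions): with max_active_runs=1, a run stuck waiting in a sensor keeps every later scheduled run queued behind it until it is cleared or times out.

```python
# Hypothetical sketch, not the real query_clicks_hourly DAG.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.python import PythonSensor


def _partition_exists(**_):
    # Placeholder for the real check against cirrus_requests partitions.
    return False


with DAG(
    dag_id="query_clicks_hourly_sketch",  # hypothetical id
    schedule_interval="@hourly",
    start_date=pendulum.datetime(2025, 3, 1, tz="UTC"),
    catchup=True,
    max_active_runs=1,  # only one run in flight; later runs queue behind it
) as dag:
    wait_for_cirrus_requests = PythonSensor(
        task_id="wait_for_cirrus_requests",
        python_callable=_partition_exists,
        mode="reschedule",       # free the worker slot between pokes
        timeout=60 * 60 * 24,    # fail after a day instead of waiting forever
    )
    compute = EmptyOperator(task_id="compute_query_clicks")
    wait_for_cirrus_requests >> compute
```

Clearing the failed sensor tasks (as done above) lets the blocked run finish, after which the queued runs can start.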
[12:19:29] dcausse / brouberol : we have an incident that might relate to CirrusSearch. Any chance you could help?
[12:19:54] Can you have a look in _security?
[12:23:58] query_clicks_hourly and query_clicks_daily have caught up
[12:24:31] I can assist and follow, but I'm not knowledgeable enough to take point FWIW
[12:24:42] cc gehel dcausse
[12:49:51] dcausse, gmodena : did we recently change something in the SUP that could explain increased pressure on the databases? Did we increase parallelism?
[12:53:15] gehel we did increase parallelism (2 -> 3 task managers) for consumer-search
[12:53:30] gehel where do we see increased pressure? Which DBs?
[12:54:03] we might be generating more requests to MW Action APIs, and increased writes to ES
[12:54:24] the increased requests to MW Action API might be the issue
[12:56:02] gmodena: was this mainly to keep up with the initial load of article country?
[12:56:18] can we reduce that back 3->2?
[12:56:50] gmodena: additional conversation in the security channel, I'll see if I can find someone to invite you
[12:56:59] AFAIK yes
[12:57:22] if we are done with article country, it should be safe to shrink the flink deployment
[12:57:42] checking consumer lag metrics rn
[12:57:48] gmodena: could you prepare that change?
[12:57:54] gehel yep
[12:58:13] brouberol has paused the SUP and the DB overload has resolved, so there is at least a pretty strong correlation.
[12:59:38] SUP is currently paused, so we're not ingesting anything into the search indices, which is obviously not something we want to keep for too long.
[13:16:40] :/
[13:24:43] dcausse: you missed all the fun!
[13:25:09] trying to understand what happened
[13:25:20] dcausse: but there is still some fun happening in #wikimedia-sre
[13:26:16] I don't think we understand well what has been happening, but the SUP seems to have been a contributing factor in the excessive load on mw-api and the underlying databases
[13:27:15] eqiad or codfw?
[13:27:39] I'm actually unsure
[13:28:07] brouberol might have understood more. Or ask in -sre
[13:28:31] the rate of cirrus jobs remained unchanged for quite a while https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-kubernetes_namespace=cirrus-streaming-updater&var-app=All&var-destination=All&from=now-12h&to=now
[13:28:45] I stopped them in both DCs and restarted them in both DCs with the reduced number of replicas
[13:29:44] The root cause might be somewhere else, but SUP being a major consumer of mw-api, it shows there more than other places
[13:29:51] brouberol: do you know if the incident stopped when you stopped the pipeline?
[13:30:24] it seemed to have helped. Amir and e.ffie (in #-sre) might know more
[13:30:25] yes we run a lot of parses
[13:30:33] there was a positive impact when SUP was stopped, but the discussion in -sre indicates that there might still be an issue. I don't think that things are super clear.
[13:34:36] dcausse prob you've already seen the patch - we reduced parallelism for both consumer-search and consumer-cloudelastic
[13:35:29] gmodena: yes but the rate of cirrus might stay the same, the parallelism increase helped mainly with a bottleneck on the elastic sink
[13:37:30] dcausse I thought that would also remove pressure from mw apis (rendering?)?
[13:38:18] it could if elastic becomes the bottleneck again, but the mw api query rate is controlled by something else
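To illustrate the point that the MW API request rate is bounded by the async HTTP client's capacity rather than by sink parallelism, here is a toy Python/asyncio sketch (purely illustrative; the SUP itself is a Flink application, and none of these names or values come from its code):

```python
# Toy illustration: the number of in-flight requests to mw-api-int is bounded
# by the client's own concurrency cap (the semaphore), no matter how many
# upstream/sink tasks are feeding work into it.
import asyncio

import aiohttp

MAX_IN_FLIGHT = 50  # hypothetical cap, analogous to the async HTTP client capacity


async def fetch(session, semaphore, url):
    async with semaphore:  # at most MAX_IN_FLIGHT concurrent calls
        async with session.get(url) as resp:
            return await resp.json()


async def render_pages(urls):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))


# Raising sink/consumer parallelism adds more producers of `urls`, but the
# request rate still tops out at whatever MAX_IN_FLIGHT (and per-request
# latency) allows.
```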
[13:39:01] reading the backscroll to understand
[13:50:52] seems like it was caused by a mw job, categoryMembershipChange, not the SUP
[13:58:24] dcausse Cirrus was the initial suspect: https://docs.google.com/document/d/1dO3TQVl7vSUV-4YPkEQhAwTUa4mZnDEypOTkTdEqZl0/edit?tab=t.0
[13:58:43] shall we revert the SUP patch once the issue is resolved for good?
[14:05:41] I think we should!
[14:07:37] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127013
[14:09:52] \o
[14:10:12] o/
[14:10:17] dcausse: amir1 was asking if isolating the SUP (mostly the saneitizer) for commons and wikidata would be possible?
[14:12:16] o/
[14:12:24] gehel: the saneitizer is already isolated
[14:16:02] can we disable the saneitizer for just commons / wikidata?
[14:17:05] I'm realizing that I don't know enough about how the various components of the SUP fit together...
[14:18:31] gehel: not sure we have an option to exclude a wiki from the saneitizer, looking, but that would not be ideal I guess?
[14:19:46] I'm not sure this is something we should be acting on. But we know that commons / wikidata are larger and potentially more problematic in lots of ways. In case of emergency, being able to reduce load on some without shutting down everything seems appealing.
[14:22:32] not 100% sure but I suspect that the saneitizer is not the cause of many mw-api-int requests
[14:22:33] oof. conda-analytics, quarto & co really do not play well with sudo
[14:22:44] i'm not entirely sure how though... it could be paused in the saneitizer which will save a tiny bit, but the rest is from the update streams
[14:22:52] yes
[14:22:53] the saneitizer is reporting ~70/s right now, and our upstream req rate is 300
[14:23:15] it was 600 a couple hours ago
[14:23:25] oh, actually that was 15 minutes ago
[14:23:41] it was catching up after being paused I think
[14:23:48] ahh, that makes sense
[14:25:30] maybe some sort of heuristics that can detect "trivial" changes and not run a full render... but that sounds incredibly error prone :P
[14:26:14] yes...
[14:47:43] we could keep the parallelism at 2 as well, seems like eqiad caught up properly; the slowness of the elastic sink might have been related to the slowdown of eqiad a couple weeks ago...
[14:48:59] but I think 2 or 3 makes a very minor difference to the mw-api-int query rate, this is mainly controlled by the capacity of the async http client which is set separately
[14:52:25] dcausse FWIW I looked at the flink UI and there was _some_ backpressure on the fetch/rendering tasks, but you are right. I could not see much difference in query rates after the parallelism change
[14:58:09] * pfischer will be 5 min late for Wednesday meeting
[14:58:22] gmodena: indeed, I see that the sinks were still the bottleneck... so while backfilling, if we went with 3 again we might have unleashed a bit more rps than the ~600 to mw-api-int perhaps
[14:59:37] dcausse that was my hunch (hence the scale down), but with not exactly high confidence :)
[15:01:41] prior to the outage it was doing less than 400rps
[15:02:48] did we already go back to 3 taskmanagers?
[15:03:11] we're perhaps close to a limit somewhere but since the system works ok when the SUP runs at 600rps I'm not really sure what it is
[15:03:17] gehel: no I don't think so
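For reference, the partition salvage mentioned at the top of the log (P74203) boils down to a PySpark pattern along these lines; this is not the paste itself, and the table names, partition columns and filter are assumptions:

```python
# Hypothetical sketch of re-inserting salvaged rows into a partitioned Hive
# table with insertInto; this is NOT P74203, just the general pattern.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("salvage_query_clicks")  # hypothetical app name
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the copy kept in a user space and keep only the partitions to repair.
salvaged = (
    spark.read.table("gmodena.query_clicks_hourly_backup")  # hypothetical table
    .where("year = 2025 AND month = 3 AND day = 5")
)

# With dynamic partition overwrite, insertInto only rewrites the partitions
# present in `salvaged`, so running it a second time should give the same
# result (which matches the "ran the insertInto twice to be sure" check above).
salvaged.write.mode("overwrite").insertInto("discovery.query_clicks_hourly")
```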