[08:40:06] o/
[08:46:08] o/
[10:12:18] that's what I plan to run to salvage the data when we're close to merging gmodena's retention patch: P74203
[10:12:51] tested using a copy of the data in my user space, ran the insertInto twice to be sure, it seemed to have worked fine
[10:13:23] hm stashbot does not like paste
[10:13:25] https://phabricator.wikimedia.org/P74203
[10:23:23] dcausse ack!
[10:23:57] dcausse should I roll back the changes for filtering partitions, and let the training happen on the full dataset?
[10:24:03] so to unblock you
[10:24:30] I'm having some issues with mjolnir CI, that might take me a while to troubleshoot
[10:24:32] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/460011/viewer
[10:24:53] gmodena: if that's ok with you? unless you believe that's an easy enough change to add to mjolnir?
[10:25:01] dcausse the script LGTM!
[10:26:11] re CI, you tried twice already, sigh...
[10:26:20] the change _should_ have been easy but CI is not happy (not sure if related TBH) https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/11/diffs
[10:26:48] the main thing would be refactoring mjolnir to allow a different training window per wiki
[10:27:07] gmodena: let's try to fix CI and get this filtering merged then?
[10:27:19] sounds good!
[10:27:32] we can work on a more granular per-wiki filter after, I suppose
[10:28:28] last successful CI run was 3 months ago with https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/10
[10:28:30] yep. That'd require refactoring mjolnir
[10:28:56] nothing too hard I guess, but large enough to deserve its own phab task IMO
[10:29:03] +A
[10:29:08] 1 even
[10:29:14] :)
[10:29:27] locally mjolnir's main is failing for me
[10:29:36] with a bunch of Exception: Java gateway process exited before sending its port number
[10:29:46] but it could be a macos/java issue
[10:29:56] not sure I ever tried to run the build locally, let me see
[10:30:13] ack
[10:38:45] conda's so slow... wondering if my setup is broken
[10:42:54] still spinning in "Solving environment", 100% cpu usage, 4.2G mem
[10:47:35] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/jobs/460069 is green. There was a flake8 nit that got swallowed in the log output.
[10:48:33] dcausse conda is a pain. the meta for speeding up dep resolution is to change the resolver to conda-libmamba-solver
[10:48:34] https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community
[10:49:01] gmodena: ah thanks!
[10:50:14] i ditched conda for miniforge (FOSS fork); that one bundles the new resolver
[10:51:46] can't even remember what I installed, just remember it was painful :)
[10:57:28] :)
[10:59:21] lunch
[11:01:46] mmm... query_clicks_daily is stuck waiting on deps since march 6 https://airflow-search.wikimedia.org/dags/query_clicks_daily/grid?dag_run_id=scheduled__2025-03-06T00%3A00%3A00%2B00%3A00
[11:01:58] spotted it while testing mjolnir's patch
[11:10:05] we're missing two partitions for 2025-03-05 in query_clicks_hourly
[11:10:11] hours 4 and 5 are not there
[11:11:38] 11 and 12 are also missing
[11:12:24] https://airflow-search.wikimedia.org/dags/query_clicks_hourly/grid?execution_date=2025-03-06+12%3A00%3A00%2B00%3A00&dag_run_id=scheduled__2025-03-06T04%3A00%3A00%2B00%3A00&tab=graph&task_id=wait_for_cirrus_requests
[11:17:27] they failed waiting on wait_for_cirrus_requests partitions that are now available
[11:18:55] I cleared the query_clicks_hourly failed tasks, let's see if they'll get picked up.
[11:21:04] mmm... query_clicks_hourly is instantiated with max_active_runs=1. That might end up with the queued task starving :|
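For context, a minimal, hypothetical Airflow sketch of the pattern being discussed (not the actual query_clicks_hourly DAG; the dag_id, task ids, sensor callable and timeout are assumptions): with max_active_runs=1, a run stuck waiting in a sensor keeps every later scheduled run queued behind it until it is cleared or times out.

```python
# Hypothetical sketch, not the real query_clicks_hourly DAG.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.python import PythonSensor


def _partition_exists(**_):
    # Placeholder for the real check against cirrus_requests partitions.
    return False


with DAG(
    dag_id="query_clicks_hourly_sketch",  # hypothetical id
    schedule_interval="@hourly",
    start_date=pendulum.datetime(2025, 3, 1, tz="UTC"),
    catchup=True,
    max_active_runs=1,  # only one run in flight; later runs queue behind it
) as dag:
    wait_for_cirrus_requests = PythonSensor(
        task_id="wait_for_cirrus_requests",
        python_callable=_partition_exists,
        mode="reschedule",       # free the worker slot between pokes
        timeout=60 * 60 * 24,    # fail after a day instead of waiting forever
    )
    compute = EmptyOperator(task_id="compute_query_clicks")
    wait_for_cirrus_requests >> compute
```

Clearing the failed sensor tasks (as done above) lets the blocked run finish, after which the queued runs can start.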
[12:19:29] dcausse / brouberol : we have an incident that might relate to CirrusSearch. Any chance you could help?
[12:19:54] Can you have a look in _security?
[12:23:58] query_clicks_hourly and query_clicks_daily have caught up
[12:24:31] I can assist and follow, but I'm not knowledgeable enough to take point FWIW
[12:24:42] cc gehel dcausse
[12:49:51] dcausse, gmodena : did we recently change something in the SUP that could explain increased pressure on the databases? Did we increase parallelism?
[12:53:15] gehel we did increase parallelism (2 -> 3 task managers) for consumer-search
[12:53:30] gehel where do we see increased pressure? Which DBs?
[12:54:03] we might be generating more requests to MW Action APIs, and increased writes to ES
[12:54:24] the increased requests to MW Action API might be the issue
[12:56:02] gmodena: was this mainly to keep up with the initial load of article country?
[12:56:18] can we reduce that back 3->2?
[12:56:50] gmodena: additional conversation in the security channel, I'll see if I can find someone to invite you
[12:56:59] AFAIK yes
[12:57:22] if we are done with article country, it should be safe to shrink the flink deployment
[12:57:42] checking consumer lag metrics rn
[12:57:48] gmodena: could you prepare that change?
[12:57:54] gehel yep
[12:58:13] brouberol has paused the SUP and the DB overload has resolved, so there is at least a pretty strong correlation.
[12:59:38] SUP is currently paused, so we're not ingesting anything into the search indices, which is obviously not something we want to keep for too long.
[13:16:40] :/
[13:24:43] dcausse: you missed all the fun!
[13:25:09] trying to understand what happened
[13:25:20] dcausse: but there is still some fun happening in #wikimedia-sre
[13:26:16] I don't think we understand well what has been happening, but the SUP seems to have been a contributing factor in the excessive load on mw-api and the underlying databases
[13:27:15] eqiad or codfw?
[13:27:39] I'm actually unsure
[13:28:07] brouberol might have understood more. Or ask in -sre
[13:28:31] the rate of cirrus jobs remained unchanged for quite a while https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-kubernetes_namespace=cirrus-streaming-updater&var-app=All&var-destination=All&from=now-12h&to=now
[13:28:45] I stopped them in both DCs and restarted them in both DCs with the reduced number of replicas
[13:29:44] The root cause might be somewhere else, but SUP being a major consumer of mw-api, it shows there more than other places
[13:29:51] brouberol: do you know if the incident stopped when you stopped the pipeline?
[13:30:24] it seemed to have helped. Amir and e.ffie (in #-sre) might know more
[13:30:25] yes we run a lot of parses
[13:30:33] there was a positive impact when SUP was stopped, but the discussion in -sre indicates that there might still be an issue. I don't think that things are super clear.
[13:34:36] dcausse prob you've already seen the patch - we reduced parallelism for both consumer-search and consumer-cloudelastic
[13:35:29] gmodena: yes but the rate of cirrus might stay the same, the parallelism increase helped mainly with a bottleneck on the elastic sink
[13:37:30] dcausse I thought that would also remove pressure from mw apis (rendering?)?
[13:38:18] it could if elastic becomes the bottleneck again, but the mw api query rate is controlled by something else
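To illustrate the point that the MW API request rate is bounded by the async HTTP client's capacity rather than by sink parallelism, here is a toy Python/asyncio sketch (purely illustrative; the SUP itself is a Flink application, and none of these names or values come from its code):

```python
# Toy illustration: the number of in-flight requests to mw-api-int is bounded
# by the client's own concurrency cap (the semaphore), no matter how many
# upstream/sink tasks are feeding work into it.
import asyncio

import aiohttp

MAX_IN_FLIGHT = 50  # hypothetical cap, analogous to the async HTTP client capacity


async def fetch(session, semaphore, url):
    async with semaphore:  # at most MAX_IN_FLIGHT concurrent calls
        async with session.get(url) as resp:
            return await resp.json()


async def render_pages(urls):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))


# Raising sink/consumer parallelism adds more producers of `urls`, but the
# request rate still tops out at whatever MAX_IN_FLIGHT (and per-request
# latency) allows.
```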
[13:39:01] reading the backscroll to understand
[13:50:52] seems like it was caused by a mw job, categoryMembershipChange, not the SUP
[13:58:24] dcausse Cirrus was the initial suspect: https://docs.google.com/document/d/1dO3TQVl7vSUV-4YPkEQhAwTUa4mZnDEypOTkTdEqZl0/edit?tab=t.0
[13:58:43] shall we revert the SUP patch once the issue is resolved for good?
[14:05:41] I think we should!
[14:07:37] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127013
[14:09:52] \o
[14:10:12] o/
[14:10:17] dcausse: amir1 was asking if isolating the SUP (mostly the saneitizer) for commons and wikidata would be possible?
[14:12:16] o/
[14:12:24] gehel: the saneitizer is already isolated
[14:16:02] can we disable the saneitizer for just commons / wikidata?
[14:17:05] I'm realizing that I don't know enough about how the various components of the SUP fit together...
[14:18:31] gehel: not sure we have an option to exclude a wiki from the saneitizer, looking, but that would not be ideal I guess?
[14:19:46] I'm not sure this is something we should be acting on. But we know that commons / wikidata are larger and potentially more problematic in lots of ways. In case of emergency, being able to reduce load on some without shutting down everything seems appealing.
[14:22:32] not 100% sure but I suspect that the saneitizer is not the cause of many mw-api-int requests
[14:22:33] oof. conda-analytics, quarto & co really do not play well with sudo
[14:22:44] i'm not entirely sure how though... it could be paused in the saneitizer which will save a tiny bit, but the rest is from the update streams
[14:22:52] yes
[14:22:53] the saneitizer is reporting ~70/s right now, and our upstream req rate is 300
[14:23:15] it was 600 a couple hours ago
[14:23:25] oh, actually that was 15 minutes ago
[14:23:41] it was catching up after being paused I think
[14:23:48] ahh, that makes sense
[14:25:30] maybe some sort of heuristics that can detect "trivial" changes and not run a full render... but that sounds incredibly error prone :P
[14:26:14] yes...
[14:47:43] we could keep the parallelism at 2 as well, seems like eqiad caught up properly; the slowness of the elastic sink might have been related to the slowdown of eqiad a couple weeks ago...
[14:48:59] but I think 2 or 3 makes a very minor difference to the mw-api-int query rate, this is mainly controlled by the capacity of the async http client which is set separately
[14:52:25] dcausse FWIW I looked at the flink UI and there was _some_ backpressure on the fetch/rendering tasks, but you are right. I could not see much difference in query rates after the parallelism change
[14:58:09] * pfischer will be 5 min late for Wednesday meeting
[14:58:22] gmodena: indeed, I see that the sinks were still the bottleneck... so while backfilling, if we went with 3 again we might have unleashed a bit more rps than the ~600 to mw-api-int perhaps
[14:59:37] dcausse that was my hunch (hence the scale down), but with not exactly high confidence :)
[15:01:41] prior to the outage it was doing less than 400rps
[15:02:48] did we already go back to 3 taskmanagers?
[15:03:11] we're perhaps close to a limit somewhere but since the system works ok when the SUP runs at 600rps I'm not really sure what it is
[15:03:17] gehel: no I don't think so
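For reference, the partition salvage mentioned at the top of the log (P74203) boils down to a PySpark pattern along these lines; this is not the paste itself, and the table names, partition columns and filter are assumptions:

```python
# Hypothetical sketch of re-inserting salvaged rows into a partitioned Hive
# table with insertInto; this is NOT P74203, just the general pattern.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("salvage_query_clicks")  # hypothetical app name
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the copy kept in a user space and keep only the partitions to repair.
salvaged = (
    spark.read.table("gmodena.query_clicks_hourly_backup")  # hypothetical table
    .where("year = 2025 AND month = 3 AND day = 5")
)

# With dynamic partition overwrite, insertInto only rewrites the partitions
# present in `salvaged`, so running it a second time should give the same
# result (which matches the "ran the insertInto twice to be sure" check above).
salvaged.write.mode("overwrite").insertInto("discovery.query_clicks_hourly")
```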