[13:07:54] I think we are ready to reenable EQIAD search endpoints. Probably do that tomorrow, but LMK if y'all have concerns
[13:30:46] \o
[14:28:59] inflatador: seems reasonable, we will need https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1154828 to turn the traffic back on
[14:29:37] also we are still getting the title suggest index too old alerts, but at least in puppet it looks like it should be running for both DC's. will have to poke at it
[14:31:56] curious, looking at kubectl can see three of the pods were OOMKilled
[14:44:59] I'm asking because we had some weird issues with jobManager pods being killed on a different flink application, ref https://wikimedia.slack.com/archives/C055QGPTC69/p1749072189569349?thread_ts=1748973281.430939&cid=C055QGPTC69
[14:46:19] i imagine we just need to increase a number somewhere, but i haven't yet deciphered how exactly this deploys via deployment-charts
[14:52:24] requests are set to a single GB of memory, but still not sure where that is set...
[14:58:45] I posted something on T388538, optimistic clement has good ideas about how to set the limits
[14:58:46] T388538: Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538
[17:30:32] yea numbers from more days, with CI...not better :S significant change from 33.9 to 35.6% of search queries shown a DYM, but clickthrough rates were (24.9,25.4) on one, and (24.8, 25.2) on the other
[17:31:10] feel like i need a different metric though...still not happy with the chosen set
[17:46:20] ryankemper just a heads-up that we need to clean up the CODFW master config or we're gonna have quorum issues like we did last week in EQIAD, ref. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/codfw/elasticsearch/cirrus.yaml#11 . Let's talk about it at pairing today if that works
[17:56:56] sounds good
[17:56:58] dog walk
[18:18:07] thanks
[18:19:08] ryankemper https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138400 is up for review if you have time. Mostly it's just gonna be verifying host membership in host hiera vs cirrussearch hiera, more details in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138400/comments/1d76699e_18e7d75a
[19:13:01] Will take an indepth look later
[19:27:04] cool, less important is the relforge deployment-charts patch we had to rollback, ref https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1147893/3/helmfile.d/services/cirrus-streaming-updater/values-consumer-search-staging.yaml . So it looks like we will have to define the new relforge host IPs like we did for the old ones, OR...
[19:29:53] ...define in external services a la https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/external-services/.fixtures/services.yaml
[19:31:17] and to digress even further, I'm wondering why `flink-codfw` appears to work in this values file: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/rdf-streaming-updater/values-codfw.yaml#21 , yet `flink-codfw` doesn't appear to be defined in the above services file
[19:31:40] probably missing something...anyway, I'm headed out to pick up my son from camp, back in ~1h
[20:34:34] baCK
[21:17:34] https://phabricator.wikimedia.org/T394432