[07:53:32] hare: no one that I'm aware of sadly
[08:58:15] Trey314159 gehel I rephrased T385971. Hopefully, now it's less ambiguous.
[08:58:15] T385971: [SPIKE] Can the results of abandoned query analysis be used to improve MLR performance? - https://phabricator.wikimedia.org/T385971
[10:12:29] gmodena: T360536 might be a good last task! Description is minimal, we'll need to chat to give you more context.
[10:12:29] T360536: Increase retention of training data - https://phabricator.wikimedia.org/T360536
[10:45:07] errand+lunch
[10:58:49] gehel ack!
[11:20:44] o/
[11:42:16] o/
[13:09:03] errand+lunch
[14:50:07] o/
[15:00:54] \p
[15:00:56] \o even
[15:01:01] o/
[15:08:05] o/
[15:10:45] Just added the 3 nodes I borrowed for relforge back into production eqiad (elastic1108-1110)
[15:11:05] \o/
[15:12:04] now we gotta figure out why things fell over and how that affects our capacity planning ;(
[15:20:10] one thing we've pondered in the past: spin up a cluster in k8s and move completion suggester there. The theory is that elasticsearch uses https://en.wikipedia.org/wiki/Little's_law to decide if a node is falling behind and it should route queries elsewhere, but that law expects all items to have the same latency
[15:20:28] the problem is, how to prove that separating the 5ms and 500ms requests will improve things :P
[15:20:47] completion suggester would be reasonable there because the indices are created mostly new each day, and don't receive updates
[15:21:03] well, mostly don't receive updates
[15:21:52] i suppose the wider context is it really seems like if one node has a full queue the cluster should be routing queries for those shards to other hosts that do not have a full queue, and then the queue should drain. but it does not
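The Little's law concern in the 15:20 discussion above can be made concrete with a small sketch. Everything below is an illustrative assumption (the request rates, latencies, and the wait formula), not Elasticsearch's actual adaptive replica selection logic; it only shows why a queue-depth-times-mean-latency estimate stops being meaningful once 5ms and 500ms requests share a queue.

```python
# Illustrative numbers only -- this is not Elasticsearch's routing code. The
# point: a Little's-law style estimate (queue depth x observed mean latency)
# assumes requests are roughly uniform, so two nodes with identical queue
# depths look equally loaded even when one holds 100x more work.

COMPLETION_LATENCY_S = 0.005  # hypothetical ~5ms completion suggester request
FULLTEXT_LATENCY_S = 0.500    # hypothetical ~500ms full-text request

def estimated_wait_s(queue_depth: int, observed_mean_latency_s: float) -> float:
    """Wait a router would predict from queue depth and mean latency."""
    return queue_depth * observed_mean_latency_s

def actual_work_s(queued_latencies_s: list[float]) -> float:
    """Work actually sitting in the queue."""
    return sum(queued_latencies_s)

# Both nodes report 20 queued requests; the observed mean latency blends the
# two request classes.
mean_latency_s = (COMPLETION_LATENCY_S + FULLTEXT_LATENCY_S) / 2

node_a = [COMPLETION_LATENCY_S] * 20  # 20 cheap completion requests (~0.1s of work)
node_b = [FULLTEXT_LATENCY_S] * 20    # 20 expensive full-text requests (~10s of work)

for name, queue in (("node_a", node_a), ("node_b", node_b)):
    print(name, estimated_wait_s(len(queue), mean_latency_s), actual_work_s(queue))
# Both nodes get the same ~5s estimate, but node_a drains in ~0.1s while
# node_b takes ~10s -- the argument for isolating the uniform-latency
# completion traffic on its own cluster.
```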
[15:22:23] ebernhardson gehel I was just taking a look at T360536. It points to recommendations made in T235858, but I can't find them. Have those recommendations been shared privately?
[15:22:23] T360536: Increase retention of training data - https://phabricator.wikimedia.org/T360536
[15:22:24] T235858: Increase of training data retention (>90 days) is validated with Legal / Privacy - https://phabricator.wikimedia.org/T235858
[15:23:09] gmodena: hmm, i will have to poke around but i think it's somewhere in the google docs suite
[15:23:20] legal doesn't really work in phab i guess :)
[15:23:40] ebernhardson ack. I'll dig a bit more in google drive
[15:24:54] https://docs.google.com/spreadsheets/d/1bbhcfKJdop5G_qZn-G9hHDgMt9RDRj6wJmnWYU0mxJ0/edit?gid=32059142#gid=32059142 was the initial request, but that doesn't have the summary
[15:26:36] ebernhardson thanks!
[15:27:45] gmodena: forwarded you a pair of email threads that i think has everything else
[15:29:18] Amusingly, I searched for "privacy query clicks" in google docs and nothing showed up. However, since I opened the link, it does appear in the search results. :)
[15:31:56] ebernhardson ah, that's an interesting take, maybe we can toss that around at pairing this afternoon
[15:33:57] re completion only elastic cluster, inflatador could you send the email received from Willy to Erik asking if we could revisit our hardware requirements (doing more with fewer servers)
[15:41:09] gmodena: T385971 looks good!
[15:41:09] T385971: [SPIKE] Can the results of abandoned query analysis be used to improve MLR performance? - https://phabricator.wikimedia.org/T385971
[15:41:11] Thanks!
[15:41:42] Trey314159 terrific! Thanks for checking
[15:44:10] inflatador ACK, will do
[15:55:11] OK, just fwded the email
[15:58:02] thanks!
[16:03:04] at a high level, my assumption is we should be able to scale more vertically. IMO the main constraint is enough memory to keep the hot parts of the index in memory, with enough CPU to service all requests as the second constraint. I know you can get single sticks of 128G memory these days, and servers often have 8 slots or more. Ought to be doable
[16:03:28] ryankemper we're in SRE standup if you wanna join
[16:04:22] i also do wonder sometimes about the memory needs if the disks were fast enough, i'm not sure about the DC space but in consumer drives 5GB/sec is now common
[16:04:45] well, common at the high end at least
[16:06:54] I added kernel PSI metrics to the node exporter dashboard, should be helpful to figure out resource limitations https://grafana.wikimedia.org/goto/LaFwP-pHR?orgId=1
[16:07:10] https://www.kernel.org/doc/html/latest/accounting/psi.html#psi
[16:07:32] I think I worry about cpu (esp. if we have to use single sockets) but if completion moves away more mem could allow us to scale vertically very well
[16:08:21] dcausse: it's now possible to buy up to 192 cores per socket though, although we probably aren't quite spending that kind of money :)
[16:09:21] and that's cores, it's 384 threads. although that's AMD epyc, i suppose we've traditionally run intel everywhere
[16:09:38] that'd be nice :)
[16:09:49] i suppose i dont know how single vs dual core affects memory bandwidth,
[16:10:20] s/dual core/dual socket/
[16:12:12] i'm not sure how to prove out that those things would work though, theory is all good but once we are on the path it's going to be hard to change
[16:12:19] true
[16:13:13] also elastic doesn't like mixed clusters, so if only 25% of the cluster is replaced each year we might see problems up front
[16:13:56] yes, transition to a new hardware layout might be costly, esp. if we have to specialize more per use-case
[16:14:33] dumbest thing possible: run 2 instances of elasticsearch :P
[16:15:04] :)
[16:15:13] it actually works really well
[16:15:33] i'm not sure of the current state, but years ago the suggestion for people that were running logging on big boxes was to run many copies of elastic, instead of trying to make a heap that would effectively use the 1tb of memory
[16:15:51] in mtg but we are talking about opportunities for larger VMs or k8s
[17:13:15] inflatador: around to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124481 ?
[17:16:48] gmodena: looking at the A/B test reports again I found some weird stuff.. Seems that sessions should be at most 10^4.5 seconds long (50 searches x 10 minutes == 500 minutes == 30000 seconds), but some are 10^6 (~11 days) or 10^8 (3.2 years) and some are 10^9 (~32 years—or maybe 10^9.2, which is how long ago Unix epoch time started). Something weird is going on, and maybe we should filter these outliers. (Other recent changes look great!)
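A minimal sketch of the outlier filtering suggested in the 17:16:48 message above, assuming the report data is in a pandas DataFrame with a hypothetical duration_s column (this is not the actual report code). The 30,000 s cap comes from the bound stated there (50 searches x 10 minutes); anything far beyond it, such as a ~10^9 s "session" reaching back to the Unix epoch, is treated as a timestamp bug and dropped.

```python
import pandas as pd

# Upper bound from the discussion: 50 searches x 10 minutes == 30,000 seconds.
MAX_SESSION_SECONDS = 50 * 10 * 60

def filter_session_durations(sessions: pd.DataFrame) -> pd.DataFrame:
    """Drop sessions whose duration exceeds the plausible maximum."""
    plausible = sessions["duration_s"].between(0, MAX_SESSION_SECONDS)
    print(f"dropping {(~plausible).sum()} of {len(sessions)} sessions as outliers")
    return sessions[plausible]

# Example: a session spanning back roughly to the Unix epoch (~1.7e9 s) is removed.
df = pd.DataFrame({"session_id": ["a", "b"], "duration_s": [120.0, 1.7e9]})
clean = filter_session_durations(df)
```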
[17:19:33] I changed some defaults in the SUP and would like to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124483 so that they are explicit and won't cause any problem if someone deploys a new version of it
[17:20:31] looking
[17:21:48] dcausse: seems reasonable
[17:22:18] consumer-cloudelastic is currently running a new release of sup, i stuck some extra logging in there, but sadly it didn't give enough info to know whats going on.
[17:27:40] ebernhardson: thanks
[17:28:47] I quickly skimmed through flink code regarding that source state, did not find anything except perhaps a suspicious if != null on the byte array containing the split state that might prevent it from starting
[17:29:14] gehel: inflatador: (wdqs legacy full) alright the DNS and TLS certs seem to be working, requests can successfully be made to `https://query-legacy-full.wikidata.org/sparql`. The UI at `https://query-legacy-full.wikidata.org/` doesn't seem to be working yet, will have to do some further investigation
[17:29:18] wondering if it's because it runs bounded when inactive and that state becomes empty
[17:29:27] actually might turn on some trace logging for SourceReaderBase, to try and understand whats going on. But it looks like the logging config is provided by flink-kubernetes-operator and not the app
[17:29:39] :/
[17:30:10] hmm, if it ran bounded that is a misconfiguration. The intent was for bounded to only ever be used in the unit tests
[17:30:12] dcausse: I'm about to take the dog out, you don't need an sre to deploy that updated chart right? or do I need to deploy that
[17:30:26] i kinda would like to remove that option from a production build, but not sure how
[17:30:34] ryankemper: no worry I can deploy it thanks!
[17:30:37] ack
[17:30:43] oh, actually it has to exist for backfills
[17:30:47] hmm
[17:31:22] oh I thought that when it's inactive it's running bounded, I probably misread
[17:32:48] the intent is that when it's paused everything still connects to the graph and starts up normally, but inside the individual SplitReader the SaneitizeLoop gets replaced with a NoopLoop. So basically all the flink machinery is unchanged, but nothing happens
[17:33:37] it's actually curious how the state went away...that wasn't supposed to happen which implies i did something wrong putting this together
[17:33:58] ebernhardson: but looking at org.wikimedia.discovery.cirrus.updater.consumer.graph.ConsumerGraphFactory#endSaneitizerAt seems like it runs bounded?
[17:34:25] i spent a little time yesterday seeing if i could inspect the historical states, but the summary i got from reading things is there is no tooling and you will have to dual with rocksdb directly
[17:34:33] s/dual/deal/
[17:35:44] ebernhardson: yes reading the rocksdb state is kind of painful... esp for state/source operators... I barely remember writing something painful to load it and inspect it with the debugger...
[17:36:25] dcausse: as long as "should saneitize" is set it should return null which gives the bounded state. There is another property `paused` on SanitySource that controls the noop loop bits. Maybe thats not wired together right, looking
[17:36:25] ryankemper nice! Sorry I was AFK
[17:36:33] s/bounded/unbounded/
[17:37:06] hmm, indeed that looks suspiciously wrong :S
[17:37:16] ebernhardson: yes but since I disabled it it might have run bounded with Instant endAt = config.clock().get(); and marked the stream as ended?
[17:37:40] inflatador: nw! I talked to j.elto, sounds like we need to deploy the admin_ng part of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122678/5/helmfile.d/admin_ng/values/main.yaml like so: https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service#Deploy_changes_to_helmfile.d/admin_ng
[17:37:49] dcausse: maybe. One thing i have been wondering is if somehow the state thinks the existing split has been completed and doesn't need to start
[17:37:51] and by "we" I mean service-ops actually :P
[17:37:53] I suspect this might have corrupted the state?
[17:38:16] dcausse: could possibly get around that by using a randomized split id instead of the constant "all wikis". afair the split id is arbitrary
[17:38:24] ebernhardson: that'd be my theory but I haven't a proof of that reading the code
[17:39:26] ryankemper ACK, we can look at this at pairing today. I've done it before, should be pretty pain-free
[17:39:40] I wonder if "log.info("Saneitizer is disabled and bounded to {}", endAt);" is actually necessary when disabled?
[17:40:38] i'll try and write some tests around these options, it certainly looks like it's not quite pausing in the way I intended it to. Might try and rethink how that is configured since the intent is for it to not be possible to remove from graph or be bounded in production, except during backfills. Maybe backfills avoid adding it or some such since graph changes aren't important there
[17:41:06] sure
[17:41:35] for disabled and bounded, i think that was explicitly about backfill making sure it shut down on startup because without that the backfill never finished
[17:41:58] oh ok
[17:42:38] for turning it back on ... i suppose instead of a random split id could just put a different constant in there and see if it starts up. Would confirm that the problem is state remembers that split id is complete
[17:43:06] yes
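A toy model of the hypothesis discussed around 17:37-17:42, written as plain Python rather than Flink's actual source/enumerator API: if the restored state records the single constant "all wikis" split id as already finished, the reader is never handed a split again after a restart, so nothing runs; a source started with a different split id would, which is the check proposed above.

```python
# Toy illustration only -- not the SUP or Flink code. It models the guess that
# restored state remembering a finished split id explains why the paused
# saneitizer source never starts again.

import uuid

class ToyEnumeratorState:
    def __init__(self) -> None:
        self.finished_split_ids: set[str] = set()

    def assign(self, split_id: str) -> bool:
        """Hand out the split unless state says it already completed."""
        if split_id in self.finished_split_ids:
            return False  # considered done: nothing is ever scheduled
        return True

# State restored from a checkpoint taken while the source ran bounded and
# "finished" its only split:
state = ToyEnumeratorState()
state.finished_split_ids.add("all wikis")          # the constant split id

print(state.assign("all wikis"))                   # False: never starts again
print(state.assign(f"all wikis-{uuid.uuid4()}"))   # True: a fresh id would start
```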
[18:00:55] * ebernhardson is realizing part of the problem of writing tests is that it's not clear how to stop the application when running unbounded in a test :P
[18:03:20] indeed, we often use bounded streams to make tests easier :/
[18:04:04] trying to figure out now if maybe we can inspect the graph and determine if it's configured as expected without actually starting
[18:04:05] there are a couple test harnesses but unsure that's helpful in your case...
[18:04:24] but probably lots of hackery there too
[18:06:17] there's a SourceOperatorTestHarness but no clue if that helps
[18:06:42] partially, but i suppose i was also hoping to test ConsumerGraphFactory sets it up properly
[18:06:52] using the external params
[18:07:19] hm.. that becomes harder indeed
[18:07:41] heading out, back later tonight
[18:10:08] .o/
[18:35:11] reading the helm config and this...it certainly seems like i was thinking of two different things when writing the SUP config bits and the helmfile bits :( It seems a relatively easy fix is to completely ignore config.shouldSaneitize() in endSaneitizerAt(), to reduce the surprises. Still not clear on a test plan :P
[18:36:44] in helmfile for backfill we explicitly set saneitize-max-runtime, there should be no need to try and guess boundedness based on the saneitize boolean
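A sketch of the fix floated at 18:35 and 18:36, in Python pseudocode rather than the actual ConsumerGraphFactory#endSaneitizerAt: derive the saneitizer's end instant only from an explicitly configured max runtime (the backfill helmfile already sets saneitize-max-runtime) and never from the enabled/disabled boolean, so a disabled saneitizer stays unbounded instead of ending immediately and marking the stream finished.

```python
# Pseudocode sketch of the proposed behavior, not the real Java implementation.

from datetime import datetime, timedelta, timezone
from typing import Optional

def end_saneitizer_at(max_runtime: Optional[timedelta],
                      now: datetime) -> Optional[datetime]:
    """Return the bounded end time, or None to run unbounded."""
    if max_runtime is None:
        # Normal production deploys: no bound, even when sanitization is
        # paused, so the source stays in the graph and keeps its state.
        return None
    # Backfills set an explicit max runtime so the job can finish.
    return now + max_runtime

# Backfill: bounded.   Production (paused or not): unbounded.
print(end_saneitizer_at(timedelta(hours=2), datetime.now(timezone.utc)))
print(end_saneitizer_at(None, datetime.now(timezone.utc)))
```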
[18:37:00] Is everyone OK with leaving Relforge in a mixed state, or would y'all prefer that I put 1003 on opensearch as well? The reason I ask is that the hosts are so old, we have to reimage manually and we're due to replace them very soon anyway
[18:37:05] ref T380752
[18:37:05] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[18:37:34] inflatador: imo it can stay mixed
[18:39:41] ebernhardson ACK, I'll wait for d-causse and ryankemper to weigh in before closing, but sounds like a plan
[19:06:02] lunch, back in ~40
[19:25:07] I think i have a fix...what I don't have is any test that proves it :S
[19:32:17] yup mixed is fine. maybe arguably better
[19:49:23] back
[20:55:15] realizing while attempting to write tests that allow the bounded SanitySource to complete, then run it a second time unbounded, that the long-poll nature of it might be delaying checkpoints
[20:55:33] at least, i'm having trouble convincing it to checkpoint :P
[21:05:24] oh, actually no, the problem is it doesn't take a checkpoint after the source completes, perhaps because the graph is so simple it shuts down prior to the next checkpoint barrier
[21:30:15] having trouble figuring out https://puppet-compiler.wmflabs.org/output/1124501/3152/cloudelastic1007.eqiad.wmnet/change.cloudelastic1007.eqiad.wmnet.err from CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124501/7 . Seems like it should be collecting the data https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/cirrus/server.pp#50 , but something's missing
[21:33:16] * inflatador yearns for ansible's debug module
[21:34:06] looking
[21:37:28] ah, I guess it's cloudelastic specific since it needs an ro port
[21:37:36] I must've missed that in the hiera somewhere?
[21:37:58] inflatador: from what i see so far, indeed line 204 doesn't provide an upstream_port. The plain version above does, but not -ro
[21:38:22] i was trying to poke through the prior elastic bits to verify how it used to work though
[21:38:53] inflatador: oh, the old proxy_params had upstream_port in it, the new one does not
[21:39:18] in profile/manifests/elasticsearch/cirrus.pp:84 proxy_params adds in the upstream/tls/http2 parameters
[21:39:45] in profile/manifests/opensearch/cirrus/server.pp nothing gets merged into proxy_params
[21:42:51] ACK, looks like I can add it to the server.pp...giving it a shot
[21:43:03] my best guess would be to mimic the old cirrus/server.pp, change the existing `$proxy_params = $ssl_provider ? ...` into `$proxy_cert_params = $ssl_provider ? ...`, and then use `$proxy_params = merge($proxy_cert_params, { ... })` to add the appropriate args in
[21:48:45] ebernhardson Good call! It seems PCC is happy now https://puppet-compiler.wmflabs.org/output/1124501/3154/
[21:50:03] \o/
[22:37:33] sigh, changing the split id didn't change anything :(
[23:08:13] hmm, perhaps a bit scary. Apparently the suggested method to change log levels for an app is to run, roughly `kubectl edit cm flink-config-flink-app-consumer-cloudelastic` and edit the deployed log4j config directly
[23:08:24] but not clear how to make that exist during bootup :S
[23:09:16] this only checks for the log4j-console.properties to exist every 300s, so it's not like editing after deploy will be particularly useful
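A rough Python equivalent of the `kubectl edit cm` approach mentioned at 23:08, using the kubernetes client library. The namespace here is an assumption, the logger name is the SourceReaderBase class discussed earlier, and as the last messages note this kind of post-deploy edit doesn't solve getting the trace logging in place at bootup (nor whether the operator leaves the edit alone).

```python
# Sketch only: patch the deployed log4j ConfigMap the same way the suggested
# `kubectl edit cm` would, adding a TRACE logger for SourceReaderBase.

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "cirrus-streaming-updater"  # assumption, adjust to the real namespace
CM_NAME = "flink-config-flink-app-consumer-cloudelastic"  # from the chat above

cm = v1.read_namespaced_config_map(name=CM_NAME, namespace=NAMESPACE)
props = cm.data["log4j-console.properties"]

# Append a TRACE logger in log4j2 properties syntax.
props += (
    "\nlogger.sourcereader.name = org.apache.flink.connector.base.source.reader.SourceReaderBase"
    "\nlogger.sourcereader.level = TRACE\n"
)

v1.patch_namespaced_config_map(
    name=CM_NAME,
    namespace=NAMESPACE,
    body={"data": {"log4j-console.properties": props}},
)
```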