[09:59:08] <dcausse> lunch
[12:46:06] <ebernhardson> \o
[12:58:13] <dcausse> o/
[13:18:10] <inflatador> <o/
[13:19:11] <dcausse> o/
[13:19:18] <ebernhardson> dcausse: random fun thing i learned...there are 490M unique statement_keyword values in wikidata
[13:19:30] <ebernhardson> was considering if the 5M commons urls really made a difference or not :P
[13:19:53] <ebernhardson> if anything, i suppose i'm surprised nothing broke
[13:22:01] <dcausse> ebernhardson: yes... I'm not sure that these 490M unique tokens are particularly useful, which is why I'm not a big fan of adding another set of 5M entries
[13:22:21] <dcausse> not super convinced that these are useful for search
[13:22:36] <ebernhardson> yea, i suspect very few have ever been searched for
[13:23:16] <dcausse> what I see quite often is something like haswbstatement:P31=Q123
[13:23:22] <dcausse> which makes sense
[13:24:23] <ebernhardson> yea, that seems like the primary use case
[13:27:34] <inflatador> welcome back dcausse !
[13:28:34] <dcausse> thx!
[13:31:42] <dcausse> when looking at https://phabricator.wikimedia.org/P67741 I'm not sure it makes sense to keep them as-is
[13:32:32] <dcausse> one thing is that if you allow a property to get indexed it gets copied to the all field, which I think makes sense for some of the textual entries there
[13:34:05] <ebernhardson> hmm, i hadn't considered the all field
[13:34:10] <dcausse> but not convinced that keeping the exact tokens for haswbstatement
[13:34:17] <dcausse> is worthwhile
[13:38:53] <ebernhardson> not sure how we'd pull back on those either, though; i'm going to guess those mostly amount to the 'string' and 'external-id' types
[13:39:36] <ebernhardson> i vaguely remember the external-id one, it was solving a problem where you couldn't search for a thing by a known id
[13:43:18] <dcausse> yes, we'd have to dig into old phab tickets, but I vaguely remember this seemed useful when you copy/paste a random id into the search bar
[13:44:06] <dcausse> we could scan query logs to see how often haswbstatement is used with something that does not look like a Qid
[13:44:25] <ebernhardson> yea, might be worthwhile just to understand how things are being used
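[editor's note: a minimal sketch of the kind of scan dcausse suggests at 13:44, assuming the query strings have already been exported one per line to a plain text file; the file format and the handling of quoting/OR syntax are rough assumptions, not how the real query logs are stored]

```python
#!/usr/bin/env python3
"""Rough sketch: count how often haswbstatement: is used with a value that
does not look like a P###=Q### pair (i.e. string / external-id style values).
Assumes a plain text file with one search query per line."""
import re
import sys
from collections import Counter

# matches e.g. haswbstatement:P31=Q123 or haswbstatement:"P2014=id with spaces"
CLAUSE = re.compile(r'haswbstatement:("[^"]+"|\S+)', re.IGNORECASE)
# a value that points at an item: property = Q-id
QID_VALUE = re.compile(r'^P\d+=Q\d+$', re.IGNORECASE)

counts = Counter()
with open(sys.argv[1], encoding='utf-8') as queries:
    for query in queries:
        for clause in CLAUSE.findall(query):
            # haswbstatement supports OR-ing alternatives with |
            for value in clause.strip('"').split('|'):
                counts['qid' if QID_VALUE.match(value) else 'non-qid'] += 1

print(counts.most_common())
```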
[15:13:44] <ebernhardson> i wonder if we need to shuffle shards around on cloudelastic, it keeps alerting for high gc. Or maybe just roll a restart through it
[15:14:17] <inflatador> ebernhardson yeah, I've been restarting the smaller ones here and there, doesn't help much. We probably need to detune that alert a bit
[15:14:45] <ebernhardson> inflatador: hmm, better would be to fix the gc problems :P We could potentially give it more memory
[15:16:54] <inflatador> ebernhardson ACK ;) , if you wanna make a puppet patch to give more mem LMK, otherwise I can look once I get out of mtg
[15:17:03] <ebernhardson> or maybe it is being a bit quick, the dashboards claim we only peak at ~10 gc's/hr
[15:20:02] <ebernhardson> oh, actually it is being bad. 1006 is doing 400/hr in the old pool. A well-behaved instance does <1
[15:24:29] <inflatador> Ò_Ó
[15:25:16] <dcausse> :/
[15:26:05] <ebernhardson> it's how it works when it runs out of memory, it keeps running the gc to try and free some, but doesn't get any back. Looking over some stats, 1006 is more frequent; it does have much more disk used (probably means larger indices) than other instances...but balancing is tedious :P Probably give it another 2G of memory (10G->12G)
[15:30:28] <ebernhardson> oh i suppose disk isn't a good proxy, since these are the smaller clusters but all 3 clusters run on each machine. Anyways, puppet patch is up to increase memory 2g (turns out it's 12g->14g)
[15:30:40] <inflatador> looks like 1005/1006 are older hosts FWiW
[15:31:20] <inflatador> same amount of RAM tho
[15:31:35] <ebernhardson> i wouldn't expect being an older host to affect GC though, it might be able to run the GC faster but it would get the same results. 400 old gc/hr vs <1 old gc/hr is memory pressure in the heap itself
[15:31:51] <ebernhardson> something taking up more memory, could potentially dig into it but historically that takes a long time
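[editor's note: one way to double-check the per-node old-gen GC rate discussed above is the Elasticsearch node stats API; a minimal sketch follows, where the endpoint URL and sampling interval are placeholders rather than the real cloudelastic setup]

```python
#!/usr/bin/env python3
"""Rough sketch: sample old-generation GC counts from the Elasticsearch
node stats API to see which node is collecting excessively."""
import time
import requests

ES = 'http://localhost:9200'  # placeholder, not the real cloudelastic endpoint
INTERVAL = 300                # seconds between the two samples

def old_gc_counts():
    stats = requests.get(f'{ES}/_nodes/stats/jvm').json()
    return {n['name']: n['jvm']['gc']['collectors']['old']['collection_count']
            for n in stats['nodes'].values()}

before = old_gc_counts()
time.sleep(INTERVAL)
after = old_gc_counts()
for name in sorted(after):
    delta = after[name] - before.get(name, after[name])
    # ~400 old-gen collections/hr (as seen on 1006) means near-constant full
    # GCs; a healthy node should be close to zero over a window like this.
    print(f'{name}: {delta} old-gen collections in {INTERVAL}s')
```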
[15:33:10] <inflatador> ebernhardson no worries, will add my +1 shortly
[15:37:47] <inflatador> OK, change merged/puppet-merged/applied in puppet. Will roll restart cloudelastic shortly
[16:00:21] <inflatador> gehel dcausse just depooled CODFW completely from `wdqs-main`. Is it OK for me to start a data transfer from 2022->2021 now?
[16:00:29] <inflatador> ref T373791
[16:00:30] <stashbot> T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791
[16:01:20] <dcausse> inflatador: yes I think so
[16:02:46] <inflatador> dcausse ACK, will get that started soon
[16:03:20] <inflatador> ebernhardson I'm restarting cloudelastic to apply the new cluster settings now
[16:05:25] <ebernhardson> inflatador: thanks!
[16:32:16] <inflatador> wdqs-main outage just p-aged all SREs ;( fixing that now
[16:32:29] <dcausse> oops
[16:33:27] <dcausse> heading out for dinner, back later tonight
[16:35:07] <inflatador> so I'm turning off p-aging for wdqs-main/wdqs-scholarly. Should I turn it off for the old wdqs services as well?
[16:40:43] <gehel> We definitely don't want anyone to be woken up by WDQS going down. Our SLO is low enough
[16:40:59] <inflatador> gehel ACK, will update https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070301
[16:47:11] <inflatador> OK, that's merged...should no longer p-age for any wdqs endpoints
[16:57:12] <ebernhardson> what an odd error...my local wikibasecirrussearch tests fail because it expected "Arabic" and "Hebrew", but my test rendered "العربية" and "עברית"
[16:57:22] <ebernhardson> must be some config flag somewhere...
[17:00:22] <inflatador> cloudelastic heap settings are applied
[17:00:39] <ebernhardson> inflatador: excellent! will probably take a few days to find out if it fixed anything
[17:00:47] <ebernhardson> usually the heap takes some time to fill up
[17:03:08] <inflatador> dcausse the wdqs-main data xfer is done, LMK if it looks OK, we can repool CODFW at that point
[17:46:35] <inflatador> should we be running categories on the graph split hosts now?
[17:47:30] <ebernhardson> inflatador: hmm, i suppose at some point? I could be mistaken but i think cirrus is the only consumer of the categories
[17:48:38] <inflatador> just wondering as I believe we disabled categories when we started doing the tests
[17:49:47] <inflatador> I see a `categories.jnl` on `wdqs2021`...hmm
[17:53:11] <inflatador> ref https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DCategories%20update%20lag
[17:53:15] <inflatador> heading to lunch, back in ~40
[18:16:30] <inflatador> back
[20:31:12] <inflatador> re: pairing session, systemd docs on user resource control: https://www.freedesktop.org/software/systemd/man/latest/user@.service.html
[23:24:38] <dcausse> inflatador: thanks! there's still something wrong with it... it's consuming from the wrong topic so better to keep it depooled for now
[23:24:44] <dcausse> will take a closer look tomorrow
[23:38:01] <inflatador> dcausse np
[23:41:30] <inflatador> we can reimage tomorrow if you like
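[editor's note: on the "consuming from the wrong topic" point at 23:24, a small sketch of one way to check which topics a consumer group is actually committing offsets for, using kafka-python; the broker address and group id below are made-up placeholders, not the real wdqs updater configuration]

```python
#!/usr/bin/env python3
"""Rough sketch: list the topics a consumer group has committed offsets for,
to confirm an updater is reading the topic we expect."""
from kafka import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers=['kafka.example.org:9092'])  # placeholder broker
offsets = admin.list_consumer_group_offsets('wdqs_main_updater')        # hypothetical group id
for tp in sorted(offsets, key=lambda tp: (tp.topic, tp.partition)):
    print(f'{tp.topic}[{tp.partition}] -> {offsets[tp].offset}')
```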