[08:06:19] o/
[08:09:59] o/
[08:28:56] dcausse: Would you mind if I removed the unused airflow dependency (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1595/diffs - merge conflict) as part of (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1700)?
[08:30:14] pfischer: no worries, please feel free to close my PR if you clean this up on yours
[08:58:54] dcausse: done; Somewhat unrelated: I looked into enforcer rules to prevent flink update hiccups but couldn’t find anything ready to use, so here’s an attempt to detect misaligned version expectations: https://gitlab.wikimedia.org/repos/maven/wmf-maven-tool-configs/-/merge_requests/16
[09:36:19] pfischer: thanks for looking into this! will take a closer look this afternoon
[09:36:22] errand+lunch
[10:49:45] Hi folks - there's ongoing discussion on the community wishlist around weighting for non-File pages on Commons ... you know the way that if Page_A is a redirect to Page_B then Page_B and Page_A will have the same data in the search index? Might there be a way to do similar if it’s not a “proper” redirect but a template (in particular https://commons.wikimedia.org/wiki/Template:Category_redirect)?
[12:17:34] cormacparle: it's not that Page_A & Page_B have the same data, it's that Page_A is not in the search index
[12:17:39] cormacparle: Redirects are not represented as full search documents.
[12:18:06] Page_A would become a link (property) of Page_B
[12:19:02] doing the same for Template:Category_redirect seems unlikely as it would mean parsing the page way too early and way too often
[12:21:25] cormacparle: relatedly, do you remember why you made the decision to include everything but NS_FILE (including all Talk namespaces) in "Categories and Pages" in MediaSearch?
[12:22:06] I would consider narrowing the set of namespaces there a bit (at least excluding talk pages by default)
[12:53:19] > Page_A would become a link (property) of Page_B
[12:53:20] Aha ok
[12:53:32] > do you remember why you made the decision to include everything but NS_FILE (including all Talk namespaces) in "Categories and Pages" in MediaSearch?
[12:53:37] no I don't remember tbh
[12:54:21] probably we just included everything that wasn't covered by the other tabs - it's just the "miscellaneous" tab
[12:56:07] narrowing the namespaces seems reasonable, but right now there's no team responsible for Commons, and if we were to introduce exclusion-by-default we'd need a UI element to re-include talk pages, and there's nobody to implement that
[12:57:21] > doing the same for Template:Category_redirect seems unlikely as it would mean parsing the page way too early and way too often
[12:57:28] dcausse: could you explain that to me a bit more?
[12:57:34] cormacparle: I see a namespace filter on the UI
[12:58:23] oh! you're right!
[12:58:29] I had forgotten all about that!
[12:58:36] ok cool, I can definitely look into that
[12:58:48] cormacparle: when Page_A is a redirect to Page_B we simply know this from a table in mariadb; for a soft redirect we might have to parse the page, I think
[12:59:28] hmmm ok
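
To make the redirect handling discussed above ([12:17:34]–[12:58:48]) a bit more concrete, here is a rough sketch of how a hard redirect ends up in the search index. The field layout is an approximation of the CirrusSearch document shape, not copied from the actual mapping, so treat the exact names as assumptions.

    # Sketch (Python): approximate shape of the indexed document for Page_B
    # when Page_A is a hard redirect to it. Page_A gets no document of its
    # own; it only survives as an entry in Page_B's "redirect" field, which
    # full-text queries also match against.
    page_b_doc = {
        "title": "Page_B",
        "namespace": 0,
        "text": "...",
        "redirect": [
            {"namespace": 0, "title": "Page_A"},
        ],
    }

    # A soft redirect such as {{Category redirect}} is ordinary wikitext on
    # Page_A, so Page_A keeps its own document and nothing ties it to Page_B
    # at index time; detecting it would mean parsing the page content, which
    # is the "way too early and way too often" concern mentioned above.
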
[13:56:32] dcausse: did we include Gabriele in the Flink operations tomorrow?
[13:56:51] no :/
[13:58:10] he's subscribed to T404605, will add a comment there about what I plan to do tomorrow morning
[13:58:11] T404605: Plan flink-app recovery process (upcoming wikikube eqiad upgrade) - https://phabricator.wikimedia.org/T404605
[14:08:04] .o/
[14:13:03] \o
[14:16:39] dcausse: I've pinged him on Slack, so at least he knows :(
[14:16:47] sure
[14:16:49] o/
[14:31:07] o/
[14:49:33] dcausse: Just saw https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1191306 - Does that imply that the elimination of ORES from the tag names worked, since the old query no longer matched anything?
[14:50:33] pfischer: yes, it should have been handled by the last cluster reindex, which has copied the ores tags to their new names
[14:51:10] dcausse: great, thank you!
[15:00:19] probably totally unnecessary, but can also verify that kind of thing with a query in superset sqllab
[15:02:13] yes... I wanted to do that but got distracted :/
[15:04:23] perhaps randomly curious, mined hard negatives by result position in the relaxed query, matches about what i would expect: https://phabricator.wikimedia.org/P83487
[15:04:43] not really going to keep 20, just started with 20 to get decent numbers
[15:05:22] from these maybe top 5 will be plausible, next up is to collect feature vectors for all of these and work out unioning the datasets and running the normal pipeline
[15:07:54] ebernhardson: not sure I understand what "result position" is here?
[15:09:07] result position of what we presented to users?
[15:10:16] dcausse: i ran the normal mjolnir es_hits query (very similar to prod bag_of_words) but with the relaxed filter options, then anti-joined back to the set of queries we know, to get queries that are only returned by the relaxed query; this is the position those results were found at in the relaxed query
[15:10:40] makes sense: at position 1 we only get 33k, because most of those were found in the existing dataset, but as we go down the list we get more and more new results
[15:11:20] err, not just queries but (wikiid, query, page_id) tuples
[15:14:22] not sure how to know how many hard negatives we need... might need to experiment with a few runs. The source data for enwiki is something like 1M queries and 27M hits
[15:16:11] when you say anti-join, this is vs the hits from backend logs?
[15:16:54] i'm starting from the collected feature vectors table, to ensure i'm only collecting queries we will be able to union into the result dataset (enwiki is downsampled, so not all queries in the raw dataset are needed)
[15:17:30] so basically we collect (wiki, query, page_id) from the feature vectors table, run them all through the relaxed query, then anti-join that initial table to filter any page_ids we already know
[15:18:08] (and extra fun, yesterday i forgot to .dropDuplicates() on that first step so had 20+ copies of every query... it took a lot longer that way :P)
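
For readers following along, a minimal sketch of what the "relaxed" retrieval described above might look like, assuming the relaxation is simply a lower minimum_should_match on the main text match. The "all" field name follows CirrusSearch conventions, but the exact field, percentages and sizes here are assumptions; the real mjolnir es_hits / bag_of_words query carries more fields, weights and filters.

    # Hypothetical Elasticsearch query bodies, expressed as Python dicts.
    query_text = "some user query"

    # Standard retrieval: every query term has to match.
    strict_query = {
        "query": {"match": {"all": {"query": query_text, "minimum_should_match": "100%"}}},
        "size": 20,
    }

    # Relaxed retrieval: only part of the terms have to match, so documents the
    # strict query would never return (the candidate hard negatives) can show up,
    # typically at lower positions.
    relaxed_query = {
        "query": {"match": {"all": {"query": query_text, "minimum_should_match": "50%"}}},
        "size": 20,
    }
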
[15:20:19] you grouped by (wiki, query) before shipping the relaxed query?
[15:20:43] basically, it's .select('wikiid', 'query').dropDuplicates(), and then pass into mjolnir.es_hits.transform_from_elasticsearch
[15:20:52] ack
[15:21:06] (but the mjolnir code wasn't flexible enough, so i've copied several mjolnir modules directly into the notebook for now)
[15:21:21] will work that out though once we know what we are doing
[15:22:11] ok, so a hit of the say 33k is one hit that's at pos 1 from the relaxed query but not seen anywhere else for that query in the feature vectors?
[15:22:43] yea, that's a brand new hit that we don't have in the feature vectors dataset, that we can then assign a low grade and collect feature vectors for
[15:23:03] i'm also intending to review the feature set, i'm not sure if we have any good signals for "full match", but maybe we just happen to have them in the giant pile of random stats we collect
[15:23:31] i guess what i mean is it needs a feature that says "100% title match" vs "60% title match", and not just a summed number. Maybe a title match that is always 0 unless it matches everything
[15:24:31] true, it has unique token count and the number of matches iirc, but might be easier if it has a single feature tracking this?
[15:24:58] i think it has to, because it doesn't have the ability to compare two features: a split has to be feat_x > value_x, it can't be feat_x > feat_y
[15:25:07] although we could add a derived feature i guess
[15:26:27] but I'm not sure that the table you show would be what I would have expected
[15:26:36] how so?
[15:27:02] means to me that we often agree with the relaxed query
[15:27:35] hmm, yea, there were just under 1M source queries, so that is significant overlap if the last result is 650k matches
[15:27:47] the explanation I could see is that we have a lot more elements at higher positions because of recall
[15:28:13] i suppose one thing this is not doing is running the ltr that we use in prod, maybe it should?
[15:28:28] that would give it the things the model actually thinks are good, instead of what the basic field weighting does
[15:28:33] and the feature vectors contain a lot more pos1-pos5 because recall is worse
[15:29:53] if you exported the same group-by-position on the feature vectors (not sure you retain the original position tho) you might get the inverse of this?
[15:30:14] hmm, actually i might have mucked up the adjustment to the query filter, i'm running it again to see (will take 10-15min), but that might have been the standard retrieval query
[15:31:22] sec, will see what the inverse looks like
[15:31:52] meh, i can't just change 'left_anti' to 'right_anti', have to flip the whole thing :P
[15:34:35] hm, actually we don't have hit positions in the feature vectors table anymore
[15:34:51] so hard to say where they would have been
[15:38:13] np!
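
Putting the steps above together, a minimal PySpark sketch of the mining flow as described: distinct (wikiid, query) pairs from the feature-vectors table, relaxed retrieval via mjolnir, then a left_anti join to keep only unseen (wikiid, query, page_id) tuples. The feature_vectors DataFrame and the call to transform_from_elasticsearch are schematic placeholders; the real function takes more configuration than shown here.

    # Assumes an existing Spark session and a `feature_vectors` DataFrame with
    # at least wikiid, query and page_id columns.
    from mjolnir.es_hits import transform_from_elasticsearch  # copied into the notebook in practice

    # 1. Distinct (wikiid, query) pairs we already collected feature vectors for.
    #    Forgetting dropDuplicates() here means shipping ~20 copies of every query.
    queries = feature_vectors.select("wikiid", "query").dropDuplicates()

    # 2. Run them through the relaxed retrieval query; assumed to return rows of
    #    (wikiid, query, page_id, hit_position). Arguments are elided here.
    relaxed_hits = transform_from_elasticsearch(queries)

    # 3. Anti-join against the tuples we already know, keeping only hits the
    #    existing dataset has never seen for that query.
    known = feature_vectors.select("wikiid", "query", "page_id").dropDuplicates()
    new_negatives = relaxed_hits.join(known, on=["wikiid", "query", "page_id"], how="left_anti")

    # 4. Per-position counts like the pastes above are then just:
    new_negatives.groupBy("hit_position").count().orderBy("hit_position").show()

    # These rows get a low relevance grade, go back through feature collection,
    # and are unioned into the training dataset.
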
[15:39:49] I would not be too worried about running exactly the same thing we run in prod for mining negatives
[15:40:21] and it might make more sense to mine what the model will actually see in the rescore window, as opposed to what users might see after the model is applied
[15:40:47] re-ran with what should be the proper retrieval query (now with minimum_should_match correctly applied); it looks different but very correlated: https://phabricator.wikimedia.org/P83492
[15:46:36] the likelihood of a top-1 relaxed result being in the original dataset is higher, but somehow I would not have expected the position to matter that much
[15:55:33] it seems plausible to me: matches with more terms will have higher numbers, i think part of the problem with prod ltr is that xgboost is non-linear, so more terms matching doesn't mean higher scores
[16:10:37] workout, back in ~40
[16:25:56] :S jupyter on stat1008 just disappeared, can't connect to the http port
[16:31:18] at least it came back on its own eventually
[16:43:18] * ebernhardson sighs at conda and creates yet another new env... last one lasted almost 3 months
[16:59:48] back
[17:13:54] I guess we no longer have ores weighted tags, but we should not have these __DELETE_GROUPING__ in there (https://phabricator.wikimedia.org/P83504) :/
[17:14:29] :S
[17:14:48] not sure what happened... could be due to early deployment of this feature?
[17:15:16] i hope so? Otherwise i also have no clue, that shouldn't be possible
[17:15:42] we could resend "classification.ores.articletopic/__DELETE_GROUPING__" on those pages, this should clean it up
[17:15:58] will check tomorrow if I see this on other (newer) weighted tags
[17:16:04] dinner
[17:16:11] sounds good
[17:16:57] i'm just fighting conda here... it really doesn't like trying to inject mjolnir into the conda-analytics env... we don't agree on versions, and then conda-pack refuses to package the env because the mjolnir install changed versions of things :(
[17:17:15] maybe i'll just give in and copy the rest of mjolnir into the notebook...
[17:41:24] finally got it working, but with stupid hacks. Installed mjolnir with --no-deps and then manually installed a few things with `conda install ...`
[17:55:04] hmm, the msearch daemon is being way too coy with the eqiad cluster... it's assuming eqiad has traffic and is only using ~50 threads, when it could be using 600
[17:55:31] but i don't think we put anything in the msearch daemon that knows a cluster is idle, since that only happens twice a year for a week at a time
[18:02:55] lunch, back in ~40
[18:04:11] hmm, actually reading the msearch daemon... i guess i completely removed the load monitor and we always have it work at a slow pace
[18:04:42] i guess i just forget too much, wrote a 2022 patch: msearch_daemon: Remove cluster selection/load monitor
[18:12:54] paused puppet on search-loader1002 and changed the msearch daemon args. Will try and remember to put it back
[18:50:19] back
[19:38:00] re-enabled puppet, ran the agent, it looks to have put everything back how it was (i think).
[19:38:37] one random curiosity: the puppet agent refreshed the daemon service after seeing it had to update the systemd files, but it didn't actually restart the @0 and @1 instances
[19:45:20] interesting. I wonder if that is due to a home-grown systemd module? Random guess
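
Going back to the feature-engineering thread from earlier this afternoon ([15:23:31]–[15:25:07], [15:55:33]): because a tree split can only compare a feature against a constant (feat_x > value_x, never feat_x > feat_y), a "matched every query term in the title" signal has to be precomputed as its own feature. A hedged sketch, assuming we already collect a per-query unique term count and a per-title matched term count; the names below are made up for illustration.

    def full_title_match(title_matched_terms: int, query_unique_terms: int) -> float:
        """Derived feature: 1.0 only when every unique query term matched the
        title, 0.0 otherwise. Unlike a summed or partial match score, a split on
        this feature > 0.5 cleanly separates full from partial title matches."""
        if query_unique_terms <= 0:
            return 0.0
        return 1.0 if title_matched_terms >= query_unique_terms else 0.0

    # Example: a 3-term query where only 2 terms matched the title.
    assert full_title_match(2, 3) == 0.0
    assert full_title_match(3, 3) == 1.0
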
[19:48:49] i'm thinking it somehow didn't propagate from the parent to the @n instances, but not sure why
[19:48:59] (but i also haven't read the relevant puppet code)
[19:54:43] unrelated, but I'm getting a Puppet failure on a newly-reimaged wdqs-scholarly host. Looks like scap is unhappy... I thought we fixed this already ;(
[19:55:21] ryankemper ^^ do you remember if we need a scap deploy on new wdqs hosts?
[19:59:07] `scap deploy -l wdqs2016.codfw.wmnet` throws the error `No targets selected, check limits and dsh_targets`. `scap.cfg` says `dsh_targets: wdqs`. I vaguely remember this needs to be set in Puppet somewhere
[20:04:55] ah yes, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/scap/dsh.yaml
[20:08:15] inflatador_: yup, gotta manually add them
[20:13:38] sigh... installing `discolytics` can now sometimes install some random discord thing :S
[20:13:48] inflatador: not around comp but you have my verbal +1 on the patch
[20:14:19] ryankemper THX, will merge if/when puppet is happy
[21:06:42] ryankemper we're in pairing if you wanna join
[21:07:17] inflatador: ack, still in bed for the time being, won’t make it
[21:08:20] ryankemper ACK, get well soon!
[21:16:29] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1169210