[07:08:17] inflatador: I found a small bug where doing a categories data-xfer with the --force flag enabled causes the `/srv/wdqs/data_loaded` flag to get wiped out. Fix here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1198206
[10:04:52] lunch
[12:48:27] with Peter out today, I propose we skip the retrospective.
[12:49:07] +1
[13:18:50] o/
[13:36:09] \o
[14:00:26] i was pondering how terrible an idea it would be to stick the original html in an unindexed field... but the more i ponder the more i think it's a terrible idea :P
[14:02:29] o/
[14:03:24] but i dunno... it's hard to say for sure. Tempted to run some analysis and estimate just how much the _source fields end up expanding
[14:03:24] yes, would be mainly useful for others?
[14:03:33] the idea is it would flow to the dumps
[14:03:57] kinda extending on the idea to improve how `text` is stored, some users really want the html and do it themselves
[14:04:29] on the one hand, returning the html to insert into the search engine would be fairly easy, we already have the html and return the metadata and plain text
[14:04:52] save it as an unindexed field, let it flow through the dumps... seems like a potential easy win on a long-term request for html dumps... but i'm not entirely sure of the implications
[14:05:04] that will then have to be decompressed and deserialized for every search request that returns _source fields
[14:06:48] i suppose the only reason it isn't absolutely terrible is because we already store the bulk content of wikitext + plain text, so the increase is maybe 2x, and not 10x
[14:07:48] but on balance... i dunno, there's risk there and it's not really aligned with what we are "supposed" to do... so probably a dead end
[14:07:48] and this needs special-case handling, you don't want the html output of wikidata for instance
[14:08:08] yea, it would have to be decided per-ContentHandler implementation. I would probably only adjust wikitext
[14:08:21] although i dunno, does wikidata use a custom handler?
[14:08:31] i thought they did to get custom rendering, but not 100% sure how they implement it
[14:08:32] yes they should
[14:09:49] but overall I'm a bit worried that we'd do this mainly for others, but they may not use it
[14:10:12] preferring a future solution with html dumps and/or an html fat-event into hdfs
[14:10:13] yea, that's a possibility too, although there have been requests for html dumps for at least a decade now
[14:25:00] plausibly less risky would be to use the mapping to put the html into a stored_field, and then exclude it from _source, which would avoid the increased IO for normal requests that deserialize _source... but still weird
[14:30:05] meh, would also be weird for reindexing, probably not worth the complexity, again
[16:03:50] hmm, according to wikitech we can have a directory that auths via SSO and can require users to have an NDA. In theory we can publish relforge reports that way?
[16:06:58] (on people.wikimedia.org)
[16:11:26] interesting
[16:12:14] not having any luck though, best i can do is make a directory always return 401 unauthorized :P
[16:23:46] :)
[16:26:56] dcausse: i was going to link you the relforge reports trey and i went over yesterday; i guess alternatively you can scp from `stat1008.eqiad.wmnet:~ebernhardson/relforge/commonswiki*`. Probably not super important to review, but I will move forward with the A/B test of doubling the commonswiki near match weight
[16:27:47] (after the comp suggest A/B test)
[16:32:42] ebernhardson: ah, thanks! will take a look
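A rough sketch of the two Elasticsearch mapping ideas discussed above at 14:04 and 14:25: keeping the html as an unindexed field inside _source, versus storing it as a stored field excluded from _source. The index names, the `source_html` field name, and the client calls below are illustrative assumptions (elasticsearch-py 7.x-style), not the actual CirrusSearch mapping.

```python
from elasticsearch import Elasticsearch

# Hypothetical client, index, and field names; not the CirrusSearch mapping.
es = Elasticsearch("http://localhost:9200")

# Variant 1 (14:04): keep the html in _source but never index it.
# It flows out via _source (and hence to dumps), but _source grows and has to
# be decompressed/deserialized for every request that fetches _source fields.
unindexed_in_source = {
    "mappings": {
        "properties": {
            "source_html": {"type": "text", "index": False},
        }
    }
}

# Variant 2 (14:25): store the html as a stored field and exclude it from
# _source, so ordinary requests that only read _source pay no extra IO;
# callers that want the html request it explicitly via stored_fields.
stored_field_only = {
    "mappings": {
        "_source": {"excludes": ["source_html"]},
        "properties": {
            "source_html": {"type": "text", "index": False, "store": True},
        }
    }
}

es.indices.create(index="demo_unindexed_html", body=unindexed_in_source)
es.indices.create(index="demo_stored_html", body=stored_field_only)

# Fetching the stored field back from variant 2:
es.search(
    index="demo_stored_html",
    body={"query": {"match_all": {}}, "stored_fields": ["source_html"]},
)
```

Either way, downstream consumers would need to read the field from wherever it ends up; as noted at 14:30, variant 2 also complicates reindexing, since reindexing from _source would no longer carry the field.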
[17:10:28] ouch, should probably have transferred a tar, there are about 30k diff files :)
[17:12:09] oh, yea, i didn't think of that. Even though it only links a couple of them, it renders every diff
[17:13:20] restarted with a tar, much faster :)
[17:13:56] done, will look at them tomorrow morning
[17:13:58] heading out
[17:14:04] \o
[19:08:51] * ebernhardson is mildly amused that even the gerrit syntax highlighting is choking on the 4-byte characters
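A minimal sketch of the tar-then-copy step from 17:10–17:13 above: bundling the ~30k small diff files into a single archive so the copy is one transfer rather than thousands. The paths and the remote host below are placeholders, not the ones actually used.

```python
import subprocess
import tarfile
from pathlib import Path

# Placeholder paths and host, for illustration only.
reports_dir = Path.home() / "relforge"
archive = Path("/tmp/relforge_reports.tar.gz")

# Pack the whole report directory into one compressed tarball.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(reports_dir, arcname=reports_dir.name)

# A single scp of the archive instead of ~30k per-file transfers.
subprocess.run(["scp", str(archive), "remote.example.wmnet:/tmp/"], check=True)
```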