[13:39:54] Hi, I'm wondering why my SQL query here does not keep out bots from the result list: https://dpaste.de/Esmt
[13:40:14] Ideas?
[13:43:13] Niharika: simple, not all bots have a bot flag. https://meta.wikimedia.org/wiki/Special:CentralAuth/FuzzyBot
[13:44:39] Nemo_bis: Ah. Any suggestions on improving this https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Editors_eligible_for_Autopatrol_privilege ?
[13:46:07] No, I have no idea what en.wiki thinks about eligibility to autopatrol.
[13:46:55] Eligibility is simple enough. I could filter through names with 'bot' in it...
[14:35:26] o/
[14:37:24] halfak, hypothesis?
[14:38:11] If we're seeing the first two activity peaks in pageviews post-search that we see with pageviews overall, and a similar initial breakpoint between those curves, we can just work under the belief that post-search behaviour closely replicates non-post-search behaviour and use the 30/60 minute timeouts
[14:38:18] (just spitballing here)
[14:41:49] Ironholds, I think so. I think that even the 30 minute timeout is due to overfitting the data.
[14:41:59] I'm not sure there's a good way to verify, but it makes sense to me.
[14:42:22] halfak, I mean, we can quasi-verify in the sense that if we find a disconnect between the outcome of implicit and explicit measures one of the hypotheses for why is "our metric was wack"
[14:42:43] (obviously the other hypothesis is "this is just really not a good way of quantifying pageviews")
[14:42:49] *quantifying satisfaction
[14:43:52] Indeed. We could measure satisfaction more directly too.
[14:43:58] da!
[14:44:02] Any chance you guys can get on Abbey's schedule?
[14:44:07] that's the plan!
[14:44:11] cool.
[14:44:14] and I was thinking that actually makes detecting a breakpoint moot
[14:44:28] insofar as; we do explicit and implicit testing of the same population, using a survey and an EL schema respectively
[14:44:29] I'd like to correlate the behavior with qualitative evidence.
[14:44:37] yeah
[14:44:38] and we just see what threshold matches most strongly the user answers
[14:44:39] That
[14:44:45] and if that match is any good ;
[14:44:46] Well.. I'm not sure that
[14:44:46] *;p
[14:45:02] halfak, sorry, ambiguity; not session length threshold but "time on page before maybe they've found what they wanted"
[14:45:04] I mean, a bad match of threshold and outcome doesn't suggest that the threshold is bad.
[14:45:09] Oh yes
[14:45:09] :)
[14:45:11] That
[14:45:13] we need more words
[14:45:50] (solution; publish, make up as many words as you want. After this :D)
[14:48:38] Agreed on the need for more jargon
[14:49:36] halfak, propose "minimum dwell-time"
[14:49:56] I like Morning Oliver's brain, it actually has ideas and hypotheses! And how are you this fine rainy Friday?
[14:50:17] Hi
[14:51:13] a Morning YuviPanda and his brain!
[14:51:16] * Ironholds waves
[14:51:39] * YuviPanda particles
[14:52:39] can report glorious sunshine on the West coast
[14:52:45] * Nettrom sips some more tea
[14:53:09] YuviPanda, but now that you've told me I don't know how fast! :(
[14:53:20] Nettrom, it's funny because I like rain ;p
[14:53:25] Ironholds, I'm great. I tested out the camping hammock for cold tolerance last night.
[14:53:31] Got down to 40F.
[14:53:32] oooh
[14:53:34] Isn't it enough that you know what I really am
[14:53:36] woah!
[14:53:48] Nettrom, 0/
[14:54:01] YuviPanda, I mean, our friendship has never been limited by such things as the inherent impossibility of knowing both substance and velocity
[14:54:04] and I don't intend to start now
[14:54:05] Ironholds: I don't mind rain either, so I think I'll do well here
[14:54:07] Nettrom, You're not supposed to be awake yet.
[14:54:10] halfak: o/
[14:54:10] Screw physics, this is friends!
[14:54:16] West coasters don't wake up until 10AM
[14:54:22] * YuviPanda plays music
[14:54:31] halfak: hahaha, still somewhat on Central time
[14:54:33] halfak: I'm waking up early from jetlag these days
[14:54:48] * YuviPanda is in European time?
[14:55:03] When did you fly YuviPanda ?
[14:55:21] Wednesday
[14:55:25] Your brain moves at two timezones per day
[14:55:28] At 6am
[14:55:30] Heh
[14:55:36] So I think you're almost to the east coast.
[14:55:42] But I have so far managed to sleep 12 to 8
[14:55:45] Both days
[14:55:50] I've been here
[14:55:52] Nice!
[14:55:52] YuviPanda, see, my trick used to be to sleep on the plane
[14:55:57] UK-SF it adjusts nicely
[14:56:01] ...for about a week and then you crash
[14:56:03] But I usually sleep 3 to 11
[14:56:07] Ah
[14:56:08] I see
[14:56:37] Ironholds: I did that too. Slept on the plane for a full 9h
[14:56:48] Wow
[14:56:54] I'm envious.
[14:57:02] I always struggle to sleep.
[14:57:05] halfak, oh, I can't do it any more without heavy medication
[14:57:10] got scared of flying
[14:57:25] Well.. I take that back. I did get a great sleep in on one trip at the beginning of the summer.
[14:57:32] I slept through the entire 8 hours flight to AMS
[14:57:35] halfak: yeah i have always been a 'close eyes and fall asleep' guy
[14:57:37] Everywhere
[14:57:47] It was glorious to wake up (stiff neck) and be done with this flight.
[14:57:54] Years of sleeping through mandatory class
[14:57:58] ha
[14:57:58] halfak, my favourite was my last SF-UK flight
[14:58:02] Yeah I always do that
[14:58:10] I took two pills, sat down in my seat...
[14:58:12] I don't even remember the takeoff
[14:58:14] ...and I was on the tarmac in London
[14:58:31] * halfak runs off to meeting
[14:58:44] When they invent brain manipulatey things I'm going to write code that lets me just not form memories when flying
[14:58:45] BTW, YuviPanda ORES is still looking good since de-pool of -02
[14:59:05] Yeah I've made it out of bed now
[15:55:07] good morning Ironholds lzia guillom halfak Nettrom et al.! :D how are y'all feeling?
[15:55:38] hello bearloga :) A bit tired, and happy it's Friday.
[15:57:09] bearloga, pretty good but nobody got me an americaversary present!
[15:57:11] hmph!
[15:58:52] Americaversary?
[15:59:26] adventures in production-land with halfa.k
[15:59:47] we've so far found problems with linux kernel settings, redis settings and a timeout
[15:59:49] How long have you been in the US, Ironholds?
[15:59:50] this is fun :D
[16:00:00] (do join #wikimedia-ai if you have not already)
[16:01:56] guillom, two years, three days
[16:02:29] Ironholds: so when you moved to boston, how did your h1b work out? did you have to make any changes?
[16:03:07] my visa worked out okay but it was a special case; I wouldn't try replicating it
[16:05:05] YuviPanda: Best bet is to start the green card process as early as possible once you're in the US with your H1B.
[16:05:45] hmm
[16:05:59] but see I'm still at a point where I'm cringing at any forms of 'commitment'
[16:06:06] and the green card process is a fair bit of commitment
[16:07:51] * guillom is at a point where he wants to get rid of that sword of Damocles called "you must leave the US immediately if you're let go".
[16:08:00] heh
[16:08:25] I guess I'm trying to not get rid of the 'you must leave immediately!' 'oh that is alright then I have only one bag to pack'
[16:08:27] but....
[16:12:11] YuviPanda, when you're renting a place in SF "no commitment" is gone
[16:12:20] unless you can also find a subletter your landlord is okay with in about 12 hours ;p
[16:12:30] I know :'(
[16:12:31] :'(
[16:12:45] * guillom remembers the Wiki House of Love.
[16:16:15] the what?
[16:19:23] Sue's big idea of having a Wikimedia-run long-term hotel-like place for Wikimedians (volunteers and staff) visiting or temporarily-in-residence in the office.
[16:32:33] guillom: https://blog.archive.org/2015/03/17/open-source-housing-for-good/
[16:32:41] Brewster Kahle has bigger ideas. ;)
[16:33:10] It doesn't (yet) feature extraterritoriality as regards immigration rules though.
[16:34:04] Nemo_bis: Interesting! Thanks for sharing.
[16:35:46] "10% of all employees in the US working in the non-profit sector"
[16:35:47] huh
[16:35:57] That's way higher than I expected.
[16:36:26] Comment: "Rather than go through all that just to find affordable housing, move the company to someplace where the housing is affordable like Rochester NY. Beautiful place to love, plenty of nature to enjoy, close to NYC, Toronto, thriving wine region, thriving arts community, a place where you can grow."
[16:36:29] +1.
[16:36:46] But no building shaped like the org's logo!
[16:38:15] heh
[16:41:36] Instead, it should be shaped like the USS Enterprise: https://www.youtube.com/watch?v=cxXTTcFzu98
[20:18:56] J-Mo: halfak if you guys can pick an 'interesting' dataset from http://catalog.data.gov/dataset I'd like to try loading it into quarr
[20:18:57] y
[20:19:02] preferably one that's CSV / TSV
[20:19:28] +1. BTW, what do you think about setting up a different instance of quarry>
[20:19:29] ?
[20:19:38] Would you rather do that or just have us use the main quarry?
[20:20:02] halfak: I would rather just use main quarry.
[20:20:15] mostly so that we don't need to deal with requirements for a big other db
[20:20:22] Yeah.
[20:20:23] halfak: although it depends - settign up a new instance is fairly trivial
[20:20:30] halfak: I'm worried about hardware, mostly :)
[20:20:45] YuviPanda, happy to find an interesting dataset.
[20:20:50] J-Mo: yup! please do :)
[20:20:58] This would be good incentive to streamline the process of custom datasets that live beside the replicas
[20:20:59] don't worry about the format, I can figure that out :)
[20:21:04] inded
[20:21:05] *indeed
[20:21:11] J-Mo, find an XML one
[20:21:13] ;)
[20:21:15] :P
[20:21:26] so I can build a simple custom pipeline
[20:22:35] halfak: J-Mo fun weekend project, I think :
[20:22:36] :)
[20:23:00] nice :)
[20:25:10] J-Mo: halfak also prioritization for the WDQS -> QUarry setup vs loading random new datasets?
[20:25:22] https://etherpad.wikimedia.org/p/quarry-wdqs-integration is my ideas for WDQS -> Quarry btw
[20:26:01] It seems like the two would be related, but I think that loading new datasets should take priority.
[20:26:04] J-Mo, thoughts?
[20:26:50] yeah, quarry with more data in it is an MVP, as far as I'm concerned. Adding Wikidata would be a cool bonus.
[20:27:52] guillom: there are many, many nonprofit organizations
[20:27:58] ok
[20:28:12] I deem it blocked on you guys picking a dataset for me to import then, halfak and J-Mo :) today if possible
[20:28:13] for example, every single church
[20:28:28] harej: heh; I'm not used to that system.
[20:28:57] YuviPanda: no problem. will ahve you 1 or more sample datasets today
[20:30:24] YuviPanda, I've got a couple. Is 150 million rows too many?
[20:30:31] yes
[20:30:46] definitely under a million. smaller the better, to start with
[20:30:50] once we have a pipeline in place we can expand
[20:31:58] I have a 3.1 million row dataset that has ORES article quality measures.
[20:33:01] If you wanted, I could get a dramatically smaller one just for testing.
[20:34:15] halfak: how long did the ORES qualtiy thing take?
[20:34:27] A little more than 24h
[20:34:30] to build, that is
[20:34:30] halfak: also what's the size?
[20:34:55] * halfak checks.
[20:35:37] 200MB
[20:35:40] hah
[20:35:41] then yes
[20:35:43] let's do that
[20:36:51] http://datasets.wikimedia.org/public-datasets/enwiki/article_quality/article_period_stats.tsv.bz2
[20:36:53] YuviPanda, ^
[20:37:28] It contains two predictions per article that was edited in a 6 month period.
[20:37:39] The prediction before the start of the period and the prediction after.
[20:37:57] I have been using it to look for editing dynamics that produce high quality content and minimize effort.
[20:39:57] halfak: oh. I see. It isn't a straight up 'revid / pageid / scores' thing
[20:40:12] halfak: eventually we want to get this on the statistics nfs dumps thing
[20:40:25] wut
[20:40:54] Yeah, doing it historically would produce 670 million rows
[20:41:09] This one basically provides quality scores at two time-slices
[20:43:11] halfak: not for all revids
[20:43:15] Only current ones
[20:43:21] At the time of doing
[20:45:38] Oh. Well pretend that I did this analysis in July
[20:45:40] ;)
[20:46:03] Weird situation. https://stats.wikimedia.org/wiktionary/EN/PlotRevertsTrendsTR.png
[20:46:33] (Explanation is easy, sadly. https://stats.wikimedia.org/wiktionary/EN/PlotEditsTrendsTR.png )
[20:47:00] Did they disable anonymous editing for a while?
[20:47:34] ...and bots
[20:47:37] Nemo_bis, ^
[20:48:21] I doubt. From the graph it seems active users disappeared in 2012 hence unregistered edits go unpatrolled and unreverted.
[20:48:54] It's the 3rd (?) Wiktionary by absolute number of unregistered edits. :o
[20:49:27] halfak: yes I'll put it on once I reach office
[20:49:44] halfak: pick a name for it?
[20:50:52] ores_wp10_enwiki_july2015
[20:51:29] halfak: k
[20:51:56] Nemo_bis, might be reading the graph wrong, but it looks like Jan 2010 has zero-ish anon edits
[20:52:03] And a boosted amount of registered edits.
[20:58:39] AFAICS it's just the smoothing, making "very low" look like "zero". https://stats.wikimedia.org/wiktionary/EN/PlotEditsTR.png
[21:04:10] Nemo_bis, not sure that's smoothing. It should be close to zero if the smoothing is that low.
[21:05:28] halfak: am investigating doing importing now btw
[21:05:32] halfak: also you should try http://overpass-turbo.eu/ at some point :)
[21:05:48] is quite nice
[21:05:55] YuviPanda, as soon as my meeting-headache goes away
[21:06:19] halfak: +1
[21:06:23] halfak: no idea how you got through that
[21:06:23] <3
[21:06:41] :)
[21:09:28] halfak: interesting. so apparently I need to create a schema for that TSV (columname, datatype)
[21:13:06] halfak: ideally I'd have a thing that goes through arbitrary TSV and figures out what types things are
[21:14:22] halfak: is everything except 2nd col (page_title) integers?
[21:21:03] int, int, str, float, int, str, float
[21:21:16] YuviPanda, ^
[21:21:31] oh
[21:21:38] I made it all int >_>
[21:22:51] halfak: ok, importdone
[21:23:06] halfak: you can try it from quarry: u2029__quarry.ores_wp10_enwiki_july2015
[21:23:16] I have hit a bug in quarry that fails when your user is renamed
[21:23:17] lo
[21:23:17] ll
[21:24:29] halfak: might have wrong types
[21:24:33] halfak: took less than 30s
[21:24:33] so not bad
[21:24:38] that really is millions of rows?!
[21:25:25] It should be 3.1 million
[21:25:37] Nice
[21:25:53] I've to check if it handled nulls properly or not
[21:26:11] And somehow do an integrity check
[21:26:36] I would still like to write an inferer, mostly because it sounds like a lot of fun lol
[21:26:58] Inferer?
[21:27:58] halfak: given a tsv spits out a mysql table schema
[21:28:06] Can't seem to query: http://quarry.wmflabs.org/query/5142
[21:28:10] Oh yeah!
[21:28:12] R does that
[21:28:25] Oh
[21:28:28] Reads a whole dataset and decides, "That looks like an INT"
[21:28:28] Ugh
[21:28:30] Queued
[21:28:36] Something is wrong with quarry
[21:28:39] I should look
[21:28:41] queueueued
[21:28:46] I wonder if I should food first
[21:28:50] TCP keep alive? ;)
[21:28:54] Haha
[21:28:56] No
[21:30:46] halfak: we should try to deploy the timeout thing today
[21:30:49] So it doesn't recur on weekend
[21:31:08] Yes.
[21:31:17] I'm working on that right now.
[21:31:40] halfak: awesome
[21:31:49] I'm going to work on food
[21:31:51] And then quarry
[21:32:05] kk
[22:30:05] * halfak runs tests against ores-wikimedia-config
[22:30:10] Almost ready to deploy :D
[22:33:18] YuviPanda, when you have a minute: https://github.com/wiki-ai/revscoring/pull/184, https://github.com/wiki-ai/ores/pull/85, https://github.com/wiki-ai/ores-wikimedia-config/pull/27
[22:33:23] That should do it.
[22:33:52] I've got to run for a few hours, but I'll run a test on staging later tonight.
[22:34:06] If you don't wan't to merge, but you like what you see, just +1 and I'll self merge.
[22:34:06] o/
[23:09:50] halfak: consider them +1
[23:09:52] 'd
[23:58:21] halfak: fixed and your query worked btw
[23:58:24] halfak: http://quarry.wmflabs.org/query/5142
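
(Note on the "inferer" idea from 21:13 to 21:28: a minimal sketch of "given a tsv spits out a mysql table schema", in Python. This is not the code used for the import above. The function names, the BIGINT/DOUBLE/TEXT mapping, and the choice to treat empty strings and the literal NULL as nulls are all assumptions made for illustration. It streams the file once and widens each column from integer to float to text as values require.)

```python
import csv
import io

# Ordered from narrowest to widest; a column only ever widens.
INT, FLOAT, TEXT = 0, 1, 2
SQL_TYPES = {INT: "BIGINT", FLOAT: "DOUBLE", TEXT: "TEXT"}  # assumed mapping


def classify(value):
    """Return the narrowest level (INT, FLOAT or TEXT) that can hold value."""
    try:
        int(value)
        return INT
    except ValueError:
        pass
    try:
        # Caveat: float() also accepts strings such as "nan" and "inf".
        float(value)
        return FLOAT
    except ValueError:
        return TEXT


def infer_types(tsv_file, null_tokens=("", "NULL")):
    """Read a TSV with a header row; return [(column_name, sql_type), ...]."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    levels = [None] * len(header)  # None means no non-null value seen yet
    for row in reader:
        for i, value in enumerate(row[:len(header)]):
            if value in null_tokens or levels[i] == TEXT:
                continue  # nulls carry no type information; TEXT is final
            level = classify(value)
            if levels[i] is None or level > levels[i]:
                levels[i] = level
    # Columns with no values at all fall back to TEXT.
    return [(name, SQL_TYPES[TEXT if level is None else level])
            for name, level in zip(header, levels)]


def create_table_sql(table_name, columns):
    """Render a CREATE TABLE statement from infer_types() output."""
    body = ",\n".join("  `{}` {}".format(name, sql_type)
                      for name, sql_type in columns)
    return "CREATE TABLE `{}` (\n{}\n);".format(table_name, body)


if __name__ == "__main__":
    # Hypothetical three-column sample, not the real dataset's columns.
    sample = io.StringIO("page_id\ttitle\tscore\n1\tFoo\t0.5\n2\tBar\t\n")
    print(create_table_sql("example_table", infer_types(sample)))
```

(Reading every row costs one full pass over the file, but sampling only the first rows could pick INT for a column whose floats or strings appear later.)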