[00:00:24] it's surely a humongous table, and this is a quite broad set of constraints
[00:03:21] do you really need to do that casting?
[00:04:31] not sure, I figure MariaDB wants their own data type, and mediawiki uses yyyymmddhhmmss
[00:05:01] I've tried running a similar query against logging_userindex with a user and date range specified, and it worked
[00:05:19] so I at least know the date comparison should work
[00:15:57] MusikAnimal, um, by the way
[00:16:25] selecting count(*) in this way makes the LIMIT 50 a bit pointless :)
[00:23:27] mwpersistence 0.1.0 is ready. http://pythonhosted.org/mwpersistence/
[00:23:59] o/ MusikAnimal
[00:24:15] Saw your questions.
[00:24:20] I'd like to help. I think I can.
[00:24:27] yeah? that'd be great! :)
[00:24:43] sounds like we need some serious query optimization
[00:24:47] or another way to do it entirely
[00:25:50] I think the casting is the primary issue.
[00:25:56] I'm scoping out the indexes right now.
[00:26:09] Do you have a reference to the difference between logging_userindex and logging_logindex?
[00:26:14] to be honest I tried "SELECT 1 FROM enwiki_p.logging_logindex WHERE log_timestamp < 20030101000000 AND log_type = "delete" AND log_namespace = 3 AND log_title RLIKE "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" LIMIT 1;" and I gave up waiting for it
[00:26:31] Yeah. That one has a problem too :)
[00:26:41] You don't want to give it a number
[00:26:45] You want to give it a string
[00:27:14] what? I was only looking for the time to find a single row
[00:27:20] RLIKE also sounds very inefficient
[00:27:29] RLIKE totally is.
[00:27:39] We need to limit the rowscan with other stuff.
[00:27:41] No biggie.
[00:27:45] I'm sure we can :)
[00:28:25] YuviPanda, can you point me where to learn what indexes logging_logindex has?
[00:28:48] halfak: it's just a view, so it doesn't have any special indexes.
[00:29:10] halfak: it just allows the underlying engine to use the index on the underlying table better...
[00:29:15] by removing rows rather than nulling out values
[00:29:25] Yea. There's logging_userindex and logging_logindex
[00:29:31] There's something different about those, right?
[00:29:43] yeah, I think with logindex user can be null if it was suppressed
[00:29:48] and vice versa for userindex
[00:29:54] Gotcha.
[00:30:53] Krenair & MusikAnimal, this returns True: select "5" rlike "\\d"
[00:31:05] You needed another "\"
[00:31:08] ah, just the other \
[00:31:09] ok
[00:31:15] * halfak attempts a query.
[00:31:44] MusikAnimal, the answer to your query is 6091. Now to figure out how to run it on quarry.
[00:32:31] * YuviPanda wishes there was more hardware for quarry
[00:32:48] YuviPanda, why does quarry need more?
[00:32:55] What's the limitation?
[00:33:06] halfak: just more for labsdb
[00:33:15] people are reporting that queries are slowing down
[00:33:22] and that queries that used to run are now timing out
[00:33:29] partially because there's more data now and not bigger labsdb
[00:33:37] Seems like a good reason to invest in more hardware.
[00:33:47] yeah
[00:33:56] no budget for it nor DBA time to evaluate what's going on however
[00:34:30] we went from 2 DBAs back to 1
[00:34:38] and production DBs take precedence
[00:35:19] http://quarry.wmflabs.org/query/5210
[00:35:25] MusikAnimal & Krenair ^
[00:35:29] It finished on quarry.
[00:35:33] I've got to run.
[00:35:37] Good luck!
[00:36:24] log_timestamp < '20080101'... works?
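Putting halfak's three fixes together (compare log_timestamp against a string, double the backslashes in RLIKE, and let the indexed columns narrow the scan before the regex runs), the query pattern looks roughly like the sketch below. This is not the actual Quarry query 5210; the replica hostname and credentials file are assumptions about the Labs setup:

```python
# A sketch only -- not the actual Quarry query 5210. Hostname and
# credentials file are assumptions about the Tool Labs replica setup.
import pymysql

conn = pymysql.connect(
    host='enwiki.labsdb',                 # assumed replica hostname
    db='enwiki_p',
    read_default_file='~/replica.my.cnf'  # assumed credentials file
)

# Three fixes from the discussion above:
#  1. log_timestamp is a binary yyyymmddhhmmss column; comparing it to a
#     *number* forces a cast on every row and defeats the index, so pass
#     a string instead ('20080101' works).
#  2. MariaDB eats one level of backslashes in string literals, so RLIKE
#     needs '\\.' to mean a literal dot (cf. select "5" rlike "\\d").
#  3. RLIKE can never use an index, so the indexed columns (log_type,
#     log_namespace, log_timestamp) have to limit the rowscan first.
sql = r"""
SELECT COUNT(*)
FROM logging_logindex
WHERE log_type = 'delete'
  AND log_namespace = 3
  AND log_timestamp < '20080101'
  AND log_title RLIKE '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$'
"""

with conn.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone()[0])
```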
[00:36:30] Great timing FreeNode :|
[00:46:19] haha
[00:46:20] yes
[00:49:52] halfak: back
[00:50:01] let me know if you have any ideas
[00:50:54] I just had my bot restore about 2500 deleted IP talk pages that an admin had compiled, then I got to thinking that maybe we could locate others to restore
[00:52:18] MusikAnimal: http://quarry.wmflabs.org/query/5210
[00:53:04] yes!!
[00:53:32] can I clone this quarry? I want to select log_title
[00:54:01] An enwiki admin compiled that list of 2,500 pages manually?
[00:54:16] yeah...
[00:54:27] they were the ones they deleted, so easier to find
[00:54:27] ... Don't they have better things to be doing? :|
[00:55:06] Krenair: like what? Wikipedia's finished.
[00:55:13] MusikAnimal: you can copy and paste, I guess. no forking implemented yet
[00:55:43] YuviPanda: yeah that works, no problem. How long did it take for you? Currently running the `SELECT log_title`
[00:56:09] MusikAnimal: halfak did it :) didn't take him long, I think
[00:56:37] I was doing this with MySQL Workbench which annoyingly loses connection after a minute or so
[00:56:45] so for all I know it would have finished
[00:56:52] bam
[00:57:02] thanks for Quarry, it kicks ass!
[00:57:34] :)
[00:57:39] MusikAnimal: yw
[00:59:16] some interesting stuff here, lots of willy on wheels vandalism that was deleted
[00:59:55] guess I'm not going to be able to straight restore everything, some of this needs to be revdel'd
[14:37:49] Hi milimetric. I had a few questions about page view stats. I saw https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_medical_pages which shows mobile stats taken from /other/pagecounts-all-sites/ and I also saw https://phabricator.wikimedia.org/T44259 which seems to indicate that mobile page view stats are not yet publicly available. What's correct?
[14:39:07] Niharika: Andrew West does get data that includes mobile pageviews
[14:40:09] any indication that mobile page view data is not publicly available is just a misunderstanding, but an understandable misunderstanding. What data is available and where is quite confusing
[14:40:43] pagecounts-all-sites is the best public data available right now
[14:41:08] milimetric: Okay.
[14:42:40] milimetric: So T44259 is about providing an improved version of the data via the API?
[14:43:06] yes, Niharika
[14:45:02] milimetric: I saw on https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw#Aggregation_for_.mw that .mw data doesn't differentiate between page titles. So any idea how Andrew West's data does have hits for different pages on mobile?
[14:45:40] It also says it's best to ignore .mw lines.
[14:46:08] .mw is not mobile, one sec let me read that page
[14:48:11] oh, Niharika, I didn't even know about that weird mw thing. the pagecounts-all-sites indicates mobile requests with .m
[14:48:51] this .mw thing is just a weird cross-project aggregation that should probably be ignored
[14:49:53] milimetric: Okay. To sum up, what Andrew West probably does is - download the gz from http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/ which is provided hourly and query that for mobile page view stats. Right?
[14:50:29] Niharika: yes. But tell us what you need to do, we may be able to help
[14:53:17] milimetric: There's a bot which produces hundreds of reports like these: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Michigan/Popular_pages which needs mobile view data as well now. It says this data comes from domas' data which I assume points to http://dumps.wikimedia.org/other/pagecounts-raw/. I was wondering whether we should completely hold off on this task until T44259 is done.
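Each line in a pagecounts-all-sites file is plain space-separated text: project code, page title, view count, and bytes transferred, with mobile sites marked by a ".m" suffix on the project code (so "en.m" is mobile English Wikipedia). A minimal sketch of pulling mobile view counts for selected titles out of one hourly file; the filename and title set are illustrative:

```python
# Minimal sketch: sum mobile ("en.m") pageviews for selected titles from
# one hourly pagecounts-all-sites file. Filename and titles are illustrative.
import gzip

def mobile_views(path, titles):
    counts = dict.fromkeys(titles, 0)
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            fields = line.rstrip('\n').split(' ')
            if len(fields) != 4:
                continue  # skip malformed lines
            project, title, views, _size = fields
            # 'en' is desktop enwiki; 'en.m' is the mobile site
            if project == 'en.m' and title in counts:
                counts[title] += int(views)
    return counts

print(mobile_views('pagecounts-20150901-120000.gz', {'Michigan'}))
```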
[14:56:25] I see. Niharika, the pagecounts-all-sites was built specifically so it could be easily swapped for pagecounts-raw. So the bot should be easy to update. If you want to wait for the pageview API, that may take another month or so to be ready. We're tied up in some slow-moving discussions right now
[14:56:58] milimetric: Okay, so pagecounts-raw basically contains the same information as pagecounts-all-sites?
[14:57:38] Niharika: same format, but with mobile pageviews
[14:58:20] milimetric: That's what I want, mobile hits, so pagecounts-raw makes more sense to stick with, right?
[14:59:22] Ah! Sorry, I meant pagecounts-all-sites has mobile, pagecounts-raw has NO mobile, both use the same format
[15:00:09] milimetric: Okay. That clears up my confusion. Thanks!
[15:00:24] np
[15:02:01] o/ guillom and other sciency people
[15:02:14] hello halfak, milimetric, Niharika
[15:02:40] Science!
[15:02:51] hm, what? :)
[15:03:42] :D
[15:04:20] I have "science" as a pingword in my client, so I'm always harassed by it during "Science Tuesdays" at the office.
[15:04:35] And jelly of the SF folk who get to go have a science day :(
[15:45:58] Hi guillom.
[15:48:21] Niharika, do you guys have a channel for your team?
[15:49:07] halfak: Yep. #wikimedia-commtech
[16:06:51] Oh, today is Research Showcase day!
[16:07:15] !
[16:07:25] Not on my calendar somehow
[16:07:46] Not on the Research calendar either.
[16:08:42] Looks like we have a streaming link in advance this time.
[16:08:43] http://www.youtube.com/watch?v=eJk6mxJZhH8
[16:08:52] It's on the staff calendar, I copied it into mine.
[16:08:58] Just did the same
[16:09:08] I dunno what the purpose of the staff calendar is.
[16:09:20] I never keep it visible.
[16:10:18] There are few enough events in it to keep it visible, I've found.
[16:10:33] Hmm... Maybe I'll turn it on more often.
[16:11:05] It's also an opportunity to see how often the "Breakfasts with Lila" are cancelled / postponed :)
[16:12:19] heh.
[16:12:38] Looks like we overlap with the "brand awareness in South Africa" talk.
[16:12:40] That's too bad.
[16:12:57] Seems like there would be overlap in interest.
[16:50:53] What time is the research showcase?
[16:50:59] I'll try to be at the office
[16:52:31] YuviPanda, 11:30 PDT
[16:52:43] Ok
[16:58:57] YuviPanda: FYI if you're attending it in the office, it's in R51 (big room on the 5th floor), not in the 5th floor lounge (probably because that's used for the other talk).
[16:59:13] Ah
[16:59:32] I might not make it. Just woke up and I need to do a run now. Run vs research showcase...
[18:00:49] hi halfak. If you're planning to be around for the whole time during the research showcase, can you relay IRC questions to speakers?
[18:01:03] If not, I'll do it and I'm sure guillom will help if he's around. ;-)
[18:01:07] leila, sure
[18:01:14] perfect. thank you, halfak.
[18:02:04] halfak: just added you to the event with the Hangout on Air link
[18:02:20] I'll be in ~ 5 minutes before we start
[18:02:26] Sound good?
[18:02:27] sounds good, halfak
[18:06:32] oww Nettrom. hi. :D
[18:07:22] o/ leila
[18:07:33] figured I should perhaps drop in here :)
[18:08:04] Nettrom: question about bio. I'll say PhD candidate in GroupLens, WMF fellow, creator of SuggestBot, what else?
[18:08:19] Netter of Troms
[18:08:26] halfak: :)
[18:08:45] leila: "creator of SuggestBot" is not correct, "maintainer" is better
[18:08:49] Fjord Forder
[18:09:02] Dan Cosley made it
[18:09:44] Maintainer doesn't sound as good as "primary dev for the last 7 years"
[18:10:00] halfak: yeah, and there's not that much code left over either... hmmm
[18:10:46] maybe "primary dev of SuggestBot" is a good compromise?
[18:11:19] got it, Nettrom.
[18:11:19] SuggestBot's Master
[18:11:22] ;)
[18:11:32] thanks leila
[18:11:46] I'll guarantee at least maintainer
[18:11:46] :D
[18:12:05] if I get too confused, I may go with primary dev of SuggestBot. Who knows? you may get a new title during this process, too. :D
[18:12:15] :D
[18:12:23] good that you ask, since I'll refer to it in the talk
[18:12:30] halfak and ORES get a shoutout too
[18:12:33] \o/
[18:12:51] Nettrom: if you prefer I don't say anything, you can also say it yourself. Whatever you prefer.
[18:13:17] I'll give a quick intro for Besnik though.
[18:13:43] leila: I'd prefer that you introduce me, I haven't prepped a "who am I?" slide like halfak does
[18:13:49] and btw, we should expect the event to start a few min late, the room is reserved until 11:30 (though we've been told they try to go out 5 min earlier)
[18:13:56] :D
[18:13:56] ooki
[18:14:19] leila: sounds good, aren't all meetings in the foundation 5 mins late anyways? ;)
[18:14:36] Nettrom: no, only the researchy ones. :D
[18:14:47] Nettrom: so you'll be first, okay?
[18:14:55] leila: sure thing!
[18:15:10] looking forward to it, I finally read your paper last night.
[18:15:43] leila: cool, should be fun, it's a topic I like talking about :)
[18:15:49] All researchy meetings start late -- academia/industry w/e
[18:16:10] and at ICWSM I only had 10 mins, so that version was very condensed
[18:16:23] yeah, totally
[18:17:41] * Nettrom brb
[18:18:01] leila: Sorry, I'm working from home today so I'll be one of the "remotes" for the showcase :)
[18:19:15] ooki, guillom.
[18:25:06] soo, halfak, do you have a sense that we are all good with tech? this will be the first presentation after such a long time with no glitches. :D
[18:28:42] Seemingly
[18:30:47] halfak: here is the YouTube link if you want to add it to the title: https://www.youtube.com/watch?v=eJk6mxJZhH8
[18:30:51] we are not live, yet.
[18:31:06] leila, someone said we are live
[18:31:08] we are live
[18:31:12] gogogo
[18:31:16] leila
[18:31:36] Hey folks. Direct your questions at me.
[18:31:40] The feed should be live
[18:31:45] https://www.youtube.com/watch?v=eJk6mxJZhH8
[18:31:46] it is halfak!
[18:31:47] thank you
[18:32:14] to post it here too https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2015
[18:33:18] and here is the research newsletter's review btw ;) https://meta.wikimedia.org/wiki/Research:Newsletter/2015/April#Popularity_does_not_breed_quality_.28and_vice_versa.29
[18:34:21] * halfak adds a ref
[18:34:44] ^ to the showcase page
[18:40:52] leila, can you hide the presenters?
[18:40:59] The thumbnails at the bottom
[18:41:11] We'll get a better view on youtube
[18:42:30] halfak: I've already hidden it on my end
[18:42:37] like, 8 minutes ago
[18:42:41] OK. I'll check the stream again
[18:42:43] do you still see it?
[18:42:52] Nope. Looks good.
[18:43:04] ah! it can be that Nettrom should do it, since we're showing his screen?
[18:43:21] No pinging the speaker :P
[18:43:23] ook. then. :D
[18:44:04] I can always tell when I say something controversial while I'm presenting because there will be a sudden flurry of pings ;)
[18:44:51] * Emufarmers pings halfak now to get it out of his system
[18:44:58] just turn the IRC off, halfak. :D
[18:44:59] Could it be that high popularity articles may be either very complex or controversial, and therefore more difficult to bring to high quality status?
[18:45:09] o/ Emufarmers
[18:45:19] guillom, super good question
[18:45:20] * J-Mo pings halfak now, for the lulz
[18:45:36] I wanna measure labor hours going into those articles to work out /efficiency/ :)
[18:46:19] ORES thinks that the article on Vietnam War is FA class
[18:46:39] halfak, does the view data exclude bots/spiders as best as we can?
[18:46:49] I'd guess that's relatively common among "highly misaligned" articles.
[18:47:17] leila, I think that Nettrom is using the old hourly viewlog data. I don't think they did a good job of filtering spiders in that.
[18:47:57] got it, halfak. not sure how much impact it can have, but it's good to run the same analysis on the cleaner data. (no pinging of speakers, please. :D)
[18:48:36] "Categories are messy." Yup.
[18:48:50] Bah! You're right! I did it. :( Sorry presenter-who-will-not-be-named
[18:49:31] haha! halfak, don't worry about it.
[18:50:22] guillom: yes, that point was brought up by readers of the newsletter too when the paper was reviewed https://twitter.com/WikiResearch/status/598030486632443904
[18:51:07] ... back then Nettrom already commented on it
[18:51:14] Thanks HaeB
[18:51:29] ...it would still be interesting to hear further thoughts on that
[18:52:28] Questions are queued :)
[18:53:49] 3 min, Nettrom, for a 5 min discussion.
[18:56:29] The Vietnam War article was last put up for GA... 9 years ago? >.>
[18:56:51] Just to jump off the earlier comment from guillom, I suspect some low popularity, high quality articles end up that way because they are less subject to editorial disruption and disagreement. They can be good practice for improving high quality articles. : )
[18:57:11] er, high popularity articles, not high quality*
[18:59:33] prime examples (of articles that are important but require an above-average effort) are countries, which happen to be on top of the misalignment list too ;)
[18:59:42] https://twitter.com/phoebe_ayers/status/598191169894359040 "I worked for weeks on [[Uganda]]. 80K viewers. Total mess (& sources aren't v. good either)."
[19:00:48] Another question if there's time: What do people think about article forking/branching/merging (e.g. like in software development branches) as a mechanism for working on complex/controversial articles?
[19:01:14] (I know it's a larger discussion, just wondering how it could apply in this context.)
[19:02:49] e.g. collaborative sprints on a separate page or space to work on high-popularity articles.
[19:03:21] +1, was just thinking the same https://en.wikipedia.org/wiki/Vietnam_War - it was a good article formerly btw
[19:04:27] guillom - Could work pretty well assuming there is existing agreement on what needs merging or forking. I've been through a number of RfCs where those details can get very particular.
[19:04:29] HaeB: you're +1ing halfa.k's comment about outdatedness, I presume?
[19:04:31] HaeB: I see it was nominated. Was it ever actually made one?
[19:05:35] Oh, structured and unstructured data!
[19:05:36] Hey folks, I forgot to note that it was question time. If you asked a question and I missed it, know that Nettrom is a local in this channel, so you can always ask him directly on IRC.
[19:06:05] Emufarmers: right, nominated only. still, start-class or c-class seems pretty harsh ;)
[19:06:40] release of the usage of the name Nettrom :D
[19:07:04] HaeB: only 434 notes? needs moar
[19:07:14] topic
[19:07:49] research newsletter review of besnik's paper https://meta.wikimedia.org/wiki/Research:Newsletter/2015/August#.22Automated_News_Suggestions_for_Populating_Wikipedia_entity_Pages.22
[19:08:41] * Nettrom is looking forward to scrolling through the backlog, interesting chat
[19:16:03] not sure that is a great example (business insider as source for a biography of a major historical figure ;)
[19:16:33] (http://www.businessinsider.com/tesla-predicted-smartphones-in-1926-2015-7 --> https://en.wikipedia.org/wiki/Nikola_Tesla )
[19:20:00] HaeB - Agreed, that article was written by an undergraduate intern.
[19:20:20] Heh, Odisha did mention the cyclone, until someone removed it: https://en.wikipedia.org/w/index.php?title=Odisha&diff=497760938&oldid=497756493
[19:28:44] HaeB: by the way, some days ago I added some stuff to https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-09-30/Recent_research
[19:29:10] It's quite close to being a special issue on Wiktionary research. ^^
[19:29:41] We only need one Hungarian speaker and one German speaker to clear the backlog of Wiktionary papers on the etherpad.
[19:29:47] +1 to the idea of "Special Issue" research newsletters.
[19:31:24] Questions for Besnik, please post them here and halfak will relay. :-)
[19:31:33] ^ that
[19:31:37] Nemo_bis: cool, thanks! yes i was mulling that idea too (collecting wiktionary papers in one issue)
[19:31:39] Besnik is just wrapping up
[19:32:21] HaeB: oh, better so :)
[19:32:44] yeah, totally halfak. that was my comment, too.
[19:38:50] Quarry: Too much whitespace in Quarry - https://phabricator.wikimedia.org/T112803#1646597 (Spage) NEW
[19:39:27] thanks, halfak, for taking care of the questions. :-)
[19:39:30] thanks guys!
[19:39:45] Great showcase, thank you all:
[19:39:46] !
[19:39:52] thanks atgo-lunch and guillom.
[19:39:55] \p/
[19:39:58] p face
[19:40:00] p values
[19:40:48] great fun, thanks everyone! :)
[19:40:56] Indeed. Thanks for presenting.
[19:41:08] Oh! Nettrom, let me get you a link to that ORES dataset
[19:41:26] http://datasets.wikimedia.org/public-datasets/enwiki/article_quality/article_period_quality_scores.tsv.bz2
[19:41:46] hmm, YouTube now wants me to switch to this South Africa thing
[19:41:49] * Emufarmers clicks
[19:42:17] The weighted sum is the result of this: sum(p*CLASS_VALUES[k] for k, p in score['probability'].items())
[19:42:48] class value for a particular class is numbered 0-5 (Stub, Start, ..., GA, FA).
[19:43:17] halfak: cool, lemme have a look at that
[19:43:54] So the dataset has two columns. A sample on 20150101 and a sample on 20150701.
[19:44:00] So 6 months in between.
[19:44:28] I'll be doing a writeup for this soon.
[19:44:37] And I have yuvipanda loading it into Quarry >:)
[19:44:53] halfak: so there's two sets of predictions/rating pairs?
[19:44:56] Mwahahaha
[19:44:58] Yup
[19:45:08] also, the last column should be labelled "end_weighted_sum"?
[19:45:38] * halfak tries to figure out wat
[19:46:01] Bah!
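To unpack halfak's one-liner above: the model emits a probability for each of the six assessment classes, and valuing the classes 0 through 5 turns those probabilities into a single expected-quality number. A worked example with invented probabilities:

```python
# Worked example of the weighted sum. The probabilities below are
# invented; a real score dict comes from the article quality model.
CLASS_VALUES = {'Stub': 0, 'Start': 1, 'C': 2, 'B': 3, 'GA': 4, 'FA': 5}

score = {
    'prediction': 'B',
    'probability': {
        'Stub': 0.02, 'Start': 0.08, 'C': 0.25,
        'B': 0.40, 'GA': 0.15, 'FA': 0.10,
    },
}

weighted_sum = sum(p * CLASS_VALUES[k] for k, p in score['probability'].items())
print(round(weighted_sum, 2))  # 2.88 -- between C (2) and B (3), leaning B
```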
[19:46:05] halfak: last column has the same name as the other column, sorry ;)
[19:46:08] prev_weighted_sum -- AGAIN
[19:46:58] BTW, you'll see weirdness when there was no parent revision; it will make the prediction based on a blank page. That has some known issues.
[19:47:01] halfak: anyways, we've got a cool dataset, I'll have a look at it after I've done some Wiki-Class work :)
[19:48:15] Woot. Oh! BTW, did you notice that I merged in the big refactor?
[19:48:21] The codebase is way cleaner now.
[19:48:30] https://github.com/halfak/wikiclass
[19:48:38] See also http://pythonhosted.org/wikiclass/
[19:49:13] The utilities are really nice. I can extract all class labelings from enwiki overnight.
[19:49:36] There's also an extract_text utility that will cache the text so you can iterate quickly with new features.
[19:49:44] See http://pythonhosted.org/wikiclass/utilities.html
[19:49:50] halfak: nice, making a note to look at those as well!
[19:52:58] Nettrom, sorry to throw a bunch of things at you, but I wanted to show you one more thing in wikiclass.
[19:53:09] So I've been working on making it easier to support new wikis.
[19:53:15] Here's the template extractor for enwiki: https://github.com/halfak/wikiclass/blob/master/wikiclass/extractors/enwiki.py
[19:53:30] I wrote one up for frwiki and it seems to work really well: https://github.com/halfak/wikiclass/blob/master/wikiclass/extractors/frwiki.py
[19:53:47] I'd like to implement an extractor for all the wikis you think have enough labelings.
[19:54:44] halfak: no worries
[19:57:35] halfak: I like this type of setup, seems straightforward to work with, hopefully other wikis are similarly structured so it works there too
[19:57:40] (which I suspect is the case)
[19:58:46] +1 I've got some abstraction in place in case that isn't true.
[20:04:58] CA DMV is where hopes come to die
[20:06:55] * Nettrom lunch
[20:14:34] bearloga: :(
[20:14:44] * guillom still needs to get that driver's license.
[20:18:09] nothing is more fun than explaining randomness to people
[20:23:43] halfak|Lunch: I think I'll spend some time later this month / early next month moving Quarry to run on Kubernetes
[20:23:45] the new toollabs thing
[21:06:24] is kubernetes like docker/vagrant?
[21:07:06] yuvipanda, ^
[21:07:18] halfak: it uses docker.
[21:07:25] It basically lets you say
[21:07:40] 'Run x copies of these containers in this configuration forever'
[21:09:57] Gotcha.
[21:10:02] Cool :)
[21:15:06] * halfak goes to re-compress 600GB of data to bz2 because I'll probably never be able to use snappy on the analytics machines.
[21:16:00] halfak: heh is this the thing that was brought up on sos?
[21:16:07] git 1.7.1 is super weird.
[21:16:12] yup
[21:16:25] It's going to require compiling me a custom binary.
[21:16:50] I'd offer to help but am already a bit filled up
[21:16:53] Sorry :(
[21:17:42] No worries.
[21:17:43] :)
[21:18:27] If you can find a Debian package somewhere I can help import it :)
[21:22:01] I do need a deb. http://packages.ubuntu.com/trusty/libsnappy-dev
[21:22:06] This one is one of the dependencies.
[21:23:19] https://github.com/kubo/snzip
[21:23:26] That's the actual code
[21:23:37] They've WONTFIX'd debs.
[21:24:01] And provide a tarball https://bintray.com/artifact/download/kubo/generic/snzip-1.0.2.tar.gz
[21:24:06] "bintray"???
[21:24:23] For some reason, this differs from the tarball that they host on github
[21:24:30] -._o_.-
[21:24:38] I've learned my lesson. No more snappy!
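The per-wiki extractors halfak links a few messages up all share one core move: walk the templates on an article's talk page and read the assessment out of the WikiProject banners' |class= parameter. A rough sketch of that idea using mwparserfromhell follows; it is not wikiclass's actual code, and the label set shown is enwiki's:

```python
# Rough sketch of the extractor idea -- not wikiclass's actual code.
import mwparserfromhell

LABELS = {'stub', 'start', 'c', 'b', 'ga', 'fa'}  # enwiki's 0-5 scale

def extract_labels(talk_wikitext):
    """Yield assessment labels found in |class= template parameters."""
    for tpl in mwparserfromhell.parse(talk_wikitext).filter_templates():
        if tpl.has('class'):
            label = str(tpl.get('class').value).strip().lower()
            if label in LABELS:
                yield label

text = "{{WikiProject Military history|class=B|importance=mid}}"
print(list(extract_labels(text)))  # ['b']
```

Supporting a new wiki then mostly means swapping in that wiki's banner conventions and label vocabulary, which is presumably what the per-wiki extractor modules do.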
[21:28:20] Oh god. I've just realized... I can't just recompress these files. I have to sort and shuffle them again.
[21:28:30] Or hadoop will reorder and shuffle the lines
[21:28:32] Arg
[21:30:07] halfak: oh, I see.
[21:30:12] not much I can do then :(
[21:30:32] No worries. Thanks anyway yuvipanda
[21:30:41] Really I just want one beefy machine that I can root on.
[21:30:51] I suppose I could set up such a machine in labs.
[21:30:58] But then I couldn't transfer the data files.
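For context on that last realization: bz2 is a splittable format, so Hadoop will carve each recompressed file into blocks and hand them to mappers in whatever order it likes, which is why recompressing alone is not enough when line order carries meaning. A sketch of the recompression step itself, driven from Python; the snzip flags and filenames are assumptions:

```python
# Sketch of the recompress step; snzip flags and filenames are
# assumptions. Note this does NOT solve the ordering problem by itself:
# Hadoop will still split the bz2 output and process blocks in any
# order, so any required sort/shuffle has to be redone downstream.
import subprocess

src, dst = 'revdocs-part-00.snappy', 'revdocs-part-00.bz2'

# snzip -dc: decompress snappy to stdout (assumed flags, per gzip
# convention); bzip2 -c: recompress into a format the analytics
# machines can read and Hadoop can split.
subprocess.run(f'snzip -dc {src} | bzip2 -c > {dst}',
               shell=True, check=True)
```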