[06:20:52] are Wikipedia and Wikimedia under the same management and do they have the same purposes?
[06:23:40] are there any operators of this channel available?
[14:04:08] halfak: hi! :-)
[14:04:17] hey Helder.
[14:04:26] I saw your profanity lists. Thanks :)
[14:04:53] I remembered another (small) one:
[14:04:56] I'm going to have to lean on you and Henrique to build up a good one for a classifier though.
[14:04:59] https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Huggle/Config#Previs.C3.A3o
[14:05:39] I thought about starting it, but Henrique said he would start it, and he was not online when I was here, so I don't know if he did anything
[14:05:41] halfak: btw, Quarry's been dead since Friday because of a labsdb issue :( Coren is looking into it
[14:05:44] (just an FYI)
[14:06:13] YuviPanda, thanks for the FYI. I haven't been on it recently, so I didn't notice.
[14:06:21] yeah
[14:06:30] I need to build in some form of notification for people trying to use it :(
[14:06:32] Helder, the idea of different scores for different badwords is interesting.
[14:06:33] halfak: would a list of words work too? Or do you really need stems? (is that the term?)
[14:06:50] The words work fine. I'll run them through a stemmer anyway.
[14:06:58] I was reading the WP article to get the idea :-)
[14:07:03] ah
[14:07:15] :) <3 stemming
[14:08:09] So, given that the Salebot list already has some scores, should we/I start with the ones with higher scores? (and keep separate lists for each score?)
[14:10:11] halfak: Would a list with lines containing "-30 foo", "-30 bar", "-25 baz" work for you? I could dig into those regexes to separate the words they match
[14:11:05] Hmmm... For the time being, I think that I can work from the word lists you have provided.
[14:11:24] We might want to revisit this in another iteration with the classifiers.
[14:11:40] ok
[14:12:02] if you need my help with something, ping me here or on ptwiki :-)
[14:12:30] Will do. Thanks. :)
[14:13:19] BTW: what will these classifiers be used for? What are the evil plans related to this?
[14:17:27] halfak: also, does "the system" work with words only? Or are pairs of words also useful for something?
[14:18:33] For plans, honestly, I'm solving my own problem in a general way so that other people can take advantage of it.
[14:18:42] I want Snuggle running on ptwiki.
[14:18:57] great!
[14:19:03] https://en.wikipedia.org/wiki/Wikipedia:Snuggle
[14:19:21] As for pairs of words, we could do n-grams. I didn't plan for that, but it shouldn't be so hard.
[14:19:34] * Helder wants ClueBot NG
[14:19:55] Helder, I'm hoping that's where this will lead.
[14:20:02] Or at least ClueBot NG-like things.
[14:20:03] woohoo!!! :-)
[14:20:03] * YuviPanda should sit with halfak at some point (after the current big-refactor-of-ops/puppet is done) and puppetize all the things
[14:20:22] YuviPanda, <3
[14:20:25] That'll help me get more competent with puppet.
[14:22:20] halfak: I know at least one other user on ptwiki who would help train "the system" by evaluating a random set of revisions as good/bad edits
[14:22:47] YuviPanda: in short, what does it mean to puppetize something?
[14:22:48] Awesome!
[14:23:25] Helder: essentially, we declare (using puppet code) what needs to be done to automatically 'set up' everything needed to run the service
[14:23:29] I'm hoping we'll be looking to train some classifiers soon. In the short term, I can generate some random samples of revisions.
[14:23:32] Gotta run.
[14:23:40] might involve installing things, putting files in places, starting up services, etc.
[14:23:42] * halfak goes off to give a lecture
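A minimal sketch of how the scored word lists discussed above (Salebot-style lines like "-30 foo") might be parsed and applied after stemming. This is an illustration, not the actual classifier code: the file format details, the naive whitespace tokenizer, and the choice of NLTK's Portuguese Snowball stemmer are all assumptions.

```python
# Hypothetical sketch: parse a Salebot-style scored badword list
# ("-30 foo" per line) and score text against it after stemming,
# along the lines halfak describes. Requires: pip install nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("portuguese")  # assumption: ptwiki content

def load_scored_badwords(path):
    """Parse lines like '-30 foo' into {stem: score}, keeping the
    most negative score when two words share a stem."""
    scores = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            score_str, word = line.split(None, 1)
            stem = stemmer.stem(word.lower())
            scores[stem] = min(int(score_str), scores.get(stem, 0))
    return scores

def score_text(text, badwords):
    """Sum badword scores over naively whitespace-tokenized text."""
    return sum(badwords.get(stemmer.stem(tok.lower()), 0)
               for tok in text.split())
```

Keeping the per-word scores, rather than flattening to a plain word list, preserves Salebot's severity information for the later classifier iteration mentioned above.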
[14:24:13] halfak: ok, please give me the random sample when available :-)
[14:24:53] YuviPanda: thanks
[14:40:57] If a given article gets a spike in views, is it possible to know what the MOST common source of the visits is? (referrer?)
[14:41:17] Does the privacy policy allow one to publicize that?
[17:55:30] Helder, re PVs
[17:55:39] I'm actually doing some work with the Signpost writers now to pin that down
[17:55:46] so if you have an urgent thing about a specific article, poke me
[18:20:51] Hey leila, can you add a call to the prep meeting?
[18:23:04] lzia, can you add a call to the prep meeting?
[18:23:37] I did.
[18:24:20] halfak, do you have it?
[18:24:24] * halfak refreshes
[18:24:25] ewulczyn, I solved your double-counting problem, I think
[18:24:33] (sorry, in a meeting. Will reply to thread shortly)
[18:26:16] halfak, we can see and hear you
[18:27:44] Ironholds: awesome
[18:28:32] ewulczyn, quick check; do you care about all the wlm_2014 banners or just the explicit wlm_2014 banner?
[18:30:20] Ironholds: the concerns are not specific to wlm_2014. Other banners also show discrepancies.
[18:31:45] ewulczyn, such as?
[18:33:13] Ironholds: the total counts for B14_0910_other_y_enUS_tab also differ drastically across wmf_raw and pgehres
[18:34:13] gotcha
[18:39:42] okay, back!
[18:39:45] let me run a few more tests
[18:39:55] ewulczyn, the specific question, though: name all the banners you're looking to extract
[18:40:06] i.e., is it wlm_2014, or wlm_2014_fr, or both, etc.?
[18:43:05] Ironholds: ah, I understand what you are getting at. I meant to count just wlm_2014, but the regex matches wlm_2014 + any country postfix, etc.
[18:43:40] ewulczyn, yep, and so that may be the source of the double-counting
[18:43:53] will reply in the thread. Want to pair-program digging into the other one?
[18:44:11] Ironholds: yes!
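To illustrate the prefix-matching problem described in the exchange above (the actual regex is not shown in the log): an unanchored pattern for wlm_2014 also matches every country-postfixed variant, so totals that sum both the bare banner and its variants count the same impressions twice. Apart from the names quoted in the log, the banner names below are hypothetical.

```python
import re

# Hypothetical banner names; only wlm_2014 and B14_0910_other_y_enUS_tab
# appear in the log itself.
banners = ["wlm_2014", "wlm_2014_fr", "wlm_2014_de",
           "B14_0910_other_y_enUS_tab"]

# Unanchored: matches the bare banner *and* every country variant,
# which double-counts if the variants are also tallied separately.
loose = re.compile(r"wlm_2014")
print([b for b in banners if loose.search(b)])
# ['wlm_2014', 'wlm_2014_fr', 'wlm_2014_de']

# Anchored: matches only the exact banner name.
exact = re.compile(r"^wlm_2014$")
print([b for b in banners if exact.search(b)])
# ['wlm_2014']
```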
[18:46:04] ewulczyn, cool! Okay, IRC or hangouts, pick :D
[18:46:48] Ironholds: hangout
[18:47:24] gotcha
[18:47:31] okay, schedule a meeting so we have something the hangout is tied to?
[18:47:42] I'll go for a smoke and throw it in Trello (yay work-logging :D)
[18:55:50] ewulczyn, in the meeting
[19:38:19] lzia, you're back! yay!
[19:38:24] we've got the band back together
[19:38:43] just think, for the all-staff we will have the highest researcher density the WMF has ever hosted
[19:49:04] ewulczyn, ping
[19:49:45] heading to the store, but: can I suggest looking at the timestamps of each request in hive/fr-logs?
[19:50:05] i.e., if we pull out the first 5 minutes, do we see a massive disparity? Is this some script kiddie or a uniform pseudo-double-counting thing, or what?
[19:50:18] I suspect status codes may be part of the deal but we'll see.
[20:23:36] hey tnegrin, wanna hear something cool?
[20:23:57] yes!
[20:24:02] So I got bored working on mobile and, on the side, decided we needed deliverables on PVs for the quarterly review
[20:24:18] So I built a prototype implementation of the new definition running off the sampled logs
[20:24:22] That a good deliverable? ;p
[20:24:38] streaming it into the MySQL dbs so we can manipulate it easily.
[20:24:54] let's talk -- in a meeting
[20:24:56] but very cool
[20:25:05] kk
[20:36:56] * halfak --> errands.
[20:37:07] I'll be back on in a couple hours and working late tonight to make it up.
[22:45:11] phew
[22:45:14] last CHI submission is done
[22:45:24] and now I can do what everyone does after CHI submission! DRINK UNTIL I GO BLIND.
[22:55:35] What is the best way to get a CSV of the unique visitors per country per month (going back to 2012)?
[22:56:20] (visitors on any project)
[22:58:52] Ironholds, ^
[23:00:19] ewulczyn, hahahahahaha
[23:00:21] ...sorry.
[23:00:32] we don't do unique tracking. We've been trying to do unique tracking for months, but...
[23:00:54] we can do pagecounts per country. I have data on that since March 2013, and it's even in a MySQL table for your browsing convenience.
[23:01:01] *pageviews
[23:01:15] forget unique, total views are fine
[23:01:26] SELECT country, SUM(pageviews) AS views FROM staging.pageviews GROUP BY country;
[23:01:30] ack, sorry
[23:01:47] Ironholds: sweet! Thank you.
[23:01:49] you also want a DATE_FORMAT(timestamp, '%Y%m%d') AS month in there
[23:01:54] then group by month, country
[23:02:06] I actually only finished generating this dataset this morning, so it still needs to be checked. You're the alpha tester :D
[23:02:39] Ironholds, LEFT(timestamp, 8) will perform better.
[23:02:47] As opposed to the DATE_FORMAT call.
[23:03:00] halfak, sensible
[23:03:07] that also answers the question I emailed you with!
[23:03:11] The LEFT() function will make use of the btree index.
[23:03:13] woot.
[23:03:18] the data is sampled, right?
[23:03:54] tnegrin, yeah, 1:1000
[23:04:03] I haven't tried consuming the raw logs yet. Although I could.
[23:04:04] hmmn
[23:04:07] * Ironholds thinks
[23:04:24] yeah, I can see a way of doing that without blowing up stat2. But it's probably overkill and wouldn't give us the temporal depth of data.
[23:12:48] let's talk offline about this
[23:19:03] tnegrin, totally
[23:19:07] post-review, I think ;p
[23:19:14] alright, back in an hour or so. Gotta see a man about a dog.
[23:19:32] kk
[23:23:59] Ironholds is getting a dog?
[23:24:19] (or is this another euphemism I'm missing?)
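Putting the pieces of the query exchange above together: a sketch of the monthly per-country pageview pull, using LEFT() on the string timestamp in place of DATE_FORMAT() as halfak suggests. The connection details, the YYYYMMDDHHMMSS timestamp layout, and the 6-character prefix for monthly granularity (rather than the 8-character daily prefix mentioned in the log) are assumptions.

```python
# Sketch only: assumes staging.pageviews stores timestamps as
# YYYYMMDDHHMMSS strings, so LEFT(timestamp, 6) yields YYYYMM.
# Requires: pip install pymysql
import pymysql

QUERY = """
SELECT LEFT(timestamp, 6) AS month,  -- YYYYMM; LEFT() per the log,
       country,                      -- in place of DATE_FORMAT()
       SUM(pageviews) AS views
FROM staging.pageviews
GROUP BY month, country;
"""

# Hypothetical connection details.
conn = pymysql.connect(host="analytics-store.example",
                       user="research", database="staging")
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for month, country, views in cur.fetchall():
            print(month, country, views)
finally:
    conn.close()
```

Per the exchange with tnegrin, the underlying logs are 1:1000 sampled, so these totals would need scaling by roughly 1000 for absolute estimates.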