[00:00:34] marktraceur: you can create some tables (I'd have to check if your user has permissions to do that) and insert the results of your queries in there
[00:00:55] and then you can just select max(datestring) from those tables
[00:01:20] or you can use the files you're appending to and tail the last line + parse the date
[00:01:36] milimetric: I'm just using the read-only user, though
[00:01:48] oh... this makes me think of something else. You could, of course, use bash as a way to "template" sql
[00:01:55] Eugh
[00:01:58] :)
[00:02:10] sorry, I never know how people feel about bash
[00:02:17] i definitely fall in the Eugh category
[00:02:25] It's great for doing things but not great for programming
[00:02:40] right, so read only user
[00:02:58] uh, then the simplest thing would be to tail the file and parse the last date
[00:03:09] In bash
[00:03:14] Or in SQL somehow?
[00:03:38] sql is a one-trick pony
[00:03:45] I figure...wait
[00:03:46] No
[00:03:46] so bash is one option, node, python, etc.
[00:03:55] I think there actually might be a way to do this
[00:04:36] Agh, no, INTO OUTFILE is only for new files. Eff.
[00:05:44] milimetric: NVM, tnegrin says eff it and run it on everything
[00:05:56] certainly, that's an option
[00:06:04] your tables will eventually get too big
[00:06:09] but you can deal with that when you get to it
[00:06:14] and hopefully by then
[00:06:21] we'll have this pipeline smoother than a baby's
[00:06:22] ...
[00:06:59] Yuuuup
[00:09:09] milimetric: I think it's complaining about the union now
[00:09:13] ERROR 1222 (21000) at line 3: The used SELECT statements have a different number of columns
[00:09:31] ah, of course
[00:09:40] you gotta change * to the columns you need
[00:09:57] Oh, 'kay
[00:10:06] in each of the select * from ... that are unioned together
[00:11:21] Oh cool, I has data.
[00:12:12] Hm, zeroes for the perf data though. Sad.
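The "tail the last line + parse the date" approach suggested above can be sketched as a small shell snippet. The filename, column layout, and date format here are assumptions for illustration, not details from the channel:

```shell
#!/bin/sh
# Minimal sketch of "tail the file and parse the last date".
# Assumes a TSV whose first column is an ISO date; the file contents
# here are made up for the example.
tsv=$(mktemp)
printf '2014-01-06\t100\n2014-01-07\t200\n2014-01-08\t300\n' > "$tsv"

# Grab only the last line, then cut out the date field.
last_date=$(tail -n 1 "$tsv" | cut -f 1)
echo "last date in file: $last_date"

rm -f "$tsv"
```

Since the appended-to file is already sorted by time, reading one line from the end avoids scanning the whole file or needing write access to the database.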
[00:15:25] Maybe they're actually zero
[00:15:45] Well crap, they actually are
[00:16:01] I could swear I saw nonzero data there in the console
[00:18:08] Agh, bugs.
[00:32:39] sounds like your query's paying off the tribulations that birthed it. congrats marktraceur
[00:32:46] see you tomorrow, I'm off around the house
[00:33:54] 'kay
[00:34:11] Yeah, major bug means I have zero useful data for performance >.<
[00:34:30] Fixed it at least
[12:50:58] (CR) Nuria: [V: 1] "I hope a +1 is what is expected here, not very sure of my way arround gerrit yet." (1 comment) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105869 (owner: Jdlrobson)
[15:01:36] thanks milimetric
[15:31:42] brainwane, you're welcome, but who are you and what'd I do :)
[15:34:12] milimetric: It's Sumana! you helped me this morning with setup
[15:35:00] oh! Sumana, what happened to sumanah
[15:36:18] milimetric: I'm retrieving my old Freenode pw now, I should recover my identity soon :)
[15:36:26] yay :)
[15:59:00] FYI: New thread here: https://www.mediawiki.org/wiki/Talk:EventLogging/UserAgentAnonymization
[15:59:17] I couldn't find a public thread, so I took it to the wiki.
[15:59:45] nuria: ping
[16:00:06] hello
[16:01:17] Hi nuria. See my link above
[16:01:26] readingggg
[16:01:29] :)
[16:01:58] ya, I see, i think you are thinking of preprocessing ua to determine major/minor which is what i do not want to do for the very reason you state
[16:02:26] I thought it was clear that you were planning to strip things from the ua
[16:02:27] and an additional one
[16:02:37] So long as the processing is lossy, there's a potential issue.
[16:02:46] strip info not required to determine device/browser major/minor
[16:02:55] for example the "language"
[16:03:14] es_ES
[16:03:30] Indeed. I can imagine that the work done for performance testing does not make use of this information, but I have for tracking new editors in the past.
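The stripping nuria describes — dropping tokens like the es_ES locale that aren't needed to identify device/browser major/minor — might look like the following. The UA string and the regex are illustrative assumptions; the real anonymization rules live in the on-wiki proposal, not here:

```shell
#!/bin/sh
# Illustrative only: remove an es_ES-style locale token from a raw
# User-Agent string before storage. Both the example UA and the sed
# pattern are assumptions, not the actual EventLogging rules.
ua='Mozilla/5.0 (Linux; U; Android 4.0; es_ES; Nexus Build/ICS) AppleWebKit/534.30'
stripped=$(printf '%s' "$ua" | sed -E 's/[a-z]{2}[_-][A-Z]{2}; //')
echo "$stripped"
```

This is lossy by design, which is exactly the trade-off halfak raises: once the locale is gone from storage, no later analysis can recover it.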
[16:03:47] but then - in this case -
[16:03:56] I think it is better to make it part of the schema
[16:04:13] let me see if i can find a good example...
[16:04:18] Can you gather the user agent from javascript?
[16:04:49] Hmmm.. It appears that you can.
[16:04:55] You can, but in this case the UA we log is part of the request, we log from server side on varnish
[16:05:01] HTTP request, sorry,
[16:05:23] so we are logging the User-Agent header raw
[16:05:31] Yes...
[16:05:59] In two cases (varnishncsa) and pure server side PHP logging
[16:06:04] Makes sense?
[16:06:14] For your language example
[16:06:29] I'm not sure what would make sense. I might be missing the point you are making.
[16:06:57] Sorry, let me explain myself better
[16:07:09] For using "language" to identify editors
[16:07:38] I would add language as part of the event schema (these changes are restricted to event logging)
[16:07:50] So, I don't use language so much as the hash of the UA as a fingerprint.
[16:07:55] https://meta.wikimedia.org/wiki/Schema:NewEditorEdit this represents an event for example
[16:09:09] So, what we are removing is precisely the fingerprinting, yes. But in this case additional fields in event logging should allow you to track
[16:09:17] an individual user i believe
[16:09:24] Hmmm. Why are we removing this *before* storing?
[16:09:35] Why don't we remove this before release?
[16:10:08] because, according to our privacy policy, we cannot hold onto raw user agent data, i believe
[16:10:11] (Sorry if I sound aggressive. I don't mean to, but reading over my last few messages I can see that I sound that way.)
[16:10:54] Sure. But we can use it for a substantial period of time.
[16:10:59] no worries at alll.
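The "hash of the UA as a fingerprint" technique halfak mentions can be sketched like this. The choice of sha256 and the example UA string are assumptions; any stable hash of the raw header would serve the same grouping purpose:

```shell
#!/bin/sh
# Sketch of using a hash of the raw User-Agent as a client fingerprint:
# events from the same client hash to the same value, so you can group
# them without ever storing the UA itself. sha256 is an assumption here.
ua='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
fingerprint=$(printf '%s' "$ua" | sha256sum | cut -d ' ' -f 1)
echo "$fingerprint"
```

Note this only works while the UA is stored raw: pre-storage stripping (the proposal under discussion) changes the input and therefore the fingerprint, which is why halfak flags it.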
Any release dataset *I think* would have post-processed UAs, so that should be a given
[16:11:47] it is only 90 days, i believe
[16:12:52] So the purpose of processing before storage is so we don't have to build and maintain a system for anonymizing UAs in our database when the 90 day mark is reached?
[16:15:18] Yes, so we have made a best effort at anonymizing the data set somewhat (having in mind that no system will be perfect)
[16:16:07] I will talk about it with ori but maybe it is worth clarifying that on the wiki, eh?
[16:16:12] OK. This I can accept since I can't imagine maintaining a system that anonymizes at 90 days. Thanks! Can you add this to the lead of the proposal?
[16:16:15] Yes please :)
[16:19:54] Will do, right now. Let me make sure policy is 90 days, as my memory is terrible and i just answered that from memory.
[16:20:40] ok, yes, our actual policy is 90 days.
[16:20:59] That was my memory too. I believe the new privacy policy is still up for discussion.
[16:21:04] For purely "raw" data
[16:21:15] https://meta.wikimedia.org/wiki/Privacy_policy
[16:22:16] I don't believe our current policy has these restrictions.
[16:24:21] the soon-to-be-applied one?
[16:24:56] Yeah... I'm not sure how soon that is. Do you see a time table?
[16:25:55] nuria: can you add some links to gerrit changes somewhere?
[16:26:14] to what changes, sorry?
[16:26:22] I wanted to have a look at the code that went into the release you guys made yesterday and I didn't even know where to look
[16:26:32] I might be missing something, but I don't see the 90 day restriction. FWIW, I worked on an earlier draft for this privacy policy and that is where my memory comes from.
[16:27:15] but meh, I was just curious, so..
nevermind
[16:28:05] we did not release any code regarding anonymization, just a varnish config change, let me look for it
[16:28:16] https://gerrit.wikimedia.org/r/#/c/102817/
[16:28:23] thanks
[21:41:11] DarTar, but also maybe milimetric, where is a good place to stick TSVs on stat1? I think DarTar told me but that was some time ago.
[21:42:11] marktraceur: /a/public-datasets
[21:42:19] Ah K.
[21:42:21] or any relevant subdir of it
[21:42:26] DarTar: And where are those on the web?
[21:42:33] that will teleport it to stat1001
[21:42:45] http://stat1001.wikimedia.org/public-datasets/
[21:43:03] Fun
[21:43:13] Is it a cronjob that does the teleportation?
[21:43:16] rsync'ed every 30 mins
[21:43:29] 'kay
[22:18:50] (CR) Jdlrobson: [C: -2] Use NewEditorEdit schema [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105869 (owner: Jdlrobson)
[22:56:27] (PS1) Milimetric: Fixes Bug 58450 [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/106448
[22:56:41] (CR) Milimetric: [C: 2 V: 2] Fixes Bug 58450 [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/106448 (owner: Milimetric)
[23:10:25] milimetric: I am sumanah again :)
[23:10:34] woo hoo!
[23:10:46] I gotta run now, but congrats on your new selfdom
[23:11:00] you'll have to update me on how you're faring with wikimetrics tomorrow
[23:11:04] deal?
[23:11:05] laters all
[23:11:28] Sure!
[23:28:09] (PS2) Jdlrobson: Use NewEditorEdit schema for new graph [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105869
[23:28:10] (PS2) Jdlrobson: Story 1481: Collect graph data only for current month [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851
[23:30:49] Ironholds: Got a minute to talk blocks?
[23:31:20] halfak, totally!
[23:31:50] where/when/...?
[23:31:54] Do you know how far back the block log goes
[23:31:56] Here nto?
[23:31:58] *now
[23:32:46] how far back it goes?
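The "rsync'ed every 30 mins" teleportation from stat1's /a/public-datasets to stat1001 would correspond to a cron entry along these lines. The destination path, rsync flags, and cron syntax here are guesses at the shape of the real (puppet-managed) config, which is not shown in the log:

```
# Hypothetical crontab entry matching the behavior described above:
# sync /a/public-datasets from stat1 to stat1001 every 30 minutes.
# Destination and flags are assumptions, not the actual WMF config.
*/30 * * * * rsync -a --delete /a/public-datasets/ stat1001.wikimedia.org::public-datasets/
```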
good question
[23:32:51] I mean, in my research it goes to early 2006
[23:33:00] but that's mostly limited by things like when registration data became reliably available.
[23:33:17] it may in fact go earlier, and I just don't use some of the data. I do not recall :(
[23:33:26] Did you notice any interesting anomalies since 2006?
[23:34:35] in terms of data reliability, or just in the data overall?
[23:34:46] the former, not really; the latter, the big peak and spike in 2009 is...pretty noteworthy.
[23:35:42] Gotcha.
[23:35:55] thanks!
[23:36:08] It looks like the first block I can find in the log for enwiki is 20041223035333
[23:37:58] neat!
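The timestamp 20041223035333 quoted above is in MediaWiki's 14-digit YYYYMMDDHHMMSS format; a quick way to make it readable:

```shell
#!/bin/sh
# Convert a MediaWiki 14-digit timestamp (YYYYMMDDHHMMSS) into a
# human-readable form by regrouping the digits.
ts=20041223035333
iso=$(printf '%s' "$ts" | sed -E 's/^(....)(..)(..)(..)(..)(..)$/\1-\2-\3 \4:\5:\6/')
echo "$iso"
```

So the earliest enwiki block halfak found lands on 23 December 2004, about a year before the early-2006 cutoff Ironholds uses in his research.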