[10:18:40] (PS1) Gilles: Add options menu metrics [analytics/multimedia] - https://gerrit.wikimedia.org/r/169667
[10:19:15] (CR) Gilles: [C: 2] "Tested by querying the production SQL server" [analytics/multimedia] - https://gerrit.wikimedia.org/r/169667 (owner: Gilles)
[10:19:31] (Merged) jenkins-bot: Add options menu metrics [analytics/multimedia] - https://gerrit.wikimedia.org/r/169667 (owner: Gilles)
[11:04:48] (PS1) Gilles: Add options menu metrics [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169673
[11:05:27] (CR) Gilles: [C: 2] "Tested locally" [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169673 (owner: Gilles)
[11:06:44] (CR) Gilles: [V: 2] Add options menu metrics [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169673 (owner: Gilles)
[13:55:06] (CR) Ottomata: [WIP] Add schema for edit fact table (1 comment) [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/167839 (owner: QChris)
[13:57:10] Analytics / Refinery: Spike: Assess feasibility and effort to add fields to webrequest logs - https://bugzilla.wikimedia.org/72651#c1 (christian) NEW>RESO/FIX Not sure where to respond since it covers Trello, Etherpad, Email, IRC, and now bugzilla. Responding in bugzilla, since this is at least a...
[13:57:51] (PS1) Gergő Tisza: Fix optout graph description [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169698
[13:59:53] Analytics / Refinery: Spike: Assess feasibility and effort to add fields to webrequest logs - https://bugzilla.wikimedia.org/72651#c2 (christian) The estimations from comment #1 assume that having those fields in HDFS (not udp2log) is sufficient.
[14:02:30] uhh
[14:32:31] (CR) Gilles: [C: 2] Fix optout graph description [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169698 (owner: Gergő Tisza)
[14:33:14] (CR) Gilles: [V: 2] Fix optout graph description [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/169698 (owner: Gergő Tisza)
[15:18:41] Analytics / Refinery: Make webrequest partition validation handle races between time and sequence numbers - https://bugzilla.wikimedia.org/69615#c16 (christian) Happened again for: 2014-10-29T01/2H (on upload)
[15:19:48] !log Marked raw upload webrequest partition for 2014-10-29T01/2H ok (See {{bug|69615#c16}})
[15:40:25] Analytics / Refinery: Raw webrequest bits and upload partition for 2014-10-28T19/1H not marked successful - https://bugzilla.wikimedia.org/72679 (christian) NEW p:Unprio s:normal a:None The bits and upload webrequest partitions [1] for 2014-10-28T19/1H have not been marked successful. What...
[15:41:11] Analytics / Refinery: Raw webrequest bits and upload partition for 2014-10-28T19/1H not marked successful - https://bugzilla.wikimedia.org/72679#c1 (christian) NEW>RESO/FIX For bits, only cp1056 was affected. The affected period is the second 2014-10-28T19:52:57. For that second, we saw 178 dupli...
[15:41:23] Analytics / Refinery: analytics1021 getting kicked out of kafka partition leader role on 2014-10-27 ~07:12 - https://bugzilla.wikimedia.org/72550 (christian)
[15:41:31] qchris_away:
[15:41:38] should we maybe just try to reinstall an21?
[15:41:40] or, hm
[15:41:49] it was kinda doing this before the previous kafka cluster install too, wasn't it?
[15:42:18] the fact that this really only happens on this node (at least, almost always) is pretty annoying
[15:44:50] ottomata: is there something special about the puppet applied there?
[15:45:59] no
[15:46:18] i hear so much about that box :)
[15:46:38] maybe run that puppet on a different physical box?
[15:46:56] i had a computer once that wouldn't turn on every time I tried to give it away, but was fine otherwise
[15:50:19] well, there are 4 brokers
[15:50:25] and this is really the only one that causes problems
[15:50:39] i'd like to just replace it with another machine and see what happens
[15:50:42] but i don't have another one like it
[15:50:46] i could take a hadoop worker down for it
[15:53:32] (PS18) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl)
[16:03:46] (CR) Milimetric: Add ability to global query a user's wikis (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl)
[16:41:17] ottomata: if you have spare cycles to reinstall analytics1021 ... sure, why not.
[16:41:47] However, once we have deduplication in place, the issue won't affect us any longer.
[16:42:02] (And we need to do deduplication anyways)
[16:42:11] qchris, i know the issue won't affect us, but i don't want the issue to happen
[16:42:13] it shouldn't happen
[16:42:20] Agreed.
[16:42:23] i'm not so sure reinstalling will help us
[16:42:31] Neither am I.
[16:42:40] i suspect getting a different node will
[16:42:41] but i do not know
[16:42:49] it could be something in the row's networking
[16:42:51] who knows
[16:43:00] ottomata: can we get reedy access to stat1002?
[16:43:02] With the current load on the cluster, we could sacrifice a worker.
[16:43:19] milimetric: sure, RT ticket please! :)
[16:43:26] ja we could
[16:51:14] ottomata: So about your idea of sacrificing a worker for analytics1021 ... should I file an RT ticket for that?
[16:52:55] Analytics / Refinery: analytics1021 getting kicked out of kafka partition leader role on 2014-10-27 ~07:12 - https://bugzilla.wikimedia.org/72550#c3 (christian) NEW>RESO/FIX (In reply to christian from comment #1) > (Still pending on check whether leader re-election caused loss/duplicates) Bug 7...
[16:53:50] qchris, let's discuss tomorrow
[16:53:54] k
[16:54:31] qchris, ellery is asking about kafkatee
[16:54:46] i looked at one day worth of edit logs yesterday
[16:54:51] and it looked pretty good to me
[16:54:55] what do we need to do to vet?
[16:55:21] Make sure the output roughly matches the udp2log output (minus ssl)
[16:55:33] In terms of checks ...
[16:55:51] I would verify that the distribution of values per field roughly matches.
[16:56:01] The volume of rows should match.
[16:56:15] values per field, interesting
[16:56:17] i looked at hostname
[16:56:21] and that roughly matched
[16:56:32] but, i only checked edits
[16:56:55] Just repeat that with "cut -f $ROW" instead of "cut -f 1" :-)
[16:57:00] s/ROW/COLUMN/
[16:57:19] eyeballing the counts is good enough?
[16:57:24] should I check just one day?
[16:57:32] I think that eyeballing is sufficient.
[16:57:34] if I did that for 1 day on a few outputs we have now
[16:57:35] that should be good?
[16:57:41] (At least for the sampled files)
[16:57:48] edits is unsampled
[16:57:51] that's why i was checking it
[16:57:55] The unsampled should match pretty closely (sorting might be needed of course :-) )
[16:58:15] zero is unsampled too, and matched exceptionally well.
[16:58:20] (last time)
[16:59:10] I would consider more than one day.
[16:59:36] what does exceptionally well mean?
[16:59:50] I'd probably check the most recent day, another day in the week, and the same day in a different week.
[17:00:16] like, off by .06% ok?
[17:00:18] for a given host?
[17:00:21] "exceptionally well" means that (when ignoring the difference in logrotation) diff had 0 exit status :-)
[17:00:37] yeah i didn't ignore logrotation
[17:00:38] hm
[17:00:50] you would trim the files so that the containing timestamps are the same?
[17:01:03] For diffing: yes. Otherwise, no.
[17:01:13] right
[17:01:14] hm, ok
[17:02:21] Being off by .06% is perfectly ok, if every line of udp2log output is in kafkatee output.
[17:02:52] (for unsampled files that is :-) )
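
(Aside: a minimal sketch of the vetting approach qchris describes above — restrict both unsampled files to the same window, sort and diff them, then eyeball per-column value counts. The file names, day, and column choice are placeholders for illustration, not the actual script used on stat1002.)

    #!/bin/bash
    # Hypothetical inputs: one day of the unsampled "edits" stream from each pipeline.
    UDP2LOG=edits.udp2log.tsv
    KAFKATEE=edits.kafkatee.tsv
    DAY=2014-10-28

    # Unsampled streams should match line for line once both files cover the same
    # time window (their log rotation boundaries differ), so diff should exit 0.
    diff -q <(grep "$DAY" "$UDP2LOG" | sort) <(grep "$DAY" "$KAFKATEE" | sort) \
        && echo "identical for $DAY" || echo "differences for $DAY"

    # Per-field distribution check: count the values in one column (here column 1,
    # the hostname) in each file and compare the two summaries by eye.
    for f in "$UDP2LOG" "$KAFKATEE"; do
        echo "== $f =="
        grep "$DAY" "$f" | cut -f 1 | sort | uniq -c | sort -rn | head
    done
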
[17:05:27] ottomata: he filed https://rt.wikimedia.org/Ticket/Display.html?id=8773
[17:06:10] Analytics / Refinery: Raw webrequest bits and upload partition for 2014-10-28T19/1H not marked successful - https://bugzilla.wikimedia.org/72679 (christian)
[17:06:11] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian)
[17:06:23] (PS19) Mforns: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl)
[17:14:42] !log Marked raw bits webrequest partition for 2014-10-28T19/1H ok (See {{bug|72679}})
[17:14:48] !log Marked raw upload webrequest partition for 2014-10-28T19/1H ok (See {{bug|72679}})
[17:18:09] (PS1) Yurik: updated scripts (misc changes) [analytics/zero-sms] - https://gerrit.wikimedia.org/r/169750
[17:19:01] (CR) Yurik: [C: 2 V: 2] updated scripts (misc changes) [analytics/zero-sms] - https://gerrit.wikimedia.org/r/169750 (owner: Yurik)
[17:24:34] (CR) Milimetric: [C: 2] Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl)
[17:24:43] (Merged) jenkins-bot: Add ability to global query a user's wikis [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129858 (owner: Terrrydactyl)
[17:25:36] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages since 2014-08-16 ~07:30 - https://bugzilla.wikimedia.org/69666 (christian)
[17:25:36] Analytics / Refinery: Raw webrequest partition monitoring did not flag data for 2014-08-18T13:..:.. as valid for text caches - https://bugzilla.wikimedia.org/69854 (christian)
[17:25:39] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian)
[17:28:30] ottomata, fun question for you
[17:28:44] is there any way for me to suppress the endless load messages that come from launching hive on the command line?
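
(Aside: the question above goes unanswered in this log. For what it's worth, most of the Hive CLI's startup and logging chatter goes to stderr, so a hedged workaround is silent mode plus a stderr redirect; the query below is only a placeholder.)

    # -S enables silent mode; redirecting stderr hides the remaining startup noise,
    # leaving only the query results on stdout.
    hive -S -e 'SHOW TABLES;' 2>/dev/null

    # For an interactive session, the noise can be sent to a file instead:
    hive -S 2>/tmp/hive-startup.log
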
[17:28:55] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian)
[17:29:10] Analytics / Refinery: Raw webrequest partitions for 2014-10-20T02:xx:xx not marked successful - https://bugzilla.wikimedia.org/72252 (christian)
[17:29:24] Analytics / Refinery: Raw webrequest partitions for 2014-10-20T10/1H not marked successful - https://bugzilla.wikimedia.org/72295 (christian)
[17:29:38] Analytics / Refinery: Raw webrequest bits and upload partition for 2014-10-28T19/1H not marked successful - https://bugzilla.wikimedia.org/72679 (christian)
[17:29:54] Analytics / Refinery: Raw webrequest partitions for 2014-10-20T10/1H not marked successful - https://bugzilla.wikimedia.org/72295 (christian)
[17:30:10] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian)
[17:30:11] Analytics / Refinery: Make webrequest partition validation handle races between time and sequence numbers - https://bugzilla.wikimedia.org/69615#c17 (christian) Happened again for: 2014-10-29T08/2H (on bits)
[17:31:03] !log Marked raw bits webrequest partition for 2014-10-29T08/1H ok (See {{bug|69615#c17}})
[17:31:08] !log Marked raw bits webrequest partition for 2014-10-29T09/1H ok (See {{bug|69615#c17}})
[17:57:35] milimetric, nuria__: I think there is no way to mock validate_users as it is now...
[17:58:04] what do you think about making this method an instance method of the ValidateCohort class?
[17:58:25] mforns_lunch: i need to look at the code but sure, it might be; in order to mock objects we have had to refactor here & there
[17:58:30] no other code uses it
[17:58:46] ok
[17:58:49] mforns_lunch: i thought mock let you mock module members the same way as class members
[17:58:58] one sec, lemme read up on that
[17:59:00] milimetric, mforns: it does, yes
[17:59:16] ok
[17:59:35] I've tried that too
[17:59:43] heading to cafe, back shortly
[18:00:09] mforns, milimetric: but i see what you mean, the method is just a utility one for the class
[18:01:06] mforns, milimetric: it could be a class method and should receive a session, not create one
[18:01:17] milimetric, nuria: I tried - @mock.patch('wikimetrics.models.validate_users', side_effect=mock_validate_users)
[18:01:30] before the test, substituting the method
[18:01:38] mforns, i tried that before w/o success
[18:01:43] mforns: It could be an instance method, that would be fine with me
[18:01:44] and this works fine from the test scope
[18:01:47] but mocking should be simple, like:
[18:01:54] but not from the ValidateCohort instance scope
[18:01:58] import ...validate_cohort
[18:02:07] validate_cohort.validate_users = Mock()
[18:02:12] milimetric: only if the test dependency tree
[18:02:13] I tried that too, and no luck
[18:02:25] for that class gets instantiated BEFORE the app dependency tree
[18:02:26] sorry
[18:02:27] i did it just now in iPython, what was the problem you ran into
[18:02:36] it's the app loading modules
[18:02:39] I did not try: validate_cohort.validate_users = Mock()
[18:03:33] mforns: the patch approach above is missing 'validate_cohort' from the path to 'validate_users'
[18:03:44] ok
[18:03:49] should be wikimetrics.models.validate_cohort.validate_users
[18:03:51] trying it now
[18:03:54] ok
[18:04:19] but any of those should work; if they don't, let us know what problems come up - maybe they're signs of something worse
[18:04:51] milimetric: you can only have 1 python module per name
[18:05:04] if the app module gets loaded "before" your mock
[18:05:12] i do not think you can mock it
[18:05:19] what's the "app module"
[18:05:34] teh "real" module
[18:05:36] *the
[18:05:46] the-no-mock-module
[18:05:50] this seemed to work
[18:05:51] !!!
[18:05:55] we're not mocking the module though
[18:06:01] mforns: all right, all set then
[18:06:02] we're just mocking a method from it
[18:06:29] mforns: make sure to run all the tests though, I'm not sure if this affects the other execution paths (like the mock sticks around but for how long...
[18:06:33] )
[18:06:46] ok, I'll make sure
[18:07:20] thanks guys
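
(Aside: a sketch of the two patching approaches discussed above, assuming validate_users is a module-level function in wikimetrics/models/validate_cohort.py; the test name and the stand-in behaviour are illustrative, not the actual wikimetrics test suite.)

    import mock  # the standalone mock package, as typically used with Python 2 at the time

    # Patch the name where it is looked up, i.e. in the validate_cohort module,
    # not in the wikimetrics.models package:
    @mock.patch(
        'wikimetrics.models.validate_cohort.validate_users',
        side_effect=lambda *args, **kwargs: None,  # stand-in for the real check
    )
    def test_validate_cohort_without_touching_the_db(validate_users_mock):
        pass  # ... exercise ValidateCohort here; the real validate_users is never hit ...

    # Equivalent manual substitution; restore the original afterwards, or the
    # mock leaks into other tests:
    from wikimetrics.models import validate_cohort
    original = validate_cohort.validate_users
    validate_cohort.validate_users = mock.Mock()
    try:
        pass  # ... exercise ValidateCohort here ...
    finally:
        validate_cohort.validate_users = original
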
[18:30:12] see you tomorrow folks
[19:34:25] Is there, or is there going to be, an API for wikimetrics for creating cohorts and getting data about them?
[19:35:25] I've got plans for developing a dashboard that would show lots of relevant information about users who are part of the same Wikipedia Education Program course.
[19:36:12] and one of the things I'd like to be able to do is to show regularly updated statistics like total edits and total bytes added, for a cohort of all the students in a course.
[20:06:08] ragesoss, good question! milimetric?
[20:06:12] also, holy hell how is it 4pm?
[20:06:22] sorry i'm on a call
[20:06:27] ragesoss: i'll try to help soon
[20:31:12] milimetric, when you get out: where does the stuff stuck in public-datasets come out, well, publicly? What uri_host etc would I need to point someone to a file in that directory?
[20:31:39] Ironholds: http://datasets.wikimedia.org/
[20:31:41] that's the root
[20:31:50] with various things being sent there via rsync
[20:31:53] so it won't happen right away
[20:31:58] ta
[20:32:39] so http://datasets.wikimedia.org/public-datasets/ equals the combination of stat2 and stat3's /a/public-datasets/?
[20:35:18] ragesoss: you do not need an API for that, you can already get that data if it's public
[20:35:43] *already get that data if your report is public
[20:36:42] ragesoss: for example,
[20:37:28] this ui (made for a different use case) shows public data from wikimetrics: https://metrics.wmflabs.org/static/public/dash/#projects=ruwiki,itwiki,dewiki,frwiki,enwiki,eswiki,jawiki/metrics=RollingActiveEditor
[20:50:33] nuria__: for a specific cohort of users?
[20:51:11] that's my use case: I want regular data on (for example) the 20 users participating in X course.
[20:51:17] ragesoss: yes. That functionality has been there for a while.
[20:52:06] ragesoss: see for example: https://metrics.wmflabs.org/static/public/167668.json
[20:53:00] nuria__: isn't that just a public report on a static cohort? with the cohort itself added manually?
[20:54:02] ragesoss: if you create a bytes added report with your cohort and make it public & recurrent (such data is calculated daily)
[20:54:21] my use case involves a large number of courses (which may be adding or removing members at times)
[20:54:22] you can have a set
[20:54:42] ragesoss: wikimetrics does not support adding or removing users from cohorts
[20:55:04] ragesoss: removing we think we will support for privacy reasons in the near future
[20:55:10] nuria__: and it also does not support creating cohorts via any method except the web interface?
[20:55:13] ragesoss: adding will not be supported
[20:55:41] ragesoss: our rationale is that cohorts that change render past reports not relevant
[20:56:14] ragesoss: for example, if i add 20% more users and compute a "total" for bytes added for today, results are not comparable to yesterday's
[20:56:35] ragesoss: we consider that use case to be a new cohort. Hopefully this makes sense. Please
[20:56:40] let me know otherwise.
[20:57:33] nuria__: yes, that was my understanding, which is why I'm interested in any plans that would facilitate automatic creation of new cohorts.
[20:58:13] ragesoss: we have no plans to add that feature in the near term but you can talk to kevinator about it
[20:59:42] nuria__: If it's not on the roadmap yet, I expect it will be more expedient to just figure out ways of querying the data directly from a replica db.
[21:00:21] ragesoss: for data, you have the public files. Does that not suit your use case?
[21:01:54] ragesoss: data is not persistent as you probably know, it's only present for 30 days. To be available forever, the best option is a public file that holds your report. Let us know if you think that will not work for you.
[21:02:12] nuria__: you mean dumps.wikimedia.org?
[21:02:41] or the public files from wikimetrics?
[21:02:43] ragesoss: no, i mean https://metrics.wmflabs.org/static/public/
[21:02:57] these are all public reports, see for example a timeseries:
[21:03:22] ragesoss: https://metrics.wmflabs.org/static/public/datafiles/RollingActiveEditor/ruwiki.json
[21:03:27] nuria__: those would work, if we could automate the creation of new cohorts (so that every time the composition of a class changes, or a new class is added, we create a new cohort).
[21:04:13] ragesoss: as i said there are no plans to automate creation of cohorts but you can talk to kevinator about it.
[21:04:38] ragesoss: if you can do it by hand for now, the public data files will enable you to create dashboards/reports
[21:05:34] thanks much nuria__. The timeline for my project is early 2015, so I think I'll plan on finding ways other than wikimetrics to get the data.
[21:08:30] Analytics / EventLogging: Add test flag to EventLogging - https://bugzilla.wikimedia.org/72365#c4 (christian) (In reply to nuria from comment #3) > >Vagrant can run EventLogging. > > Can run the client code, yes. And it can also run the relevant server code. EventLogging comes with a dedicated “devse...
[21:10:40] milimetric, ^
[21:13:10] qchris: still around?
[21:13:17] yup
[21:13:23] wassaaap?
[21:13:57] i wrote a little crappy script to automatically take a kafkatee file and a udp2log file based on name and date and group by hostname and day and count
[21:14:03] wanna vet the vetter?
[21:14:14] haha :-)
[21:14:15] Sure.
[21:14:32] stat1002:/home/otto/kafkatee_vet.sh and kafaktee_vet_functions.sh
[21:14:49] * qchris likes files that end in ".sh"
[21:14:53] * qchris looks
[21:18:28] really just check the 2 functions at the bottom of that functions file
[21:18:38] 3 functions
[21:18:57] grep -vE '^ssl' might miss the ulsfo ssl terminators.
[21:19:05] milimetric: yt?
[21:19:10] But ulsfo is currently low volume.
[21:19:21] (Or has been when I last checked)
[21:19:45] And also some udp2log tsvs don't have ssl included. So it might not be relevant.
[21:20:22] hmm
[21:20:30] i was getting them in my checks
[21:20:36] with edits, anyway
[21:20:38] hi DarTar
[21:20:42] sorry, I'm back
[21:20:45] hey
[21:21:01] ulsfo ssl terminators?
[21:21:15] ragesoss: Hi, ragesoss! it is not on our roadmap. How is your SQL… have you seen Quarry, here’s a neat example: http://quarry.wmflabs.org/query/290
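
(Aside: for the numbers ragesoss wants — total edits and bytes added for a fixed list of users — a rough sketch of the kind of replica-database query Quarry can run. The user names and timestamp are placeholders, the bytes-added arithmetic is a simple approximation rather than the wikimetrics metric definition, and the schema is as of 2014, before the actor/comment refactoring.)

    -- Run against a wiki's Labs replica (e.g. enwiki_p).
    SELECT r.rev_user_text AS user_name,
           COUNT(*) AS total_edits,
           SUM(CAST(r.rev_len AS SIGNED)
               - CAST(COALESCE(p.rev_len, 0) AS SIGNED)) AS bytes_added
    FROM revision r
    LEFT JOIN revision p ON p.rev_id = r.rev_parent_id
    WHERE r.rev_user_text IN ('ExampleStudentA', 'ExampleStudentB')  -- placeholder cohort
      AND r.rev_timestamp >= '20140901000000'                        -- placeholder course start
    GROUP BY r.rev_user_text;
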
[21:21:18] You'd only detect them by IP, as they are plain caches.
[21:21:21] aren't they?
[21:21:37] no
[21:21:40] not ulsfos
[21:21:50] ssl hostnames
[21:21:51] real quick, I saw the proposal – all of the options on the table kind of make sense to me, I think we should work with Product to help them figure out what an app “RNAE” looks like
[21:22:15] qchris, should I just exclude ulsfo from counting?
[21:22:24] right now, I have no clear sense of what this data will feed into
[21:22:38] 'ssl4' does not match anything in operations/dns.
[21:23:00] kevinator: yeah, I think the SQL route will be the way we'll go. My own SQL is pretty basic, but I'll have developers implementing my project who should be able to handle it. (And there's always the tried-and-true "beg for favors from halfak or yuvipanda when you get stuck" method.)
[21:23:13] milimetric: we actually know that (at least for now) the two populations of desktop vs mobile NAE are fairly segregated
[21:23:20] Also ... previously ssl requests were terminated by cp4... machines.
[21:23:23] ragesoss, :P
[21:23:31] I like hacking on SQL with people :)
[21:23:41] ditto ragesoss
[21:23:57] Analytics / EventLogging: Add test flag to EventLogging - https://bugzilla.wikimedia.org/72365#c5 (nuria) You are right, my bad. Should have looked at code instead of relying on memory. Now, Validation point being made I see little value in adding a test flag and I am of the opinion that we should not...
[21:24:05] :D
[21:24:24] milimetric: so I guess the short answer is: I don’t know :-/
[21:24:36] DarTar: that makes sense, and your findings are valuable for the "newly" metrics
[21:24:40] ottomata: Run
[21:24:42] ottomata: zgrep cp4.*https:// /a/squid/archive/edits/edits.tsv.log-20141002.gz | head
[21:24:48] (I don’t know if the idea of fractions of an active editor is the right way to go)
[21:24:53] but my point was more practical - like, if we don't have to use ServerSideAccountCreation, let's not
[21:24:56] ottomata: on stat1002. That shows that ssl is terminated by the plain caches in ulsfo.
[21:25:05] because it makes it easier for us to get this breakdown done
[21:25:08] ottomata: That is special about ulsfo IIRC.
[21:25:29] because the notion of an “active editor” is an artifact we shouldn’t take too seriously ;)
[21:25:56] gotcha on the fractions
[21:25:57] milimetric: yes that’s a valid point, so how about we start prioritizing those metrics that are purely based on MW tags
[21:26:11] and where the definition is fairly straightforward
[21:26:11] DarTar: that works for me if it works for kevinator
[21:26:12] ottomata: I guess the
[21:26:14] ottomata: sed -n -e "${start_timestamp},${end_timestamp}p"
[21:26:31] it looks like the edge cases are also those with dependencies on EL data
[21:26:31] yeah, i'm not using that
[21:26:33] ottomata: in truncate_files is not doing what it is meant to do.
[21:26:36] i decided to just grep for a single day within two files
[21:26:38] so let’s not work on those yet ;)
[21:26:40] ottomata: ok.
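
(Aside: a hedged alternative to the sed -n range qchris flags above — filter on the timestamp field itself rather than on line addresses. This assumes tab-separated rows with an ISO 8601 timestamp in field 3, so plain string comparison respects time order; the file names and window are placeholders.)

    # Keep only rows whose timestamp falls inside [START, END); after trimming both
    # files to the same window they can be sorted and diffed fairly.
    START='2014-10-02T00:00:00'
    END='2014-10-03T00:00:00'
    awk -F'\t' -v start="$START" -v end="$END" '$3 >= start && $3 < end' \
        edits.tsv > edits.trimmed.tsv
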
[21:26:46] kevinator: ^
[21:27:14] yeah, and if there's a definition that lets us get around the EL data without losing much (like the stuff halfak and I were chatting about) then that seems like a good next step
[21:28:20] I think there’s no problem with having, say, mRAE limited to editors who meet the 5 edit threshold on that specific target site
[21:28:39] milimetric, DarTar: that prioritization works for me (do the metrics that aren’t based on MW tags)
[21:28:54] we want to do the ones that *are*
[21:28:56] that *are* based on MW tags, right? :)
[21:29:02] qchris
[21:29:06] not on EL, that is
[21:29:09] how about, instead of grep -v ssl
[21:29:12] ah yes, misread
[21:29:14] awk '$9 !~ "^https" { print $1 " " $3 }'
[21:29:19] k, sounds good
[21:29:25] ottomata: not sure about the "jobs -p" invocation. Is the job guaranteed to show up there when it is launched? Can it be that the machine is quicker?
[21:29:34] it can be qchris
[21:29:35] but i do not care
[21:29:42] :)
[21:29:43] I’m about to put my story writing hat on and log some bugs… which metrics depend on MW tags?
[21:29:51] if it is quicker, then the pid won't be there, and the script won't wait
[21:29:52] so then DarTar, I can write the SQL for that and I'll just double check with Aaron before we deploy anything
[21:29:53] but then the FAIL computation might be wrong.
[21:30:03] if there is an error exit code, it won't catch it
[21:30:07] yeah, but i don't care
[21:30:10] or better - just poke the list when it's deployed in staging so you guys can play around
[21:30:10] milimetric: sgtm
[21:30:13] one-off vet
[21:30:16] yeah
[21:30:18] k.
[21:30:21] k
[21:30:27] i just wanted it to wait until the jobs were done
[21:31:18] qchris: awk '$9 !~ "^https" { print $1 " " $3 }'
[21:31:18] ?
[21:31:28] do you think I need that and a ! ^ssl check?
[21:31:39] I think the awk is fine.
[21:31:53] but just 'grep -v https://' would be fine too
[21:32:09] i think that is less precise
[21:32:13] referrer field might match
[21:32:40] you're right ... you're grepping before the cut.
[21:32:54] when cutting first, grep is much faster.
[21:33:14] like cut -f 1,3,9 | grep -v https:// | cut -f 1,2
[21:33:16] But it's ok.
[21:33:36] Whether or not you need the ssl thing depends on the file you vet.
[21:33:54] For the edit tsvs, I think it should matter.
[21:34:12] but, for ssls, $9 will be https, right?
[21:34:14] so this awk should be enough?
[21:34:19] But you can try without ... and if vetting fails, one can filter for ssl.
[21:34:26] true!
[21:34:30] Right about $9 and https
[21:35:56] I think the vetter is fine.
[21:36:02] ottomata: ^
[21:36:06] danke!
[21:36:11] thanks
[21:36:18] maybe my counts will be even closer with the https check
[21:36:20] i'm trying zero now
[23:36:57] Analytics / Wikimetrics: Dashiki uses Mediawiki for storage - https://bugzilla.wikimedia.org/68448#c1 (Kevin Leduc) Collaborative tasking logged @ http://etherpad.wikimedia.org/p/analytics-68448
[23:41:12] Analytics / Wikimetrics: Story: Dashiki uses Mediawiki for storage - https://bugzilla.wikimedia.org/68448 (Kevin Leduc)
[23:47:54] hi analytics, I've forgotten how I gained access to stat1001, what is its bastion?
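
(Closing aside on the kafkatee vetting thread above: a minimal sketch of the per-host comparison ottomata and qchris converge on — drop https requests by URL field, then count lines per cache host for one day in each file. The field positions follow the discussion ($1 host, $3 timestamp, $9 URL); the file names are placeholders, not the actual kafkatee_vet.sh.)

    # Per-host request counts for one day, excluding https, for one input file.
    count_hosts() {
        local file="$1" day="$2"
        awk -F'\t' -v day="$day" '$3 ~ day && $9 !~ "^https" { print $1 }' "$file" \
            | sort | uniq -c | sort -rn
    }

    # Compare the two summaries; small discrepancies (e.g. ~0.06%) are expected.
    diff <(count_hosts edits.udp2log.tsv 2014-10-28) \
         <(count_hosts edits.kafkatee.tsv 2014-10-28)
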