[09:25:44] Analytics / Tech community metrics: "Volume of open changesets" graph should show reviews pending every month - https://bugzilla.wikimedia.org/70278#c29 (Alvaro) Quim, we plan to add the Verified -1 and -2 filtering today but the generation of the new metrics won't be available until this night. As so...
[13:07:34] qchris: hi!
[13:07:39] enjoy your time off?
[13:07:41] Hi milimetric
[13:08:08] Ja, it was a nice week :-)
[13:08:32] Finally seeing faces to the names of gerrit people :-)
[13:08:46] oh wow, met them for the first time?!
[13:08:51] Yup.
[13:09:06] that's cool :) did you guys all bring passports? :P
[13:09:24] For keysigning?
[13:09:29] yea
[13:09:40] Hahaha. No.
[13:10:30] It seems I missed some fun around EEVS
[13:10:41] or rather ... the thing we are no longer allowed to call EEVS.
[13:10:56] well - latest news is that the place where mysql server's datafile is mounted just ran out of space
[13:11:07] i was just trying to figure out how to fix that
[13:11:17] Ouch :-/
[13:11:48] well, it only had 2G, I should've checked a long time ago
[13:12:03] and there's a lot of data we can clean up, but in general we should probably have at least 10G
[13:13:15] not on /srv?
[13:13:51] it's on /dev/vda2
[13:14:03] assuming labs, use /srv! :)
[13:14:43] what's /srv? I don't see that in df
[13:14:51] i mean, the wikimetrics code is there
[13:14:54] but nothing else
[13:15:31] milimetric: if you enable the srv role, /srv has all the space that your instance had
[13:15:47] by default labs instances don't use all the space available
[13:18:55] YuviPanda: is that safe to enable and run puppet on a "production" machine?
[13:19:13] depends on what you mean by 'production'?
[13:19:25] it's the instance that runs wikimetrics
[13:19:28] and how long puppet has been off :)
[13:19:36] puppet is run frequently
[13:19:41] then it should be fine
[13:19:55] I can help too if you want
[13:20:43] so - current situation is mysql is erroring out because the datafile reached full size
[13:21:01] like, it's at /var/lib/mysql and that used up its allocated 2G
[13:21:34] so I'm assuming I'd enable lvm::srv and then somehow move the datafile to /srv/wikimetrics-mysql
[13:21:44] yup
[13:21:48] which project is this?
[13:21:52] I can do this for you if you want :)
[13:22:44] analytics project, wikimetrics1 instance
[13:23:16] it's backed up as of the last time it was functional so it should survive a catastrophe
[13:23:22] lemme check the backup real quick
[13:24:06] milimetric: alright, there's a weird /srv there, let me know when I can proceed
[13:25:01] YuviPanda: backup looks healthy
[13:25:06] alright
[13:25:08] * YuviPanda goes ahead
[13:25:09] what do you mean by weird srv?
[13:25:20] the code is checked out in /srv/wikimetrics
[13:25:27] yeah
[13:25:29] but it's just a folder /srv
[13:25:37] it's not mounted on the /srv volume which has the extra space
[13:25:45] oh, i see
[13:30:01] milimetric: I'm logging my actions on -labs
[13:47:47] Analytics / General/Unknown: X-Analytics header is "php=zend;php=zend" instead of "php=zend" on bits for some requests - https://bugzilla.wikimedia.org/70463#c3 (christian) The 'php=zend;php=zend' is gone. But it seems that now requests to https://bits.wikimedia.org/geoiplookup https://bits.wikime...
[13:50:27] milimetric: check https://metrics.wmflabs.org/?
[13:51:07] looks good YuviPanda
[13:51:28] milimetric: cool.
the puppet setup of mysql in wikimetrics isn't very ideal, I think I should submit a patch
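The incident above — MySQL's datadir at /var/lib/mysql silently filling its 2G allocation — is the kind of thing a trivial free-space check on a cron would have caught earlier ("I should've checked a long time ago"). A minimal sketch, assuming the paths and the ~10G floor mentioned in the conversation; this is not an existing wikimetrics script:

```python
#!/usr/bin/env python3
"""Minimal free-space check for the volumes that bit us above.
Paths and threshold are assumptions taken from the chat, not real config."""
import shutil
import sys

WATCHED_PATHS = ["/var/lib/mysql", "/srv"]   # mounts worth watching (assumed)
MIN_FREE_BYTES = 10 * 1024 ** 3              # "we should probably have at least 10G"

def check(path):
    usage = shutil.disk_usage(path)
    ok = usage.free >= MIN_FREE_BYTES
    print("%s: %.1fG free of %.1fG -> %s"
          % (path, usage.free / 1024 ** 3, usage.total / 1024 ** 3,
             "OK" if ok else "LOW"))
    return ok

if __name__ == "__main__":
    # Run every check (list, not generator, so nothing short-circuits),
    # and exit non-zero so a cron MAILTO can alert on low space.
    sys.exit(0 if all([check(p) for p in WATCHED_PATHS]) else 1)
```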
[13:51:43] /dev/mapper/vd-second--local--disk 68G 794M 64G 2% /srv
[13:51:54] milimetric: so /srv has 64G of free space, so mysql shouldn't run out of space anytime soon
[13:52:51] yeah, patches are welcome - ottomata or maybe qchris can look them over
[13:52:58] yeah
[13:53:10] it's directly using the mysql package instead of our module
[13:54:45] milimetric: do you usually leave puppet on or off on that machine?
[13:55:00] off
[13:55:08] why?
[13:55:11] we just run it every time we upgrade it
[13:55:13] it's generally bad practice...
[13:55:44] it's on in qa and dev and it does really annoying things like restart everything
[13:55:55] hmm
[13:56:05] then puppet should be fixed, but I'll leave it disabled for now
[13:56:07] restart apache, queue, scheduler, etc.
[13:56:32] not to mention people break puppet constantly and we're running with 2 patches on a side-branch right now
[13:57:14] heh
[13:57:16] ok
[14:00:39] YuviPanda: thanks a lot though
[14:00:45] :D
[14:00:47] milimetric: yw!
[14:01:04] milimetric: do you want to send out an email of some sort? 'it was down, back up now'?
[14:26:32] milimetric: we need a project in bugzilla for dashiki, how do we do that?
[14:27:34] nuria_: https://www.mediawiki.org/wiki/Bug_management/Project_maintainers
[14:28:01] Analytics / Visualization: Dashiki build needs to clean up ./dist directory before building - https://bugzilla.wikimedia.org/70845 (nuria) a:nuria
[14:28:31] ahhhh, thanks YuviPanda
[14:50:36] (PS1) Nuria: Cleaning up dist directory befory starting build [analytics/dashiki] - https://gerrit.wikimedia.org/r/160455 (https://bugzilla.wikimedia.org/70845)
[14:54:07] (CR) Milimetric: [C: 2 V: 2] Cleaning up dist directory befory starting build [analytics/dashiki] - https://gerrit.wikimedia.org/r/160455 (https://bugzilla.wikimedia.org/70845) (owner: Nuria)
[14:55:12] nuria: let's not worry about it since the phabricator move is coming up so soon
[15:26:28] qchris: I am starting to think that message loss is not due to varnishkafka
[15:26:39] which means I do not know where they are being lost when leadership changes
[15:27:49] ottomata: 1 sec
[15:27:53] ja np
[15:30:43] ottomata: Mhmmm ...
[15:30:49] So the pipeline is:
[15:30:54] varnish -> varnishkafka -> kafka -> camus -> hdfs
[15:31:04] Any other parts?
[15:31:50] that's it
[15:32:02] so, we can be sure it isn't varnish, varnishkafka is generating the sequence numbers here
[15:32:11] milimetric: puppet ran on wikimetrics1 successfully. check to see if things are ok?
[15:32:15] from a cursory glance they seem to be
[15:32:18] ottomata: Good point.
[15:32:29] and, i'm pretty sure it is kafka related somehow, as the loss times are correlated with kafka zookeeper timeouts
[15:32:32] but
[15:32:43] there are 0 reported txerrs and drerrs in varnishkafka
[15:32:59] there are no logs in varnishkafka logs about message loss, which there usually are if there are dropped messages
[15:33:38] Did you check with Snaps if he has seen such drops?
[15:34:32] all good YuviPanda, looks like some errors while the changes were being made but nothing important
[15:34:34] Or maybe (since he read more code), he has a clue where the issue could be?
[15:35:13] i just pinged him on irc a few mins ago
[15:35:16] YuviPanda: thanks again for the help!
[15:35:27] milimetric: \o/ yw
[15:35:32] ottomata: Cool.
[15:58:46] hm, qchris
[15:58:50] Yup.
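Since varnishkafka stamps each request with a per-host sequence number, the loss hunt above boils down to finding per-host gaps in what actually lands in HDFS. A rough sketch of such a gap check, assuming JSON records with hostname and sequence fields (the real webrequest schema may differ) and input sorted by sequence within each host:

```python
"""Rough per-host sequence-gap check: varnishkafka numbers each message per host,
so any hole in what reaches HDFS points at loss somewhere in kafka/camus.
Field names are assumptions, not the exact webrequest schema."""
import json
import sys

def find_gaps(lines):
    """Yield (host, previous_seq, current_seq) for every hole in the numbering."""
    last_seen = {}
    for line in lines:
        record = json.loads(line)
        host, seq = record["hostname"], int(record["sequence"])
        if host in last_seen and seq > last_seen[host] + 1:
            yield host, last_seen[host], seq
        last_seen[host] = seq
    # Sequence resets (e.g. after a varnishkafka restart) would show up as huge
    # "gaps" here and need to be filtered out separately.

if __name__ == "__main__":
    for host, prev, cur in find_gaps(sys.stdin):
        print("%s: missing %d seqs between %d and %d"
              % (host, cur - prev - 1, prev, cur))
```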
[15:59:19] so, i'm examining lost sequences for host cp3012.esams.wikimedia.org in the 2014-09-13T20 hour
[16:00:23] kafka zk timed out at 2014-09-13 20:43:55
[16:00:31] Client session timed out, have not heard from server in 17237ms
[16:00:49] the first missing seq is at
[16:00:50] 2014-09-13T20:43:16
[16:00:55] and the last is at
[16:00:55] 2014-09-13T20:43:49
[16:01:04] :-)
[16:01:36] weird, eh?
[16:01:36] Oh. That might be expected, right?
[16:01:42] maybe? how?
[16:01:52] Intermediate buffers
[16:02:01] A packet from 2014-09-13T20:43:16
[16:02:11] need not arrive at the broker on 2014-09-13T20:43:16.
[16:02:27] It might arrive later.
[16:02:34] true, buffer time is at 30 seconds on vks now
[16:02:42] So that's within reach.
[16:03:01] hm
[16:03:32] 20:43:55 - 17 seconds - 30 seconds = 20:43:08. Anything before that would be weird.
[16:04:07] s/Anything/Any drop/
[16:47:51] qchris_away: when you get back, let's sync up about hadoop pageview stuff
[17:01:32] Analytics / Wikimetrics: Add MAILTO to backup crontab on puppet to be notified of failures - https://bugzilla.wikimedia.org/70853 (nuria) NEW p:Unprio s:normal a:None Add MAILTO to backup crontab to be notified of failures. Currently we have added the MAILTO by hand.
[18:11:49] ottomata: back.
[18:12:03] About the sync up ... irc, hangout, or trap?
[18:12:04] in ops meeting :/
[18:12:07] k
[19:00:53] qchris_meeting: no longer in meeting, but now you are!
[19:48:33] ottomata: I guess now you're in a meeting again?
[19:48:44] nope!
[19:48:52] :-D
[19:49:00] Sooo .... irc, hangout, or trap?
[19:49:15] hm, irc!
[19:49:19] Coolio.
[19:49:37] You wanted to sync up on pageviews.
[19:50:48] yes so sososo
[20:01:34] ottomata1: I am not sure ... did the network disconnection cut off anything after the "yes so sososo"?
[20:01:40] What do we want to discuss around pageviews?
[20:02:05] From my point of view, we would not change definitions right now.
[20:03:21] Instead, we'd switch the pagecount files from dumps.wikimedia.org to expose data from the kafka pipeline.
[20:03:37] oh, guess so
[20:03:53] Ok. Sync up done :-d
[20:03:55] i'm basically not working on anything related, i'm doing some kafka stuff now, did lots of elasticsearch stuff last week
[20:03:55] just want to make sure of where we are
[20:03:55] oh, btw, i turned on sampled-1000 via kafkatee on analytics1003 last week
[20:03:56] like you asked for
[20:04:05] I saw that. Thanks.
[20:04:13] (But did not find time to check the files)
[20:04:14] qchris, i think that is a separate project
[20:04:26] switching dumps
[20:04:30] i don't think we should switch dumps at all at this time
[20:04:58] someone (ahem, you? :) ) should get a hadoop (hive?) implementation of the logic in filter/collector to work
[20:05:06] that's what i'm curious about :)
[20:05:21] Ad dumps ... I was referring to https://dumps.wikimedia.org/other/pagecounts-raw/
[20:05:47] I think we should switch that to files built from kafka.
[20:06:20] i think we should too, but why hurry?
[20:06:32] i'd rather switch the other udp2log logs first
[20:06:56] and, i don't think we should make any of that a dependency for generating hadoop pageview counts
[20:07:01] Well no need to hurry. But if we could stop using udp2log at end of month ... that would be great.
[20:08:06] And by what I saw up to now, that is well within reach.
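The 16:03 arithmetic above (zookeeper session gap plus varnishkafka's ~30 second buffer) can be written down as a small check: compute the earliest request timestamp whose loss the broker hiccup could still explain, then see whether the missing-seq timestamps fall inside that window. A sketch using only the numbers quoted in this hour of the log:

```python
"""Back-of-the-envelope check for the 16:03 reasoning: any drop earlier than
(zk timeout - session gap - varnishkafka buffer) would be hard to blame on the
broker hiccup. All numbers are the ones quoted in the conversation."""
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S"
zk_timeout_at = datetime.strptime("2014-09-13T20:43:55", FMT)
session_gap = timedelta(milliseconds=17237)  # "have not heard from server in 17237ms"
vk_buffer = timedelta(seconds=30)            # varnishkafka buffer time on vks

earliest_explainable = zk_timeout_at - session_gap - vk_buffer
print("any drop before", earliest_explainable.strftime(FMT), "would be weird")

for label, ts in [("first missing seq", "2014-09-13T20:43:16"),
                  ("last missing seq", "2014-09-13T20:43:49")]:
    verdict = ("within reach"
               if datetime.strptime(ts, FMT) >= earliest_explainable
               else "unexplained")
    print("%s at %s: %s" % (label, ts, verdict))
```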
[20:09:08] qchris, if we want to do that, we should switch out the logs being used on stat1002 to generate things right now
[20:09:28] those logs are being copied over there
[20:09:49] now that we have sampled-1000 (once verified), we could even make wikistats start using them
[20:09:51] Let's vet the tsvs first.
[20:09:55] Right.
[20:09:57] aye
[20:10:13] but, one of the reasons i really don't want to make the switch yet, is analytics1003 is not the real home for kafkatee
[20:10:15] it's our placeholder
[20:10:32] It's not? Where would it live?
[20:10:43] the current udp2log hosts
[20:10:45] oxygen, erbium, gadolinium
[20:11:03] They are beefy enough?
[20:11:07] so, we need to make sure we can do what we usually do with all the logs generated there
[20:11:50] not sure! but perhaps, we can rearrange the way they do things though
[20:11:59] ok.
[20:12:03] if only some filters need some of the streams, one machine can do one
[20:12:14] e.g., oxygen could do mobile/zero stuff,
[20:12:14] etc.
[20:12:23] we can keep analytics1003 doing this if we have to, but that was not the original intention
[20:12:41] Ok. Agreed.
[20:12:45] anyway, the next step for that project is to get stat1002 processes to start using the kafkatee generated log files
[20:12:55] the zero dashboards are generated there from udp2log files, right?
[20:12:59] Right.
[20:13:05] and, geowiki?
[20:13:14] Geowiki is using mediawiki databases directly.
[20:13:23] ohoh kkk
[20:13:25] so, what is there?
[20:13:31] zero dashboards
[20:13:31] wikistats
[20:13:33] ...?
[20:13:35] anything else?
[20:13:41] I think that's it from the services we run.
[20:13:42] mobile dashboards (from mobile-100s?)
[20:14:08] Not sure what runs them.
[20:14:18] ok, well, what do we need to do to get them moved over?
[20:14:22] haha, should I just switch out the dirs?
[20:14:26] Yes.
[20:14:32] and symlink /a/squid to /srv/log/webrequest? (or whatever it is?)
[20:14:39] Once the data is vetted, that's what I'd do.
[20:14:47] And adjust for the historic tsvs.
[20:14:48] every time you symlink something ori dies a little bit
[20:14:50] :)
[20:15:24] That way, the switch is transparent to services.
[20:17:46] ok, what's left to do to vet?
[20:18:06] the tsvs are not yet vetted at all.
[20:18:48] So basically, one needs to look if they represent the same data.
[20:25:09] ottomata: so what's your take on the status quo?
[20:25:39] ok, i think the hadoop pageview stuff is more important
[20:25:51] the udp2log replacement should not be a blocker for that
[20:26:30] And "hadoop pageview stuff" refers to producing Webstatscollector through Hadoop/Hive?
[20:26:46] (Not the pageview re-definition and some such)
[20:27:05] well, yes, as a first step
[20:27:15] the idea would be to plug in a new pageview implementation later
[20:27:17] but yes
[20:27:40] Ok, I think we are on the same page then.
[20:35:54] so, i guess that's really what I want to sync up on, have you started on that at all? qchris?
[20:37:02] Only partly.
[20:37:21] Since it's webstatscollector only, the plan is to do it in Hive for the first part.
[20:37:36] That allows us to instrument the cluster, but still allows easy adaptations.
[20:38:17] Webstatscollector is simple, so I have some HiveQL snippets that do the trick from the
[20:38:31] webstatscollector checking we did on analytics1003.
[20:38:41] ah right
[20:38:42] ok cool
[20:38:52] The interesting part is more on the scheduling part for now.
[20:38:58] Will it be Oozie or not?
[20:38:59] oh, the oozie side?
[20:39:02] yes!
:)
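Before the scheduling thread continues below, here is a sketch of the TSV vetting just discussed ("look if they represent the same data"): line up the udp2log-generated file and the kafkatee-generated file for the same stream and compare per-minute request counts. The paths and the timestamp column index are illustrative assumptions, not the actual stat1002 layout:

```python
"""Sketch of the 'do they represent the same data' comparison between an
udp2log TSV and the kafkatee TSV for the same stream. Paths and the timestamp
column index are assumptions."""
import gzip
from collections import Counter

def per_minute_counts(path, ts_column=2):
    """Count lines per minute, keyed on the timestamp truncated to YYYY-MM-DDTHH:MM."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > ts_column:
                counts[fields[ts_column][:16]] += 1
    return counts

def compare(udp2log_path, kafkatee_path, tolerance=0.01):
    a = per_minute_counts(udp2log_path)
    b = per_minute_counts(kafkatee_path)
    # Report minutes where the two sources disagree by more than the tolerance.
    for minute in sorted(set(a) | set(b)):
        if abs(a[minute] - b[minute]) > tolerance * max(a[minute], b[minute], 1):
            print("%s: udp2log=%d kafkatee=%d" % (minute, a[minute], b[minute]))

# e.g. compare("/a/squid/sampled-1000.tsv.log-20140915.gz",   # hypothetical paths
#              "/srv/log/webrequest/sampled-1000.tsv.log-20140915.gz")
```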
[20:39:03] yes?
[20:39:14] ideally we'd fire them off based on _SUCCESS file
[20:39:18] Totally!
[20:39:19] but, i think we can't rely on that, dunno...
[20:39:21] so far anyway
[20:39:29] But how to get the files into plain fs?
[20:39:34] ?
[20:39:40] oh, not hdfs
[20:39:51] Cron would be easier, as we can use hive to fire the query and write directly to a directory.
[20:39:59] Yes. The non-hdfs part.
[20:40:38] Also, I am thinking about the trimming part on the domain name (e.g.: "en.wikibooks.org" -> "en.b")
[20:40:49] That's easy in hive too, but sooooo verbose.
[20:40:54] as for writing, i think just select into a table
[20:41:00] keep it in hive
[20:41:01] It makes the query pretty unreadable.
[20:41:01] and we will figure out how to get it out
[20:41:13] you could make your table tsv separated
[20:41:18] then you'd have files in hdfs that are just tsvs
[20:41:22] Why would we even persist the data into a hive table?
[20:41:23] even though you wrote to them with a hive insert
[20:41:56] why not?
[20:42:01] well, the table will get large, so it is selectable. then it will also be easily sqoopable. I betcha we can even sqoop out of a hive query into a local file
[20:42:02] maybe...
[20:42:26] or maybe not
[20:42:27] even so
[20:42:33] I am not sure.
[20:42:42] if we have files in hdfs (created via hive, or not), we can figure out how to just pull them out locally
[20:42:57] Yes.
[20:43:00] we can script some diff $(hdfs ls /path) $(ls /local/path)
[20:43:30] But hive writes the data to temporary storage. If we require it in non-temporary storage only to copy it to the disk afterwards,
[20:43:37] or maybe we could sqoop them out to mysql and export tsvs from there.. :)
[20:43:46] it is an unnecessary copy from temporary to real hdfs-storage.
[20:44:10] naw, it's still useful in hdfs though, being able to query this table in hive will be useful, no?
[20:45:15] I am not sure if such a table would buy us more problems than it solves issues.
[20:45:23] why?
[20:45:31] If the table exists, researchers will use it, and start relying on it. But when we have
[20:45:49] the full pageview definition, we might not be producing such tables any longer.
[20:46:26] So that way, we only bought us discussion about "removing data from researchers".
[20:46:56] For me, the current implementation is more or less just to get the
[20:47:06] {page,project}count files out of Hadoop.
[20:47:23] The whole setup will change, once we have a good pageview definition.
[20:47:57] So I would not want to design schemas and everything around the throw-away implementation, that we only use to
[20:48:12] bridge the time between turning off udp2log and having a proper pageview implementation.
[20:48:31] hmmmmm
[20:48:38] i dunno.
[20:48:39] But I guess I can easily be convinced of the opposite.
[20:48:44] haha
[20:48:45] i bet you can!
[20:48:58] i don't see how giving the researchers this data in queryable form is different than giving it to them in text form
[20:50:00] actually, qchris, i don't care so much
[20:50:12] i think that's how i'd do it, mainly because all our tooling is around hive now, so why not keep it that way
[20:50:19] but, if you were to do this not in hive...which would be totally fine
[20:50:25] then maybe some other way is better
[20:52:07] Let's just code towards it. If some problem is better solved with materializing in Hive, let's materialize it (for now). If no problem requires it, we can always add it as bonus, if we want to.
[20:52:13] How does that sound?
sounds good to me :)
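The "diff $(hdfs ls /path) $(ls /local/path)" idea above could look roughly like this: list what Hive wrote to HDFS, compare against the local directory, and pull over whatever is missing. The directory names are placeholders, not the real cluster paths:

```python
"""Small sketch of pulling newly produced files out of HDFS onto plain fs.
HDFS_DIR and LOCAL_DIR are placeholders, not the real locations."""
import os
import subprocess

HDFS_DIR = "/wmf/data/pagecounts"   # assumed HDFS output directory
LOCAL_DIR = "/srv/pagecounts"       # assumed local target directory

def hdfs_listing(path):
    out = subprocess.run(["hdfs", "dfs", "-ls", path],
                         capture_output=True, text=True, check=True).stdout
    # File rows of 'hdfs dfs -ls' start with "-"; the last column is the path.
    return {line.split()[-1] for line in out.splitlines() if line.startswith("-")}

def sync():
    local = set(os.listdir(LOCAL_DIR))
    for hdfs_path in sorted(hdfs_listing(HDFS_DIR)):
        name = os.path.basename(hdfs_path)
        if name not in local:
            subprocess.run(["hdfs", "dfs", "-get", hdfs_path,
                            os.path.join(LOCAL_DIR, name)], check=True)
            print("fetched", name)

if __name__ == "__main__":
    sync()
```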
[20:53:13] qchris, do you think this data is temporary anyway? the hadoop pageviews w/ webstats logic data?
[20:53:25] Yes, I hope so.
[20:53:31] ok, hm.
[20:53:37] The webstatscollector pageview definition is wrong in so many cases.
[20:53:42] so, it sounds like you think it is not worth building tooling around supporting it
[20:53:43] well, yeah
[20:53:53] e.g. oozie
[20:54:05] hm, i envision the pageview def implementation to be a UDF
[20:54:09] if we are using hive
[20:54:14] not built into hive logic
[20:54:30] or, more generally, Java code, with a UDF wrapper
[20:54:35] like diederik's was
[20:54:52] Java code + UDF wrapper sounds ideal to me too.
[20:55:17] But I am not sure if the UDF would get used in the automatic jobs.
[20:55:50] Hive might be too slow and too general purpose for that application.
[20:56:11] The UDF would just help when trying to detect issues.
[20:56:32] But I guess, we'll see once we get closer to it.
[20:56:56] For the one-off webstatscollector, Hive seems like a good first choice.
[20:57:39] hm, qchris, i think part of the reason for doing this
[20:57:48] is to start building the base for plugging in the new pageview def
[20:58:04] it'd be nice if we built a class and a UDF wrapper
[20:58:20] and the hive query would be pretty much the same, once we had a new pageview implementation
[20:58:49] or, maybe we wouldn't use hive, as you say
[20:58:57] if we did raw mapreduce, or scalding, or whatever
[20:59:06] As Hive cannot write out multiple files in one go, I am not sure if that would scale nicely if we want to produce multiple output variants.
[20:59:08] we'd still be able to use the same method to classify a pageview
[20:59:31] aye, that's fine, i think that's why the java code is the more important part here
[20:59:34] of this effort
[20:59:42] sure, the data we are going to generate now is going to use crappy logic
[20:59:50] but we will have tooling around using a java class
[20:59:55] via udf, or whatever
[21:00:05] then when we get the new pageview def, it's just a drop-in
[21:00:19] do we need anything more featureful than an isPageview() method right now?
[21:01:09] Not really. There is stripping of domain names. But that's doable either way.
[21:01:52] But I am not sure that the Java part is crucial for the first iteration. For me, the crucial thing is to produce value out of the cluster.
[21:02:29] For now, that'd be easier using straight Hive.
[21:02:47] No need to set up Maven, JUnit and so on.
[21:03:23] That's something we can achieve in September.
[21:03:49] And it's nice, useful, showcaseable, and it produces real value for the community right now.
[21:04:57] hm, if we are going to release this to the community, then we probably should have good tooling around this
[21:04:59] including oozie
[21:05:34] Well ... it would replace the current way to produce {page,project}count files.
[21:05:56] The tooling needs to be good enough to support it for a few months.
[21:06:37] i guess...btw we would not be replacing the existing ones
[21:06:42] just providing an additional dataset
[21:07:27] That's ok too. I'd just like to see udp2log usage go down sooner rather than later.
[21:07:33] But having both is fine by me.
[21:08:15] well, ok, hm. i could vet files...but I think you are much better at that than me :p...hm. how long do you think it would take to vet?
[21:08:48] if we vet files, then I can work on gradually reducing usage of udp2log
[21:08:49] For mobile+zero+sampled ... maybe two days?
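For reference, the two pieces the Java class + UDF wrapper would eventually own are roughly an isPageview() check and the domain abbreviation ("en.wikibooks.org" -> "en.b"). A loose Python sketch of that shape — the real suffix table and filter rules live in webstatscollector, and the cut-offs here are illustrative, not the agreed pageview definition:

```python
"""Loose stand-in for the classification logic a Java class + Hive UDF wrapper
would encapsulate. Only a partial suffix table is shown; wikipedia gets no suffix."""

PROJECT_SUFFIX = {
    "wikibooks": ".b",
    "wiktionary": ".d",
    "wikinews": ".n",
    "wikiquote": ".q",
    "wikisource": ".s",
    "wikiversity": ".v",
    "wikipedia": "",
}

def abbreviate_domain(host):
    """'en.wikibooks.org' -> 'en.b', 'en.wikipedia.org' -> 'en'."""
    parts = host.lower().split(".")
    if len(parts) >= 3 and parts[-1] == "org" and parts[-2] in PROJECT_SUFFIX:
        return parts[0] + PROJECT_SUFFIX[parts[-2]]
    return host

def is_pageview(url_path, http_status, mime_type):
    """Very rough stand-in for the eventual isPageview() method, not the real rules."""
    return (url_path.startswith("/wiki/")
            and http_status in ("200", "304")
            and mime_type.startswith("text/html"))
```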
[21:09:08] ok, should we ask kevin/toby which is better to work on first?
[21:09:29] If you want to, ask them. But I think we can pull off both this month.
[21:09:50] ok, well then, 2 days isn't so bad...which would you rather work on first?
[21:10:19] I'd rather do the webstatscollector re-implementation first.
[21:10:24] But I do not care too much.
[21:10:30] ok, then do it!
[21:10:31] :)
[21:10:58] It's one of the top three things on my to-do list :-D
[21:11:13] COOooOOOoOL
[21:11:14] :)
[21:11:32] i am trying to double super check that these missing seqs are not in kafka
[21:11:52] to rule out camus problems...if they are in kafka then that will be very interesting
[21:12:03] k
[21:12:05] Cool.
[21:13:18] Btw. ... now that the University stream is gone ... what about bringing over all udp2log filters to analytics1003?
[21:13:52] That might help to benchmark the resources we need from erbium&co to run them through kafkatee, and
[21:14:42] if we do that already now, we'd have more data to vet, when trying to bring the remaining tsvs (i.e.: non mobile+zero+sampled) over to kafkatee.
[21:15:17] HMmm good idea
[21:15:19] sure, don't see why not
[21:15:34] really that's only the remaining filters from erbium and oxygen, ja?
[21:15:51] I think so. Let me check puppet again.
[21:18:04] The files that I know that analytics people care about are generated on erbium and oxygen.
[21:18:22] yeah
[21:18:26] ok, gonna make a commit, will merge it tomorrow
[21:18:38] You tha man!
[21:24:03] hmm, i'm going to leave the fundraising ones out of this commit, but those will come shortly
[21:24:05] i want to talk to jeff green
[21:26:08] Sure.
[21:37:35] qchris, missing seqs confirmed missing from kafka
[21:37:41] ok, that is expected, but at least I know that now
[21:37:58] That's great!
[21:38:03] So no issues on camus.
[21:38:06] right
[21:38:28] so, somehow kafka is missing seqs, but varnishkafka reports 0 lost messages
[21:38:55] i asked snaps, he said that if varnishkafka txerr and drerr are 0, then for sure there are no lost messages
[21:38:56] haha
[21:38:56] well
[21:39:03] :-D
[21:39:11] "commercially guaranteed!"
[21:39:12] haha
[21:39:21] hahaha
[21:39:25] so now we have a big ol' mystery.
[21:40:38] I guess I'll sleep over the mystery :-)
[21:41:22] Good night!
[21:41:33] ok nighters! i'm about out too
[21:41:35] thanks! laters
[21:41:36] laters.
[22:11:49] (PS1) Milimetric: Match colors in graph with labels [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532
[22:12:07] nuria_: I've gotta run soon but that's the colors patch ^
[22:12:45] I like the color scheme you set for the dashboard in general but I had to add an ugly white border around the color swatches to keep them from interfering with the red
[22:12:49] *orange
[22:12:58] feel free to change that style if you see a better way
[22:20:13] milimetric, will look at it tomorrow as i want to make some progress on url stuff
[22:58:46] Analytics / Dashiki: Update metric classification in Vital Signs dashboard - https://bugzilla.wikimedia.org/70871 (Kevin Leduc) NEW p:Unprio s:normal a:None Current behavior shows the following classes when clicking the add metric button: - All metrics - acquisition - community - retention...
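The "double super check" above reduces to set arithmetic once two dumps exist: the per-host seqs the HDFS-side check flagged as missing, and the (hostname, sequence) pairs read back from the kafka topic for the same window (however that dump was produced). A sketch, with the file formats assumed rather than taken from any real tool:

```python
"""Confirm that seqs flagged as missing on the HDFS side are also absent from a
dump of the kafka topic, ruling camus in or out. Input formats are assumptions:
one JSON object per line with hostname and sequence fields."""
import json
import sys

def load_pairs(path):
    with open(path) as f:
        return {(r["hostname"], int(r["sequence"])) for r in map(json.loads, f)}

def main(missing_path, kafka_dump_path):
    missing = load_pairs(missing_path)
    in_kafka = load_pairs(kafka_dump_path)
    found = missing & in_kafka
    if found:
        # If any turn up, the loss happened downstream (camus), not in kafka.
        print("%d 'missing' seqs are actually present in kafka:" % len(found))
        for host, seq in sorted(found)[:20]:
            print("  %s %d" % (host, seq))
    else:
        print("confirmed: none of the missing seqs are in kafka")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```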
[22:59:00] Analytics / Dashiki: Update metric classification in Vital Signs dashboard - https://bugzilla.wikimedia.org/70871 (Kevin Leduc) p:Unprio>Highes
[23:00:12] kevinator: I can talk in 15-20 if you have a moment
[23:00:52] sure, I’m at my computer
[23:10:48] (CR) Milimetric: [C: -1] "found a bug: remove a project and add it back, it won't have the same color. Also, needs tests (that bug scenario seems like a good test " [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532 (owner: Milimetric)
[23:19:06] kevinator: sent you a hangout link
[23:19:51] give me 30 secs… i’m looking for my headset
[23:20:23] sure
[23:42:36] ok, milimetric, the url will be like: http://localhost:8000/src/#projects=enwiki,dewiki/metrics=RollingActiveEditor,NewlyRegistered
[23:42:53] so it looks like the defaults file as much as possible
[23:43:06] that works
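Parsing that URL fragment into the same shape as the defaults file is straightforward; dashiki itself is JavaScript, so this Python sketch only illustrates the format being agreed on, with the key names taken from the example URL:

```python
"""Sketch of parsing the proposed dashiki fragment into {key: [values]} form.
Illustrative only; the real implementation lives in dashiki's JavaScript."""

def parse_fragment(fragment):
    """'#projects=enwiki,dewiki/metrics=A,B' -> {'projects': [...], 'metrics': [...]}"""
    state = {}
    for part in fragment.lstrip("#").split("/"):
        if not part:
            continue
        key, _, values = part.partition("=")
        state[key] = [v for v in values.split(",") if v]
    return state

assert parse_fragment(
    "#projects=enwiki,dewiki/metrics=RollingActiveEditor,NewlyRegistered"
) == {
    "projects": ["enwiki", "dewiki"],
    "metrics": ["RollingActiveEditor", "NewlyRegistered"],
}
```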