[00:30:24] (CR) Nuria: "I looked at the changeset for an hour and i think that (minus the inside vega knowledge when it comes to scales settings) the only part " (3 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532 (owner: Milimetric)
[07:33:54] (CR) Gilles: [C: 2 V: 2] Fix relative UW funnel numbers [analytics/multimedia] - https://gerrit.wikimedia.org/r/160682 (owner: Gergő Tisza)
[07:36:17] (CR) Gilles: [C: 2 V: 2] Remove useless columns from ordinal chart [analytics/multimedia] - https://gerrit.wikimedia.org/r/160684 (owner: Gergő Tisza)
[07:39:05] (CR) Gilles: [C: 2] Add relative timeseries [analytics/multimedia] - https://gerrit.wikimedia.org/r/160779 (owner: Gergő Tisza)
[07:39:08] (CR) jenkins-bot: [V: -1] Add relative timeseries [analytics/multimedia] - https://gerrit.wikimedia.org/r/160779 (owner: Gergő Tisza)
[07:41:29] (PS2) Gilles: Add relative timeseries [analytics/multimedia] - https://gerrit.wikimedia.org/r/160779 (owner: Gergő Tisza)
[07:42:20] (CR) Gilles: [C: 2] Add relative timeseries [analytics/multimedia] - https://gerrit.wikimedia.org/r/160779 (owner: Gergő Tisza)
[07:42:25] (Merged) jenkins-bot: Add relative timeseries [analytics/multimedia] - https://gerrit.wikimedia.org/r/160779 (owner: Gergő Tisza)
[08:06:20] (CR) Gilles: [C: -1] "The last timeseries shows nothing, but I'm not sure if it's not caused by the TSV itself, which contains some "NULL" values for September " [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/160785 (owner: Gergő Tisza)
[08:11:19] (CR) Gilles: "Filed a card for the data collection issue: https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/894" [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/160785 (owner: Gergő Tisza)
[11:58:30] (PS4) Milimetric: Match colors in graph with labels [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532
[11:59:06] (CR) Milimetric: Match colors in graph with labels (3 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532 (owner: Milimetric)
[11:59:31] (PS6) Milimetric: Match style from Pau's wireframes more closely [analytics/dashiki] - https://gerrit.wikimedia.org/r/160690
[14:23:31] heya milimetric, do we still need the metrics-api.wikimedia.org CNAME?
[14:23:50] don't know of anyone using that ottomata
[14:24:26] KKKIIIILLLLL ITTTT
[14:24:32] k
[14:24:38] (CR) Nuria: [C: 2] Match colors in graph with labels [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532 (owner: Milimetric)
[14:26:58] (PS2) Milimetric: Build Second Release [analytics/dashiki] - https://gerrit.wikimedia.org/r/160693
[14:27:00] (CR) Nuria: [V: 2] Match colors in graph with labels [analytics/dashiki] - https://gerrit.wikimedia.org/r/160532 (owner: Milimetric)
[14:27:20] (CR) Nuria: [V: 2] Match style from Pau's wireframes more closely [analytics/dashiki] - https://gerrit.wikimedia.org/r/160690 (owner: Milimetric)
[14:29:05] (PS3) Milimetric: Build Second Release [analytics/dashiki] - https://gerrit.wikimedia.org/r/160693
[14:30:58] milimetric, nuria_ : I deleted the “sprint planning” meeting that was set for this morning. I think Sara mistakenly set it for the wrong time. It also exists at the right time on Thursday
[14:40:35] (CR) Milimetric: [C: 2 V: 2] "nuria let me self-merge :)" [analytics/dashiki] - https://gerrit.wikimedia.org/r/160693 (owner: Milimetric)
[14:41:34] qchris: you are totally right! bits is not in udp2log!
[14:42:11] Great.
[14:42:25] I am not sure if it ever was, so
[14:42:31] yeah, don't think it was
[14:42:33] it's probably not worth changing it.
[14:42:33] cool, ok.
[14:42:36] Ok.
[14:42:36] no certainly not
[14:42:48] just didn't realize (or maybe I knew once and then forgot)
[14:52:37] kevinator: sumo oranges look crazy! It's like a watermelon
[14:52:57] also, kevinator, latest deployed: https://metrics.wmflabs.org/static/public/dash/
[14:52:59] they aren’t that big
[14:53:09] oh maybe just the plate they're on...
[14:53:16] cool, checking it out now
[14:53:35] sumos are a little smaller than a navel orange
[14:59:38] kevinator: you can probably shoot that back to Pau along with our comments on the etherpad and any thoughts you have about prioritization
[15:00:03] but I was thinking it's a bit closer to the design now
[15:00:10] (let me know what Jared / others think)
[15:00:22] will do. And I’ll let Jared know too
[15:12:48] YuviPanda: still around, I got another amendment to that patch
[15:13:12] or maybe a separate patch.
[15:13:47] * YuviPanda is around
[15:17:01] YuviPanda: https://gerrit.wikimedia.org/r/#/c/160970/
[15:18:34] oof, i dunno, or I could wrap it with an if !defined(..)
[15:18:34] :/
[15:19:21] ensure_resource maybe?
[15:19:26] ottomata: you should poke andrewbogott
[15:20:44] hm, oh cool, ensure_resource
[15:21:12] YuviPanda: what happens if the resource is declared elsewhere without ensure_resource?
[15:21:16] in the normal way?
[15:21:18] it errors :D
[15:21:27] or is order dependent
[15:23:24] aye
[15:23:30] just like if !defined
[15:23:36] ya
[15:25:56] pssh, actually i have a larger problem because the wikimetrics module manages my mount path
[15:26:09] but i need the thing to be mounted first
[15:26:13] oof
[15:26:24] I still suggest making it a param to the class, and setting it to /srv in the puppet role :)
[15:26:27] YuviPanda: i could solve this problem if I wrap it with if !defined() , and also do so in the module
[15:26:47] ha, well, now that i'm looking at this, i'd still have this problem
[15:27:01] you won't have a different mountpoint then ;)
[15:27:51] wikimetrics manages $var_directory (it is a param already) and ensures it is a directory, hmm, wait no that would be fine, i was thinking there was a dependency order problem...
[15:28:03] hm, maybe you are right. I don't like it just cause its ugly, but not as ugly as dealing with this :p
[15:28:07] yrrghh
[15:28:30] it's not ugly at all, IMO
[15:28:41] also way less ugly than ifdefines in a bunch of places
[15:28:51] /srv/var/lib/wikimetrics? /srv/lib/wikimetrics?
[15:28:51] hm
[15:28:55] /srv/var/wikimetrics?
[15:29:01] I keep mine at /srv/quarry :)
[15:29:09] that's where the wikimetrics source is
[15:29:15] need an external datadir
[15:29:32] ah
[15:29:43] Quarry keeps output on NFS, no need to worry about filespace issues :)
[15:30:09] hm, /project?
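For reference, a minimal sketch of the two Puppet approaches discussed above (15:18–15:28) for avoiding a duplicate declaration when more than one class wants to manage /srv. The bare file resource is illustrative only — in the real setup /srv is managed by the labs_lvm::volume define — so the resource shown here is an assumption, not the actual manifest:

    # Option 1: guard with !defined() -- skips the declaration when some
    # already-parsed part of the catalog declared File['/srv'].
    # As noted above, this is parse-order dependent.
    if !defined(File['/srv']) {
        file { '/srv':
            ensure => directory,
        }
    }

    # Option 2: ensure_resource() from puppetlabs-stdlib -- declares the
    # resource only if an identical one is not already in the catalog;
    # a conflicting declaration elsewhere still errors.
    ensure_resource('file', '/srv', { 'ensure' => 'directory' })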
[15:30:19] naw, we have multiple wikimetrics instances though
[15:30:22] they are specific to the instance
[15:31:05] /data/project
[15:31:12] ah, right
[15:31:14] aye
[15:31:24] my test instances write to /data/project/test/output
[15:34:13] gonna go with /srv/var/wikimetrics
[15:34:15] :/
[15:46:52] ok nuria_, i'm abandoning the extra mount effort
[15:47:00] i still think it is better, but it is being a pain
[15:47:06] aha
[15:47:08] going to go with /srv/var/wikimetrics
[15:48:18] https://gerrit.wikimedia.org/r/#/c/160689/
[15:48:55] ottomata: /srv/ being the 'generic mount' on labs
[15:49:14] theoretically, 'server specific files'
[15:49:16] yeah, i'm going to go back on using the define manually too
[15:49:18] so , meh?
[15:49:42] the simpler (but according to me less elegant) solution wins!
[15:49:43] :)
[15:50:08] :D
[15:50:14] 'better is worse' etc
[15:50:34] and .. when chceking out source ottomata how do we ensure source is also checked under the "mounted" srv
[15:50:54] *when checking out the source
[15:51:54] hmmm
[15:52:03] good thought, will submit another change in a sec...
[15:54:34] ottomata: k
[15:59:10] ok, nuria_, all merged and things look good on staging
[16:00:00] ok ottomata changes are this one: https://gerrit.wikimedia.org/r/#/c/160689/8/manifests/role/wikimetrics.pp
[16:00:07] and which is the other one,
[16:00:17] yup
[16:00:26] this
[16:00:26] https://gerrit.wikimedia.org/r/#/c/160687/
[16:00:45] ha, actually, we don't even need that one, but that one was necessary for his define to ever work more than once
[16:00:50] but, we are only using it once
[16:01:02] so, the fix is not relevant in our case, but is needed for correctness :)
[16:01:30] and ... weren't you going to do another one for the checked out source?
[16:02:49] ottomata: also does the modules/labs_lvm need to have the "submodule" bump up? (never know how to call this hopefully you know what i mean)
[16:04:16] nuria_: I added it to the other patch
[16:04:23] see the require => at the bottom of the wikimetrics class usage
[16:04:47] no, so nuria_, it only needs a submodule bump if module/X is a git submodule
[16:04:55] modules/labs_lvm is in ops/puppet
[16:04:57] and this one is not, ok
[16:04:59] right
[16:05:04] ok, i see the require
[16:05:29] ottomata: i take the require ensures that srv is a mount so.. what happens if there is an srv already pre-existing?
[16:06:00] then it will conflict, that was something I was just talking with Yuvi about
[16:06:27] the require does not ensure that srv is a mount
[16:06:54] it just ensures that the labs_lvm::volume define is mounted before the wikimetrics class evalculates
[16:06:57] is evaluated *
[16:07:08] and
[16:07:23] labs_lvm::volume does define the '/srv' file resource
[16:07:31] so, if someone does this elsewhere in puppet too
[16:07:32] it will conflict
[16:11:16] ok, then in staging ottomata you must have changed the srv
[16:11:22] prior to run puppet , correct?
[16:11:35] as there the source was not checked into a mount
[16:11:45] but rather a 'made up' /srv directory
[16:13:57] hm, noooo, when I messed with it yesterday it was already done there
[16:14:00] i think yuvi did that
[16:14:06] when he first made the patch for you
[16:14:11] the first one, for mysql
[16:17:11] nuria_: want to apply on prod?
[16:17:13] or shall I?
[16:20:38] making lunch...
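A sketch of the ordering pattern ottomata describes above (the require => at the bottom of the wikimetrics class usage in the role), so the data directory is only managed after the labs_lvm::volume define has mounted /srv. The resource title and the mountat parameter are assumptions; var_directory is the existing class parameter mentioned at 15:27. This is not the actual role::wikimetrics code, just an illustration of the shape:

    # Hypothetical role sketch: mount /srv first, then evaluate the
    # wikimetrics class, which creates its var directory underneath.
    labs_lvm::volume { 'wikimetrics-disk':    # title is hypothetical
        mountat => '/srv',                    # parameter name assumed
    }

    class { '::wikimetrics':
        var_directory => '/srv/var/wikimetrics',
        # Ordering only: the require does not itself ensure /srv is a
        # mount, it just makes puppet apply the volume define first.
        require       => Labs_lvm::Volume['wikimetrics-disk'],
    }

As discussed at 16:06–16:07, labs_lvm::volume declares the '/srv' file resource, so if another class or role also declared File['/srv'], the catalog compile would fail with a duplicate declaration conflict.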
[16:24:45] ottomata: yes, let me look at the queue in prod
[16:42:13] ottomata: looks like prod is ok, let me know when you want to deploy
[16:58:29] nuria_, ok let me know when you are back and we will do so
[16:58:34] unless, milimetric, you around?
[16:58:46] i just want a backup in case something goes wrong
[16:58:57] i'd need to shut down wikimetrics for a minute too
[16:59:09] i'm here ottomata
[16:59:33] you can do your thing, just remember scrum of scrums is in 30 min.
[16:59:43] either way, nobody was active on wikimetrics last i checked
[17:01:43] ok
[17:01:45] yeah i know
[17:01:48] should be an easy thang
[17:02:10] will you do your kill all thing on wikimetrics1?
[17:02:15] milimetric: ^
[17:02:51] k, doing
[17:03:25] k ottomata, stopped
[17:03:28] k
[17:05:20] hm, milimetric, i am running puppet on wikimetrics1
[17:05:21] s'ok, right?
[17:05:27] it was admin disabled
[17:05:41] it's ok to run
[17:05:43] k
[17:05:54] but if you leave it running it'll restart the queue and web randomly
[17:05:55] and that's not ok
[17:06:07] but it's ok to puppet agent -tv
[17:06:15] we just do it manually every time we deploy
[17:06:30] ahem, prod is giving permission errors : https://metrics.wmflabs.org/static/public/167208/full_report.json
[17:07:02] that url loads for me nuria
[17:07:03] i just restarted
[17:07:05] looks ok now
[17:07:09] ah, ok
[17:07:38] ottomata: you didn't make puppet run automatically on a schedule right?
[17:07:57] nbo
[17:07:58] no
[17:08:00] but it was admin disabled
[17:08:10] which means that i couldn't run puppet agent -tv
[17:08:16] reenabled it
[17:08:18] oh huh?
[17:08:21] means
[17:08:22] someone ran
[17:08:25] puppet agent --disable
[17:08:34] k, weird...
[17:08:43] as long as it doesn't restart everything randomly, we're good
[17:09:05] aye, i don't think puppet should run without a manual run
[17:09:09] i show the queue and scheduler running as normal
[17:09:44] all looks well, checked the site
[17:09:49] and wrote/deleted a file
[17:10:00] ok, ottomata, we should be good from now on, dan and i just cleaned up some empty files that were created when we had no space
[17:10:10] big thanks!
[17:12:06] yes, thank you very much ottomata / YuviPanda
[17:13:57] yup!
[17:14:01] sorry that took so long!
[17:40:21] Analytics / Tech community metrics: Allow contributors to update their own details in tech metrics directly - https://bugzilla.wikimedia.org/58585#c26 (Jicksy) Alvaro, I went through the prototype Sarvesh has made, and I wish to contribute to this project. I have worked on Django for my internship....
[17:49:06] milimetric, around?
[17:49:14] yep
[17:49:15] hi yurikR
[17:52:47] hi milimetric, that sample seems to be using an expression for labeling
[17:52:52] will be disabled for us :(
[17:53:06] expression?
[17:53:37] it's only transforming the data to grab just the last value
[17:53:53] the way it's done is by showing a label "enwiki", etc, on each data point of the graph, except that it filters it to just show on last point
[17:54:36] right, this line: https://github.com/wikimedia/analytics-dashiki/blob/master/src/components/visualizers/vega-timeseries/bindings.js#L192
[17:54:44] you mean they're not letting you do the test attribute there?
[17:55:29] why? vega's expression syntax is extremely limited and safe: https://github.com/trifacta/vega/blob/master/src/parse/expr.js
[17:55:51] milimetric: needs csteipp approval, I think
[17:56:03] for 'safe'
[17:56:09] milimetric, not exactly - that test is compiled into a function
[17:58:46] milimetric, https://github.com/trifacta/vega/blob/6bb8c9910c8901a962e45a91b7262144e7c6ba1c/src/parse/expr.js#L38
[17:59:17] in other words, it does "eval()" on your test expr
[17:59:27] and we know how safe eval is
[18:00:19] it does not do eval, yurikR, it lexes and basically only allows d, index, data, and some math functions and normal operation tokens
[18:01:01] milimetric, nope :) it lexes it to convert known constants, e.g. PI, into Math.PI
[18:01:02] that's the whole lexer in that file there, and there's no eval...
[18:01:21] and to exclude any keywords besides the accepted ones
[18:01:27] it tokenizes it, converts some tokens, and converts it back into a string
[18:01:28] \https://github.com/trifacta/vega/blob/6bb8c9910c8901a962e45a91b7262144e7c6ba1c/src/parse/expr.js#L59
[18:01:31] https://github.com/trifacta/vega/blob/6bb8c9910c8901a962e45a91b7262144e7c6ba1c/src/parse/expr.js#L59
[18:02:18] try it - add "alert('danger')" as your test value
[18:02:35] calling Function() is identical to eval
[18:03:12] milimetric, ^
[18:06:26] :( you're right, i can't believe i missed that
[18:06:28] that's sad
[18:07:05] and seems unnecessary, we can probably just patch that with a limited parser
[18:07:09] yurikR: ^
[18:07:32] milimetric, true, we could try to develop a limited parser on top of it
[18:07:42] assuming we truly need it
[18:07:50] otherwise, we could try to patch lua in :)
[18:08:00] heh
[18:08:03] emscripten+luajit, you mean? :)
[18:08:18] scribunto!
[18:08:48] i don't see why we can't, as a data source, pass some params via an api call )
[18:09:16] to a lua module, that will do any kinds of data manipulation for us
[18:09:55] plus vega definition could be a template, thus passing even more params from the usage location
[18:11:30] i looked around a bit. There are lots of examples of limited parsers not using eval
[18:11:54] and I know gabriel's looked into this while he was developing knockoff
[18:12:27] we could also just filter out bad stuff with a simple regex
[18:12:47] milimetric, good, patches are welcome, but until then, that magic "if() return true" will stay in place )
[18:13:23] but i really think we should explore lua as an alternative to some of it
[18:13:40] we just need to figure out how to use it for streaming stuff )
[18:14:46] i'm not sure how lua works, but if it makes sense and doesn't make it a lot harder to write a graph it sounds fine
[18:15:09] i've gotta run, keep me in the loop :)
[18:16:26] coool, qchris_away, if i subtract bits from the hourly sampled-1000 file i'm checking
[18:17:56] * qchris cannot wait for the second part of the sentence :-)
[18:17:58] ah, wait no, there is 5% more in squid now
[18:18:06] i was about to say it was 5% off, which is about expected
[18:18:13] but there are more in udp2log (aka squid) logs
[18:18:25] Heh :-)
[18:20:19] The udp2log file you are comparing against, is that some of the files from /a/squid/... or a live capture?
[18:20:50] a/squid
[18:20:56] 20140912
[18:21:08] stat1002:/a/otto/sampled
[18:21:22] mhmmm.
[18:21:22] oh, wait
[18:21:24] i didn't remove nginx!
[18:21:32] hmmm
[18:21:33] They are not in the sampled stream.
[18:21:39] no?
[18:21:40] oh
[18:21:45] wait
[18:21:48] why wouldn't they be?
[18:21:56] ah but the proxied varnish won't be
[18:21:57] hm
[18:21:57] (The nginx-s were the reason why I asked about the live capture)
[18:21:58] right
[18:22:17] wait
[18:22:19] yes it will
[18:22:26] OH
[18:22:31] because it is coming from erbium?
[18:22:36] Erbium does not see nginx-s.
[18:22:37] which is not on multicast?
[18:22:38] Right.
[18:22:40] rigihghghhg
[18:22:41] t
[18:23:05] hm
[18:23:06] k
[18:23:23] Are the 5% difference spread across all servers/clusters?
[18:23:29] Or are some more affected than others?
[18:23:53] checking
[18:24:04] Also ... do they affect only some parts of wikis?
[18:27:26] ok, first off, i was filtering for bits url
[18:27:36] bits servers do serve other reqs than that
[18:27:42] i will filter bits server hostnames
[18:28:27] however, it looks as if some upload hosts have many more requests in udp2log
[18:28:40] about 20K more per upload
[18:30:52] Yup. Same here, if I look at your files.
[18:35:22] Do the upload servers have a different setup than other servers that would explain that??
[18:37:20] just to verify. after correctly filtering bits
[18:37:24] if I don't count requests from uploads
[18:37:36] kafka has 3401958
[18:37:36] udp2log has 3402133
[18:37:39] for this hour
[18:37:46] close, but still more in udp2log, which is unexpected
[18:37:58] hm, but
[18:37:59] hm
[18:38:00] actually
[18:38:01] not really
[18:38:10] because this is sampled, can we expect that the sizes will be about the same?
[18:38:20] udp2log will sample 1/1000 requests it receives
[18:38:28] if requests are dropped, it won't matter
[18:38:36] udp2log will still sample the same amount of requests
[18:38:38] right?
[18:39:16] so, off by .005% can be explained away by log rotate
[18:39:17] I am not surprised that udp2log has a few more.
[18:39:19] i think
[18:39:24] That can be explained away.
[18:39:27] ok cool.
[18:39:31] ok then, uploads..hm?
[18:39:35] However, seeing that the 20K amount to
[18:39:40] ~10%
[18:39:53] And other uploads have an ~10% drop too
[18:40:17] Can it be that we're only consuming 11 of the 12 partitions (or whatever they are called) from upload.
[18:40:19] ja, kafka uploads: 4455796
[18:40:20] udp2log uploads: 4860647
[18:40:44] (We had a case before where some part ~10% from mobile were missing, which we explained that way)
[18:41:07] hm
[18:41:26] input [encoding=json] kafka topic webrequest_upload partition 0-11 from stored
[18:41:34] input [encoding=json] kafka topic webrequest_mobile partition 0-11 from stored
[18:41:34] etc.
[18:41:43] Oh sure.
[18:41:50] ja, but something?
[18:42:03] i'll see if I can look at some of the missing requests
[18:42:24] hm, i thought those were because i never changed kafkatee configs when we reinstalled the cluster and made more partitions
[18:42:31] That was a different issue.
[18:42:34] ah
[18:42:39] what issue are you talking about?
[18:42:52] Let me find the bug ...
[18:44:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=64181
[18:44:53] ottomata ^
[18:45:37] This was also around kafkatee. Also ~10% drop. Also config did not change.
[18:46:25] hm
[18:46:35] Back then it was 10 partitions -> 1/10 loss -> 10%
[18:46:51] Now we have 12 partitions, -> 1/12 loss -> 8.33% loss.
[18:47:02] Compare that to the ratio you gave above.
[18:47:26] 1 - (4455796 / 4860647) = 8.329%
[18:47:45] That matches really, really well.
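For the record, the back-of-the-envelope model qchris is applying above: assuming requests are spread roughly uniformly over the n partitions of a topic, silently failing to consume k of them should cost about k/n of the traffic, which for 1 of 12 upload partitions lines up with the observed kafka/udp2log ratio:

    \text{expected loss} = \frac{k}{n} = \frac{1}{12} \approx 8.33\%
    \qquad
    \text{observed loss} = 1 - \frac{4455796}{4860647} \approx 8.33\%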
[18:50:17] hmmm
[18:51:51] I see, you're not yet convinced although percentages agree up to the second digit after the decimal point :-)
[18:51:59] If my theory is wrong, we should be able,
[18:52:17] to find requests from each of the upload partitions in the kafkatee produced tsv.
[18:52:35] haha, that means I am convinced!
[18:52:38] haha
[18:52:45] Hahahaha.
[18:53:09] But seriously, can we check this somehow without too much effort?
[18:53:47] yeah, let's find one of those lines
[18:53:52] and check to see if it is in hdfs
[18:53:56] but, i gotta run an errand
[18:53:59] i will be back in a bit
[18:54:08] ok.
[19:14:54] ottomata: When you come back, let's start checking with partitions 8 and 10 from webrequest_upload.
[19:15:06] Looking at the output of
[19:15:49] tail -n 1 and comparing it to
[19:15:59] tail -n 1 tail -n 1 tail -n 1 partition 8 of upload clearly stands out
[19:16:31] DarTar: If you have a few minutes today, I'm not having much luck getting a limn graph set up, although I think I'm 99% there.
[19:16:36] (although it's marked 'fetch_state: "active"')
[19:16:45] And for
[19:16:57] head -n 1 partition 10 stands out.
[19:17:44] (Note that for today's file, it seems like we could already be missing two partitions for upload, so it might well be partitions 8 and 10)
[19:30:00] Digging deeper into today's file, the second partition seems to have been dropped between 14:00:00 and 18:00:00 yesterday. There, the difference between udp2log and kafkatee jumps from ~8% to ~16%
[19:36:39] ^ was from looking at the tsvs.
[19:37:42] Looking in the kafkatee.stats.json, the reported partition for upload switches from 10 on 16:04:58 to 8 on 16:06:24, which
[19:38:04] is pretty much in the middle of the above window identified from the tsvs.
[19:39:17] kafkatee got restarted between 16:05:04 and 16:06:09.
[19:49:48] (PS3) Nuria: Bootstrapping from url [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887)
[19:51:15] kaldari: can I help?
[19:58:15] milimetric: maybe in a bit, gotta go to another meeting now :(
[19:58:29] np, i'm around for a bit
[20:32:21] ottomata: Welcome back :-)
[20:32:30] while you've been away, I poked around a bit.
[20:32:30] It seems today's file is even missing two partitions :-)
[20:32:30] See
[20:32:30] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140917.txt
[20:32:30] starting at 19:14:54.
[20:32:53] yo
[20:33:52] qchris: iiiinteresting!
[20:38:42] whoa, qchris
[20:38:52] there is only one reporting upload partition at all?
[20:38:57] tail -n 1 kafkatee.stats.json | jq .kafka.topics.webrequest_upload
[20:38:58] partition 8
[20:39:05] I think librdkafka is misreporting there.
[20:39:09] hm
[20:39:23] But I think, that it points to the issue.
[20:39:32] yes, and that one is not updating at all
[20:39:43] next_offset.per_second is 0 in ganglia
[20:40:43] Ganglia has this? Awesome!
[20:40:48] I could not find it.
[20:40:55] Can you point me to it?
[20:41:19] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafkatee&hide-hf=false
[20:41:27] or, there should be more on analytics1003 too
[20:41:35] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=analytics1003.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[20:41:48] Awesome graph! Thanks!
[20:42:14] Oh. wait. It's only reporting partition 8 for upload?
[20:43:13] yes
[20:43:19] :-)
[20:44:04] so hm, i wonder if kafkatee has trouble restarting sometimes
[20:44:10] puppet doesn't do it properly usually,
[20:44:19] so, it tries to restart it on config change
[20:44:24] but fails (not yet sure why)
[20:44:28] usually I have to restart it manually
[20:44:31] Last time, the issue was resolved by restarting kafkatee by hand.
[20:44:32] maybe it takes a while to stop everything
[20:44:34] yeah
[20:44:40] and i betcha that would fix this too
[20:44:45] but, that does not make me confident in kafkatee
[20:44:53] although, i wonder if it restarts, if it will continue reading
[20:44:56] oo, let's look at offset files...
[20:46:58] It seems offsets for partition 8 and 11 are lagging.
[20:47:55] Yup. Partition 8 and 11 are not updating
[20:48:06] (According to the offset files)
[20:48:07] the others are?
[20:48:28] so, i see about 70 lines / second on erbium's sampled-1000
[20:48:39] and about 57 lines / second on kafkatee's
[20:48:48] ottomata: Yes, the others are updating.
[20:48:55] hm, interesting
[20:51:44] ok, qchris, let's restart kafkatee and see if those offsets start updating, and from where
[20:51:52] Ok.
[20:51:54] Sounds good.
[20:52:57] hm, qchris, curious that the mtime of most of the mobile offset files is old too
[20:53:28] but they are updating.
[20:54:43] Meh. Stupid me.
[20:54:55] I only checked the two that are so far behind.
[20:55:02] i just restarted kafkatee, upload-8 is updating now
[20:55:02] Those two are updating.
[20:55:04] ?
[20:55:12] eh?
[20:55:18] My ^ was about the mobile offsets.
[20:55:49] ah, the June mtime ones were not?
[20:55:54] Yup.
[20:55:57] hm, they still have June mtimes
[20:56:00] even after restart
[20:57:06] upload-8 is updating again.
[20:57:21] Maybe you rearranged the mobile ones, because they are so few in numbers?
[20:57:34] So the blocks get better used.
[20:57:47] ?
[20:57:48] Did you compare the mobile tsvs?
[20:58:16] not recently, i did make a change to mobile while you were gone though, likely unrelated
[20:58:28] when we started saving mobile tsvs with kafkatee
[20:58:32] we were not consuming any other topics
[20:58:38] so we did not filter out other hosts
[20:58:51] when we started consuming other topics, mobile logs then just contained 1/100 sampled of all logs
[20:59:03] i fixed it so it is using the same grep hostname filter that udp2log is now
[20:59:29] files between aug 23 and sept 10 are large
[20:59:33] Filesizes of the mobile ones look strange.
[20:59:34] for mobile-sampled-100
[20:59:35] Right.
[20:59:45] but, that's why ^^
[20:59:53] they weren't filtering for mobile hosts
[21:00:30] https://gerrit.wikimedia.org/r/#/c/159381/
[21:01:00] I just compared the gzipped sizes between udp2log and kafka generated tsvs, and they basically agree.
[21:01:07] (For the last few days)
[21:01:08] for mobile?
[21:01:13] Yes.
[21:01:15] cool
[21:01:45] milimetric: you still around?
[21:01:48] weird
[21:01:57] hi kaldari, yep
[21:02:03] so, kafkatee.stats.json does report mobile offsets changing
[21:02:07] but it isn't writing to file
[21:02:07] ?
[21:02:30] milimetric: So I'm trying to create a new graph on the ee dashboard (mostly as a learning exercise). Here's the code I checked in: https://github.com/wikimedia/limn-editor-engagement-data/commit/6ba9feb60267e9302a5ceeaa3a6d2cb5a73b156f
[21:02:47] It creates a new tab, but it never retrieves the data or draws the graph
[21:03:04] not even for partitions 10 and 11?
[21:03:12] those two change for me.
[21:03:46] i'll pull and try it out
[21:04:01] qchris, aye, those do
[21:04:04] just not the old ones
[21:05:47] huh, qchris, but still only 1 upload partition in stats.json
[21:05:54] this time partition 11
[21:06:10] So I guess all are working except partition 11 :-)
[21:06:59] The corresponding offset file for upload's partition 11 has mtime from yesterday.
[21:07:25] ah, kaldari, this is probably one of the stupidest things about limn
[21:07:36] version: 0.1.0 is not what it should be
[21:07:41] yeah, huh
[21:07:43] it's the version of the schema of the graph
[21:07:44] very strange
[21:07:49] so you just need version: 0.6.0
[21:07:54] "graph_version": "0.6.0",
[21:08:06] your graph works fine after that
[21:08:25] also weird that stats.json does not show any partition stats except for the one that is not updating
[21:08:55] yup. I checked kafkatee before, and the stats seem to get produced by librdkafka.
[21:09:08] I cloned the repo, but did not yet get further there.
[21:09:13] but kaldari: ideally just leave out graph_version completely, it's only really useful if you have old-style graphs so limn can still understand them
[21:09:13] ja, i think that's right
[21:09:21] varnishkafka has some extra stats, but i think that's right
[21:09:29] kafkatee uses a callback to write the stats out
[21:09:36] from librdkafka
[21:09:36] right.
[21:09:51] So which of the three issues to tackle first?
[21:11:00] oof
[21:11:11] so, lemme summarize to make sure I understand 3 :)
[21:11:39] 1. some offset files not updating
[21:11:39] 2. stats.json not reporting some partition stats
[21:11:39] 3. some partitions not writing at all
[21:12:17] I think 1 and 3 are the same thing, aren't they.
[21:12:57] hm, no, because, mobile-9 offset file is not updating, but mobile-9 partition is writing data (out to kafkatee outputs)
[21:12:58] I'd replace 1 by Mobile is only having 2 partitions.
[21:13:07] hm.
[21:13:09] Oh.
[21:13:10] Ok.
[21:13:15] ok
[21:13:16] wait
[21:13:17] uhh
[21:13:18] yes
[21:13:36] that's right, because stats.json reports all mobile partitions with updating offset...right?
[21:14:04] No clue.
[21:14:12] yes
[21:14:15] I just looked at those files for the first time 1 hour ago :-)
[21:14:39] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=analytics1003&mreg%5B%5D=.*.webrequest_mobile.partitions..*.next_offset.per_second&gtype=stack&glegend=show&aggregate=1
[21:14:57] that data comes from stats.json
[21:15:11] Yup. Looks good.
[21:15:48] yeah, this is pretty strange.
[21:15:48] hm
[21:15:53] not a good sign for kafkatee!
[21:16:40] Should we generate the tsvs from Hive instead?
[21:17:47] ?
[21:17:54] too early to say!
[21:18:06] Sure.
[21:18:15] but not a bad idea.....>..>>..>>>>.....i guess....
[21:18:16] he
[21:18:16] heh
[21:18:27] i'm going to write an email to snaps and CC you
[21:18:28] ja?
[21:18:39] Sure. Thanks.
[21:20:34] milimetric: Thanks so much!
[21:21:41] np
[21:22:55] hmm, qchris, i'm going to turn on kafka.debug, restart kafkatee and see if we get any interesting output in logs
[21:23:13] Awesome.
[21:47:57] milimetric: how often is data copied from stat1003 to stat1001?
[21:48:48] DarTar: ^
[22:18:05] milimetric, DarTar, it looks like I can no longer ssh to limn0 from bastion.wmflabs.org. Where do you go to pull changes for the live dashboard graphs now?
[22:18:41] kaldari: limn0's dead because it was in the tampa cluster
[22:18:47] limn1.eqiad.wmflabs
[22:19:08] ah, thanks!
[22:19:41] milimetric: updating the documentation...
[22:19:59] kaldari: the fabric deployer has the necessary information and it'll deploy the graph if you just do "fab ee_dashboard deploy.only_data"
[22:20:05] https://github.com/wikimedia/limn-deploy/blob/master/fabfile/stages.py#L222
[22:20:22] but in some people's case you have to pass your username to it with --user
[22:21:25] milimetric: I don't know about the fabric deployer, where do I run that from?
[22:26:03] milimetric: "The program 'fab' is currently not installed. To run 'fab' please ask your administrator to install the package 'fabric'"
[22:26:30] kaldari: sorry, here: https://github.com/wikimedia/limn-deploy
[22:26:36] clone that, then do pip install -e .
[22:27:27] "fab" is short for "fabric" which is a simplistic deployer. It executes commands over ssh
[22:27:37] so assuming you have access to ssh limn1.eqiad.wmflabs
[22:27:46] you will be able to run commands of the form:
[22:27:59] fab [stage] [command].[subcommand]
[22:28:06] in your case, all you would ever do is:
[22:28:13] fab ee_dashboard deploy.only_data
[22:28:48] that would basically update the ee_dashboard repo from github and clean up any broken symlinks
[22:30:46] milimetric: hmm, it's still installing. might be stuck in a loop.
[22:31:17] does it normally take a very long time to install?
[22:32:24] kaldari: i'm not sure, pip is buggy as hell
[22:32:55] kaldari: the easiest way by far to deploy this is to go "milimetric: deploy ee_dashboard"
[22:33:00] want me to do that?
[22:33:08] yes :)
[22:33:29] k, done
[22:33:58] fabric is great and these scripts we wrote are bullet proof. But stupid pip ...
[22:34:03] milimetric: Yay! Here's the shiny new dashboard: http://ee-dashboard.wmflabs.org/dashboards/enwiki-features#wiki_love-graphs-tab
[22:34:22] very cool kaldari
[23:28:17] nuria: hi, do you want me to merge that change for /var/lib -> /srv/wikimetrics ?
[23:28:40] i think all merges needed were done by ottomata
[23:28:49] ^ mutante
[23:28:52] https://gerrit.wikimedia.org/r/#/c/160679/1
[23:28:54] this is left open
[23:33:27] mutante: i have corrected that now, thank you
[23:33:43] thanks, just going through the queue :)