[11:13:25] Analytics / Quarry: Clicking "Download CSV" button does not download a CSV file - https://bugzilla.wikimedia.org/69074#c2 (Southparkfan) NEW>RESO/WOR CSV download works here. Marking this as WORKSFORME. [11:36:41] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) NEW p:Unprio s:normal a:None In this bug, we track issues around raw webrequest partitions (not) being marked successful. [12:17:10] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [12:17:11] Analytics / Refinery: Raw webrequest partition monitoring did not flag data for 2014-08-18T13:..:.. as valid for text caches - https://bugzilla.wikimedia.org/69854 (christian) [12:17:12] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian) NEW p:Unprio s:normal a:None In this bug we'll track issues of kafka partition leader elections causing packet loss [12:20:30] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [12:20:39] Analytics / Refinery: Make webrequest partition validation handle races between time and sequence numbers - https://bugzilla.wikimedia.org/69615 (christian) [12:24:39] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian) [12:24:40] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667 (christian) [12:38:45] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages since 2014-08-16 ~07:30 - https://bugzilla.wikimedia.org/69666#c2 (christian) NEW>RESO/FIX It works again since 2014-08-18 (See bug 69854). [12:39:44] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages since 2014-08-13 ~10:00 - https://bugzilla.wikimedia.org/69665 (christian) [12:39:45] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages since 2014-08-16 ~07:30 - https://bugzilla.wikimedia.org/69666 (christian) [12:39:45] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages since 2014-08-06 ~1:44 - https://bugzilla.wikimedia.org/69244 (christian) [12:39:46] Analytics / Refinery: Kafka partition leader elections causing a drop of a few log lines - https://bugzilla.wikimedia.org/70087 (christian) [12:39:47] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667 (christian) [12:40:55] Analytics / General/Unknown: Raw webrequest partitions for 2014-08-23T20:xx:xx not marked successful - https://bugzilla.wikimedia.org/69971 (christian) [12:41:15] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [12:45:37] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [12:45:38] Analytics / Refinery: Raw webrequest partitions for 2014-08-25T11:xx:xx not marked successful - https://bugzilla.wikimedia.org/70088 (christian) NEW p:Unprio s:normal a:None For the hour 2014-08-25T11:xx:xx, none [1] of the four sources' buckets was marked successful. What happened? [...
[12:48:01] Analytics / Refinery: Raw webrequest partitions for 2014-08-25T1[67]:xx:xx not marked successful - https://bugzilla.wikimedia.org/70089 (christian) NEW p:Unprio s:normal a:None Five partitions [1] on 2014-08-25T1[67]:xx:xx were not marked successful. What happened? [1] _____________... [12:48:07] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [12:50:29] Analytics / Refinery: Single raw webrequest partitions for 2014-08-26T16:xx:xx not marked successful - https://bugzilla.wikimedia.org/70090 (christian) NEW p:Unprio s:normal a:None The upload partition for 2014-08-25T1[67]:xx:xx was not marked successful. What happened? [1] ________... [12:50:40] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [13:16:29] Analytics / Refinery: Raw webrequest partitions for 2014-08-25T1[67]:xx:xx not marked successful - https://bugzilla.wikimedia.org/70089#c1 (Andrew Otto) I deployed a varnishkafka change here, so this would have caused sequence numbers to reset. [13:16:47] qchris, thanks for the bugs! TOO MANY! [13:16:58] :-) [13:17:00] :) [13:17:08] Sorry for them. [13:17:10] haha, s'ok [13:17:22] I just filed what I need to investigate :-/ [13:17:43] Some more udp2log things will come too, as we had lots of alerts over the past days it seems. [13:18:00] https://bugzilla.wikimedia.org/showdependencygraph.cgi?id=70085&display=web&rankdir=BT [13:18:00] i think i have a hard time with so many different ones, would it be easier to track the related ones in a single bug? [13:18:12] i'm fine with being verbose in the bug about different parts [13:18:22] OOO [13:18:29] and I can click on them [13:18:32] that's cool [13:18:34] Yes :-D [13:18:38] Bugzilla is nice. [13:18:52] The green ones are open bugs. [13:47:59] Analytics / General/Unknown: Packetloss_Average alarm on erbium on 2014-08-23 - https://bugzilla.wikimedia.org/70092 (christian) NEW p:Unprio s:normal a:None On 2014-08-23 erbium reported a Packetloss_Average alert on 20:23:14 and recovery ~20 minutes afterwards [1]. What happened? Was t... [13:50:14] Analytics / General/Unknown: Packetloss_Average alarm on erbium on 2014-08-23 - https://bugzilla.wikimedia.org/70092 (christian) a:christian [13:50:27] Analytics / General/Unknown: Raw webrequest partitions for 2014-08-23T20:xx:xx not marked successful - https://bugzilla.wikimedia.org/69971 (christian) [13:50:43] Analytics / Wikimetrics: replication lag may affect recurrent reports - https://bugzilla.wikimedia.org/68507#c9 (christian) PATC>RESO/FIX All relevant changes have been merged. [14:56:20] qchris: A [14:56:24] chattychatty [14:56:25] ottomata: B [14:56:39] So two things. webstatscollector and gadolinium. [14:56:51] How do we proceed with webstatscollector? [14:58:04] :-) So I guess we'll let it sit for now, and I'll have a look after I checked the production issues. [14:58:10] Does that sound ok? [14:59:35] well, if it is doing ok, i guess i might as well repackage and puppetize it [14:59:37] the production issues? [14:59:39] oh the kafka ones? [14:59:57] Yes. Kafka + udp2log + ... [15:00:11] But kafka webstatscollector is not doing well really. [15:00:15] It's running. [15:00:22] It does not lose packets.
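As background for the partition-validation bugs above (70085, 70088, 70089, 70090, and bug 69615 on time/sequence races): the check essentially compares, per varnish host, how many log lines arrived against how many the varnishkafka sequence numbers say should have arrived. Below is a minimal sketch of that idea; the field positions (column 1 = hostname, column 2 = sequence number) and the input file name are assumptions for illustration, not the actual refinery job.

    #!/bin/bash
    # Sketch of the sequence-number completeness check behind the
    # "partition not marked successful" bugs above. A partition looks
    # complete when, per host, (max - min + 1) == number of lines seen.
    awk -F'\t' '
    {
      host = $1; seq = $2 + 0
      count[host]++
      if (!(host in min) || seq < min[host]) min[host] = seq
      if (!(host in max) || seq > max[host]) max[host] = seq
    }
    END {
      for (h in count) {
        expected = max[h] - min[h] + 1
        # A varnishkafka restart (like the deploy Andrew mentions above)
        # resets sequence numbers, which makes "expected" misleading for
        # that hour and produces exactly this kind of false alarm.
        printf "%s\texpected=%d\tactual=%d\tmissing=%d\n", h, expected, count[h], expected - count[h]
      }
    }' "${1:-webrequest_sample.tsv}"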
[15:00:24] qchris, If it were me, I would not look into any kafka missing messages until the next time there is a leader change [15:00:33] seeing as I made that change to varnishkafka this week [15:00:44] i'm hoping that a leader change will no longer cause drops [15:00:56] But the issues we saw are different. [15:01:12] Like there was likely a network issue that also affected udp2log. [15:01:19] at the same time? [15:01:24] Yes. [15:01:35] aye [15:01:36] hm [15:01:38] And some canary monitoring reported 80% loss for a different time. [15:01:48] And some such issues. [15:02:41] So about packaging kafka's webstatscollector ... that's too early from my point of view. [15:03:03] I'd postpone that until we're somewhat sure it is behaving [15:03:03] ok [15:03:04] ok, so, gimme a summary of what's up then [15:03:19] most projects look ok [15:03:25] but some smaller projects are missing up to 20%? [15:03:35] I did not check all projects, but [15:03:45] when I looked today in the UTC morning, [15:03:53] some smaller wikis were missing up to 20%. [15:03:54] yes. [15:04:11] We should not have fewer pageviews than udp2log. [15:04:17] So something must be wrong. [15:04:47] We could check against the hive tables for some small projects [15:04:58] and check if anything is obviously wrong. [15:05:36] Also, the timestamps of the projectcount files do not look too good. [15:05:48] 15 seconds after the full hour is unusual. [15:06:38] So I expect something is wrong there. [15:06:51] One might also check the "unable to flush" error messages. [15:07:07] That might also explain why we're seeing different counts. [15:08:02] Not sure ... are those things just "qchris being overeager", or do they make sense outside of my head too? [15:08:36] hm [15:08:36] yeah i'm not sure about the unable to flush things [15:09:10] I am not sure either. Should be the same on gadolinium. Just that it has not been noticed before. [15:12:43] since analytics1003 is not especially busy, we could just set up a second filter -> log2udp -> collector pipeline. [15:13:11] That would allow us to compare "vanilla collector" and "collector with fixed unlink / unable to flush". [15:13:57] If they agree, the pipeline does not have random drops, and the unlink is not an issue. [15:14:09] If they don't, we have one thing to investigate :-) [15:17:27] where are those logs written? [15:17:30] on gadolinium? [15:18:06] qchris: i'm looking at the latest hour's file now too [15:18:06] projectcounts [15:18:09] en has 1.3M MORE requests in kafkatee file [15:18:21] hm, gadolinium has recv errors [15:18:21] and um, a lot more in the last 24 hours [15:18:21] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=oxygen%7Cerbium%7Cgadolinium&mreg[]=UDP_RcvbufErrors&z=large&gtype=stack&title=UDP_RcvbufErrors&aggregate=1&r=day [15:18:22] the only thing that I know that has changed on gadolinium is that I stopped the unused udp2log instance [15:18:22] but, I did that at about 19:00 [15:18:22] and gadolinium started dropping packets at around 15:00 [15:18:55] * qchris is bad at doing two things at once. [15:19:27] About gadolinium ... yes. That's why I wanted to talk with you about gadolinium :-) [15:19:32] also in the dumps folder at [15:19:54] you seeing the minute level dump files? [15:20:05] i'm just noticing those [15:20:06] http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/ [15:20:09] 210000 [15:20:10] 220015 [15:20:10] 230001 [15:20:10] 000013 [15:20:10] 010006 [15:20:10] 020003 [15:20:10] 030015 [15:20:10] ...
[15:20:16] The dumps are showing strange timestamps. [15:20:33] Yes. Exactly. [15:20:47] But ... "minute level"? [15:20:47] yes [15:21:14] Where are those? [15:21:57] i mean, not on the hour [15:21:57] sorry, those are not minutes [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140826-210000, size 28K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140826-220015, size 28K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140826-230001, size 28K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140827-000013, size 28K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140827-010006, size 32K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140827-020003, size 28K [15:22:34] • http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/projectcounts-20140827-030015, size 28K [15:22:42] Ok. I see we are talking about the same things :-) [15:22:46] i was looking on gadolinium, but ja [15:22:58] Yes, that's what got me concerned in the first place. [15:23:05] yes [15:23:11] But CPU is a lot in io wait. [15:23:16] (on gadolinium) [15:23:28] And it started about the time when udp2log was turned off. [15:23:32] (yesterday) [15:23:57] oh? around 19:00? [15:24:25] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=large&c=Miscellaneous+eqiad&h=gadolinium.wikimedia.org&jr=&js=&v=15.0&m=cpu_wio&vl=%25&ti=CPU+wio [15:24:32] ottomata: Yes. ^ [15:25:35] i see it starting around 15:00 [15:25:35] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=gadolinium.wikimedia.org&r=day&z=default&jr=&js=&st=1409153096&v=12.4&m=cpu_wio&vl=%25&ti=CPU%20wio&z=large [15:25:37] which is before I messed with gadolinium [15:27:00] * qchris looks puzzled. [15:27:07] me too [15:27:09] That does not match the graphs I am seeing. [15:27:55] http://cl.ly/image/1I3a15051S34 [15:28:05] right? [15:28:23] Oh. Those windows are local time :-) [15:28:29] Look at the rendered PNGs. [15:28:33] They are UTC. [15:28:49] ? [15:29:03] The 15:00 is in your local timezone. Not UTC. [15:29:10] WUT [15:29:11] why would they do that [15:29:15] i have always used those inspect things to zoom in and find errors [15:29:20] i have never noticed that [15:29:21] AGH [15:29:24] I have no clue. [15:29:29] I found that soooooo annoying. [15:29:47] But meh. What can you do? [15:29:56] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=gadolinium.wikimedia.org&r=custom&z=default&jr=&js=&st=1409153264&cs=08%2F26%2F2014%2013%3A00%20&ce=08%2F26%2F2014%2021%3A00%20&v=9.3&m=cpu_wio&vl=%25&ti=CPU%20wio&z=large [15:30:04] ^ that is a PNG in UTC. [15:30:06] ok that is really annoying [15:30:06] ok. [15:30:06] um, then yes, it is likely because i messed with gadolinium [15:30:06] but how! [15:30:06] i turned something OFF! [15:30:34] No clue :-D [15:30:34] yeah, i like inspect because it's hard to see the times on those sometimes [15:30:47] i like to zoom in on the area and see the timestamp there [15:30:49] oh well [15:31:12] oook [15:31:13] weird [15:31:23] I noticed that total CPU stayed about the same. [15:31:37] qchris, i'm thinking about just restarting collector... [15:31:41] So what was gained by turning off the other jobs was picked up by "waiting" [15:31:53] 85 gb of virtual memory. [15:31:54] :-) [15:31:57] Yes. [15:31:59] Totally!
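Since the projectcounts filenames listed above encode their write time (projectcounts-YYYYMMDD-HHMMSS), the "seconds after the full hour" concern can be checked mechanically. The sketch below scrapes the public listing and flags any file whose MMSS suffix is not 0000; scraping the HTML index with curl/grep is just an assumed convenience here, not how the dumps are actually monitored.

    #!/bin/bash
    # Sketch: flag projectcounts dumps that were written late, i.e. whose
    # filename timestamp is not exactly on the full hour.
    BASE="http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/"
    curl -s "$BASE" \
      | grep -oE 'projectcounts-[0-9]{8}-[0-9]{6}' \
      | sort -u \
      | while read -r f; do
          mmss=${f: -4}                      # last four digits of HHMMSS
          if [ "$mmss" != "0000" ]; then
            echo "late dump: $f (+${mmss:0:2}m ${mmss:2:2}s past the hour)"
          fi
        done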
[15:37:54] !log webstatscollector restarted by ottomata on gadolinium [15:41:50] i see a few drops, but those were present before the change too [15:41:50] not dropping much there (yet) [15:42:03] Cool. Thanks! [15:45:56] ok so! let's watch drops on gadolinium, and if we get few drops there, let's try comparing again! [15:46:06] Great! [15:50:49] ah nope, looks like it is dropping pretty regularly again [15:50:50] ja [15:50:50] crap [15:50:50] dunno. [15:50:50] why would stopping udp2log make this happen? [15:50:50] so weird [15:59:12] that was a saracastic Great!? :) [15:59:15] sarcastic* [16:00:05] No. [16:00:13] Things looked good before I restarted. [16:00:21] But now ... load is high again. [16:00:28] wait is high again :-( [16:00:53] yes, i was saying that it is dropping pretty regularly [16:00:56] on gadolinium [16:01:08] but it happens in bursts [16:01:18] hourly :-) [16:01:27] hm, no i see it more frequently i think [16:01:31] collector starts with a fresh berkeley db every hour. [16:02:21] Again we're seeing different things on the same graph :-D [16:02:31] It's hourly for me :-D [16:02:40] http://ganglia.wikimedia.org/latest/graph.php?c=Miscellaneous%20eqiad&h=gadolinium.wikimedia.org&r=day&z=default&jr=&js=&st=1409155241&v=0.0&m=cpu_wio&vl=%25&ti=CPU%20wio [16:02:47] What graph are you looking at! [16:02:52] s/!/?/ [16:04:32] well, yes, it was hourly in that graph [16:04:32] i'm watching drops on gadolinium manually [16:04:32] in a loop [16:04:56] Oh. I see. [16:08:16] ottomata: Can you connect as usual to gadolinium etc.? I am having a hard time for the whole day. [16:08:32] Is the issue on my side, or does it work as usual for you? [16:09:29] qchris, i see drops about every 30 seconds, although not very many [16:09:29] and it looks like there was a low level of drops before the udp2log change there yesterday anyway [16:09:29] but, when i started collector, there were some irregular groups of lots of drops [16:09:29] but now it seems to just happen every 30 seconds [16:09:29] about [16:09:29] but, not enough to be concerned with [16:09:29] seeing as it was happening before [16:09:29] ha, if we wanted to, we could do the tempfs thing on gadolinium's webstatscollector [16:10:09] Not sure I want to change both systems at once :-) [16:10:27] If tempfs is causing issues, we would not notice. [16:14:12] i can connect as usual, occasionally the connection is bumpy [16:14:13] it will hang for a few seconds sometimes [16:14:13] but for the most part it works [16:14:13] ah, like right now, lots of drops [16:14:13] and now, cleared u [16:14:13] cleared up [16:14:13] ah nope, still happening :/ [16:14:13] https://gist.github.com/ottomata/f7bed2e9e8969e81fad2 [16:14:13] yeah [16:14:13] agree [16:14:17] i just hate that we touched gadolinium yesterday to fix a different problem, and now our experimental control is all messed up [16:15:44] qchris: because this is taking up too much time, i think you might want to start focusing on the hive query instead of this. i'll keep working with this and bothering you about it, but the hive query is really what we want to get out of this whole thing [16:16:02] ok. [16:18:26] heading home for lunch, back on in a bit [16:18:30] btw. ... the queue size is really small (4MB) for gadolinium. [16:18:35] Too late :-) [17:04:40] qchris, your idea about running a second collector before... [17:04:49] if we do tempfs on an03 [17:05:00] i bet we could run a udp2log based webstatscollector there [17:05:05] and use that as our control instead [17:05:06] ?
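For reference, "watching drops ... in a loop" can be as simple as polling the kernel's UDP receive-error counter and printing the delta. The snippet below is only a sketch of that kind of watch, not the exact loop ottomata was running; the netstat counter and the 30-second interval are incidental choices.

    #!/bin/bash
    # Sketch: sample the UDP receive-error counter every 30 seconds and
    # print how many packets were dropped since the previous sample.
    prev=$(netstat -su | awk '/receive errors/ {print $1; exit}')
    while sleep 30; do
      cur=$(netstat -su | awk '/receive errors/ {print $1; exit}')
      echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') dropped since last check: $((cur - prev))"
      prev=$cur
    done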
[17:28:36] * YuviPanda pokes Ironholds about pageviews [17:28:42] Ironholds: fwiw we don't care about the old app :) [17:28:53] YuviPanda, you have questions. Speak, mortal. The great oz is listening [17:29:09] actually the great oz in this scenario is Erik or Christian, but I'm around too. [17:29:11] Ironholds: you said you'll get us pageviews data!!!!! :) [17:29:18] Ironholds: in the meeting yesterday! [17:29:52] you've been hanging around with the product people haven't you [17:30:48] Ironholds: :P [17:31:04] Ironholds: you could write the hive query, and I can puppetize cron it as well if you don't have time for that [17:32:06] why puppetize? [17:32:25] and it's not just a cron query, it also needs timestamp handling and UA parsing [17:32:37] hello i am in meeting, but perhaps the two of you want to talk to me about that shortly :) [17:32:59] Ironholds: why? why can't you just do a "LIKE WikipediaApp/%" and then split on iOS and Android? [17:33:04] Ironholds: app UA is super simple [17:33:06] ottomata: indeed :) [17:35:36] YuviPanda, and then split by tablet, versus phone? Eh, no massive reason, it just increases complexity [17:35:51] now go write a UDF to turn log TSes into POSIX TSes and I'll agree with you ;p [17:37:03] Ironholds: I have wondered if we should change timestamp format... [17:37:03] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps [17:37:12] Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...] [17:37:20] i think that would allow them to be used with hive datetime functions [17:39:04] that sounds great [17:39:22] it'd also allow for Limn compatibility and it's a common format a lot of analytics languages can handle [17:39:30] Ironholds: no, we don't need to split tablet vs phone either, at least for now [17:39:43] note that we have *no* pageview data, so having *any* would be an improvement :) [17:39:47] to turn that into a POSIX timestamp, you need as.POSIXlt() [17:40:13] ottomata: ping when you're out of the meeting so I can bug you with the rsync patch :) [17:40:21] at the moment I need strptime and to remember when it's %m versus %M and augh [17:40:33] YuviPanda, yes, except then I have to regenerate things later. I'd rather do it right ;p [17:40:47] if you don't think we need that split go argue with howie and/or dan whose doc disputes that [17:41:05] disputes what? [17:41:24] YuviPanda: will do. [17:41:36] Ironholds: i think it would be relatively easy to make that change, but would make old logs incompatible with new ones [17:50:41] YuviPanda: meeting over [17:51:01] ottomata: https://gerrit.wikimedia.org/r/#/c/156324/ [17:51:12] j [17:51:13] a [17:51:23] so, both rsyncs use --delete [17:51:33] right, so if they have same names it'll fuck up? [17:51:35] meaning the second will delete files from the first that are not in the second [17:51:37] no [17:51:38] oh [17:51:40] damn [17:51:41] right [17:51:56] should we put that in a different folder? [17:52:06] i think that is a better way to do it [17:52:23] we could add --exclude=thisnewdirectory to the first [17:52:28] and then just sync thisnewdirectory in the second [17:52:35] not sure if that is the best way [17:53:03] ottomata: right.
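To make the app pageview count discussed above concrete, here is a rough sketch of the kind of Hive query involved. The table, column, and partition names (wmf_raw.webrequest, user_agent, dt, year/month/day) are assumptions and may not match the 2014 schema; splitting platforms on an 'iOS' substring is likewise only illustrative, and the timestamp comment refers to the format change ottomata floats above.

    #!/bin/bash
    # Sketch of the "LIKE 'WikipediaApp/%'" app pageview count. Table and
    # column names are assumptions, not the actual 2014 webrequest schema.
    hive -e "
    SELECT
      IF(user_agent LIKE '%iOS%', 'iOS', 'Android') AS platform,
      -- If dt were stored as 'yyyy-MM-dd HH:mm:ss' text (the format change
      -- discussed above), Hive's date functions could be applied directly;
      -- here we just bucket by the hour prefix of the raw string.
      SUBSTR(dt, 1, 13)                             AS request_hour,
      COUNT(*)                                      AS requests
    FROM wmf_raw.webrequest
    WHERE year = 2014 AND month = 8 AND day = 26
      AND user_agent LIKE 'WikipediaApp/%'
    GROUP BY
      IF(user_agent LIKE '%iOS%', 'iOS', 'Android'),
      SUBSTR(dt, 1, 13);
    "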
we could perhaps setup a new directory structure, such as public-2, public-3, and just symlink public-datasets to public-s [17:53:11] which will be cleaner [17:53:29] well, i don't think we can mess with the dir as is right now [17:53:36] people link to the files where they are [17:53:46] but, we can make a subdirectory for stat1002 [17:53:51] and exclude that from the sync for stat1003 [17:53:54] so [17:54:08] ottomata: yeah, and the symlink would ensure that the current structure works as is, no? [17:54:13] stat1002:public-datasets/ -> stat1001:public-datasets/ [17:54:19] oh [17:54:21] ehhhh [17:54:30] i'd really rather not mess with it, that thing is so messed up right now [17:54:49] heh [17:54:52] oh well [17:54:53] ok [17:54:58] e.g. [17:55:05] we're then just adding messes to the mess, no? :) [17:55:16] and we probably need to split that thing into quite a few modules [17:55:19] https://gist.github.com/ottomata/3a8d14d22abc558f166c [17:55:54] would it be so bad to just have a subdirectory? [17:56:05] http://datasets.wikimedia.org/public-datasets/ [17:56:16] heh [17:56:27] actually [17:56:27] http://datasets.wikimedia.org/ [17:56:30] if you wanted to put it there [17:56:30] ottomata: I'll just amend the patch to do a subdir for now? [17:56:36] ottomata: hmm? [17:56:37] i think that might be fine... [17:56:37] hm [17:56:44] oh [17:56:50] above public-datasets [17:56:51] just outside of public-datasets even? [17:57:00] since 'public-datasets' is canonically on stat1003 right now [17:57:04] wat [17:57:13] if you have datasets.wikimedia.org that's fully public [17:57:17] haha, yes [17:57:18] no [17:57:21] i mean the directory name [17:57:28] i mean [17:57:29] *why* have a subfolder called public-datasets?! [17:57:29] right now [17:57:33] history! [17:57:37] heh [17:57:40] there is no good reason other than that [17:58:18] it is a mess [17:58:25] heh [17:58:26] but i don't want to try to clean it up right now [17:58:29] yeah [17:58:31] tech debt! [17:59:00] ottomata: so I'll just rsync it to /var/www/? [17:59:03] so ja, if you came up with a good name [17:59:19] you could make a separate directory just for syncing from stat1002 -> datasets/ [17:59:21] stat1002? :) [17:59:27] nope, bad name! [17:59:27] :) [17:59:31] pffft :P [17:59:39] even though that is what it is, we shouldn't tie this to hostnames [17:59:40] mabe-public-datasets? :) [17:59:42] haha [17:59:45] aggregate-datasets? [17:59:51] kinda redundant... [17:59:51] good enough [17:59:59] indeed [18:00:04] but then so is public-datasets :P [18:00:07] yup [18:00:11] oof [18:00:24] sooo annoying [18:00:26] ottomata: we could also turn off -delete and just put them in public-datasets and hope people don't write files with same names on both hosts [18:00:40] well, they'd both need to have --delete turned off [18:01:07] YuviPanda: do you predict needing more than just this one dataset you are thinking of to be synced there? [18:01:08] yeah [18:01:14] like, in the forseeable future (1y?) [18:01:17] ottomata: I think it's a good general thing to have, yeah. [18:01:20] indeed [18:01:21] yeah...probably so [18:01:23] as I play more with Hive. [18:01:31] ok ok ok.......... 
[18:01:35] sigh, ok [18:01:48] let's do a subdir with --exclude and keep --delete [18:01:49] for now [18:01:58] nooooo, I'll just call it aggregate-data [18:01:59] :) [18:02:00] HMMMM [18:02:01] and let it be [18:02:02] actually [18:02:08] yeah, i think that is better, it will be easier to clean up later [18:02:14] yup [18:02:20] messing with the rsync command feels dangerous to me [18:02:20] ok ok [18:02:22] esp with --delete [18:02:27] since boom suddenly nothing is there [18:07:52] Ironholds: you can convince limn to accept any datetime format btw, so don't let that stop you [18:09:26] milimetric, oh, cool! [18:13:38] (PS2) Milimetric: Add wikimetrics api and data converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 [18:14:08] nuria_: ok, so api and converter are split out now, full with mock-y testing: https://gerrit.wikimedia.org/r/#/c/156453/ [18:14:25] wanna sync up again? I'd like to go to the next thing [18:22:46] (PS3) Nuria: Set up layout and basic pieces [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826 (owner: Milimetric) [18:48:03] (PS4) Milimetric: Set up layout and basic pieces [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826 [18:54:43] (CR) Nuria: "Looks a lot better with an api and data muncher separated. Modules needs comments as to what they do and I would add a more through descri" [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 (owner: Milimetric) [18:58:09] (CR) Nuria: [C: 2] Set up layout and basic pieces [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826 (owner: Milimetric) [19:08:23] (PS3) Milimetric: Add wikimetrics api and data converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 [19:09:38] (CR) Nuria: [V: 2] Set up layout and basic pieces [analytics/dashiki] - https://gerrit.wikimedia.org/r/155826 (owner: Milimetric) [19:22:06] (PS2) Milimetric: Add generic timeseries visualizer [analytics/dashiki] - https://gerrit.wikimedia.org/r/156346 [19:23:50] qchris_away: i dunno about this gadolinium webstats thing, i'm worried about it [19:23:57] it's dropping packets, and people will complain that data is missing [19:24:02] Yup. [19:24:04] i don't know what started causing it [19:24:13] but, if tempfs fixes it, we might want to do it... [19:24:22] Yes, you're right. [19:24:47] Better a tempfs fix than no fix. [19:24:52] ja... [19:25:02] Did you find the process that is writing the 20MB/s? [19:25:10] Was it the collector process itself? [19:25:17] hm, no i'm not sure I understood that email, lemme read it again [19:25:54] oh, there are other processes there [19:25:58] there is an nginx udp2log instance there [19:26:05] also fundraising [19:26:08] oh no [19:26:11] that's what we removed [19:26:11] sorry [19:26:13] Yes, but that has not been producing files for ages. [19:26:20] nginx? [19:26:25] Yes. [19:26:36] (At least I could not find any under /a/...) [19:26:46] woah weird [19:27:12] dunno why that would be [19:27:12] Yes. [19:27:27] I am starting the google machine. [19:27:42] Maybe we can trap chat and go over the things? [19:30:08] qchris [19:30:10] iotop shows [19:30:21] iotop :-) [19:30:29] flush-252:0 [19:30:33] heheh, yeah just installed it :p [19:30:37] I really want that capability :-) [19:30:39] sooo, is that berkeley db? [19:30:42] No root. no fun. [19:31:09] IIRC, flush is kernel writing to disk. [19:32:44] ja sooooooo [19:32:52] It might be berkeley db, but I somewhat doubt it. [19:33:03] your q is, if buffer is limited to 4MB, then why is disk flushing 10-15MB?
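The rsync arrangement settled on above (keep --delete, give stat1002 its own aggregate-datasets subdirectory, and exclude that subdirectory from the stat1003 sync) would look roughly like the following. Hostnames, paths, and the pull-from-the-web-host direction are assumptions drawn from the conversation, not the actual puppet change in Gerrit 156324.

    #!/bin/bash
    # Sketch of the two sync jobs, as run from the host serving
    # datasets.wikimedia.org. Keeping --delete stays safe because each job
    # owns a disjoint part of the destination tree.

    # stat1003 remains the canonical source for public-datasets/, but the
    # new subdirectory is excluded so this job never deletes stat1002's files.
    rsync -av --delete --exclude='aggregate-datasets/' \
      stat1003.eqiad.wmnet:/srv/public-datasets/ \
      /srv/datasets/public-datasets/

    # stat1002 only ever writes into the carved-out subdirectory.
    rsync -av --delete \
      stat1002.eqiad.wmnet:/srv/aggregate-datasets/ \
      /srv/datasets/public-datasets/aggregate-datasets/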
[19:33:04] It would be several orders of magnitude more writes than I'd expect. [19:33:26] but then again ... when I profiled them, profiling slowed things down. So it might still be it. [19:33:43] Oh. No. That was not my question. [19:33:53] My question was twofold. [19:34:10] in iotop, collector is showing ~15 M /sec writes [19:34:10] 1st: Could we bump the udp receive buffer to something like 512MB? [19:34:12] in io [19:34:39] flush is doing writes in K / s [19:35:02] um, yes, although I would prefer to do it for the process rather than the default [19:35:15] i can do it global default temporarily to see though [19:35:24] Sure, per process is fine too. [19:36:35] 15M/sec? Mhmmm. We're expecting <<100K lines filter output per second. That would be ~100 bytes per row. [19:36:51] That sounds too big. Meh. [19:37:10] If that's the number. Then that's the number. [19:38:33] So throwing it in tmpfs and bumping the udp receive buffer size sounds like a good thing to me. [19:38:45] It should bring webstatscollector back to life. [19:38:54] ok, we will have to compile a new webstatscollector... [19:39:38] For the tmpfs, the vanilla collector would do (if it is started in the tmpfs) [19:39:57] well, we don't want the dumps to be written to the tempfs [19:40:27] But we could make "dumps" a symlink to the place where the files should end up in. [19:40:37] hm, ok [19:40:42] No recompilation. No worries about breaking something. [19:41:02] vanilla does not have the kafkatee proxy logic? [19:41:14] The master branch should be fine. [19:41:29] Just the version that has been debianized. [19:41:38] Let me rephrase that ... [19:41:47] "The debianized version should be fine" [19:42:08] It creates the Berkeley DBs in the current directory. [19:42:34] So one only needs to take care of the current directory when starting it. [19:43:45] ohoh [19:44:06] ok [19:44:09] * qchris does not like "ohoh"s from ottomata. [19:44:17] * qchris does like "ok"s from ottomata :-) [19:45:52] ok, restarting collector.. [19:46:24] Do it :-D [19:47:37] Looks good. [19:48:32] !log restarted webstatscollector on gadolinium with berkeley db in tmpfs at /run/shm/webstats [19:48:32] 3:48 [19:48:38] !log restarted webstatscollector on gadolinium with berkeley db in tmpfs at /run/shm/webstats [19:55:27] no drops at all yet, qchris :) [19:55:34] Yes. Looks good :-D [19:55:40] and io is way low [19:55:49] Right. [19:56:12] I'm gonna add that I/O rate to the wikitech page. I for sure again would not believe it next time :-) [20:04:52] ha ok [20:05:23] Looks like the new collector is working nicely! \o/ [20:08:47] puppetizing that change now.. [20:09:00] You rock. [20:10:49] qchris: https://gerrit.wikimedia.org/r/#/c/156673/ [20:12:53] ottomata: Does the template exist in puppet already? [20:13:10] ah, git add [20:13:12] one moment... [20:14:24] Ironholds: you filling up /tmp/????:) [20:14:27] on stat1002/ [20:14:27] ? [20:14:46] ohp, alarm is now OK [20:14:47] :p [20:14:55] qchris: recheck [20:14:56] :) [20:15:06] Yes. Doing just that. [20:15:07] :-D [20:15:31] ottomata, done; sorry! [20:15:36] long story - will explain in ~30? [20:15:39] np [20:16:01] So the service depends on the webstatscollector package only indirectly through the init script. [20:16:09] TL;DR it's important to handle per-session temp directories Right when parallelising and forking code. [20:16:18] Oh. That should be fine. You're right.
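The fix that got !logged above — run the vanilla collector with its Berkeley DBs on tmpfs, keep the hourly dumps on disk via a symlink, and bump the UDP receive buffer — could be reproduced by hand roughly as below. The on-disk dump path, the collector binary location, and using sysctl for the 512MB figure are assumptions; only the /run/shm/webstats directory comes from the log entry itself.

    #!/bin/bash
    # Sketch of the manual steps behind the !log entry above.
    sudo mkdir -p /run/shm/webstats        # tmpfs-backed working directory
    cd /run/shm/webstats

    # The collector creates its Berkeley DBs in the current directory; point
    # its dump output back at real disk (path is an assumption).
    ln -sfn /a/webstats/dumps dumps

    # Raise the kernel's max/default UDP receive buffers (the chat floats
    # 512MB); a per-process SO_RCVBUF is still capped by rmem_max.
    sudo sysctl -w net.core.rmem_max=536870912
    sudo sysctl -w net.core.rmem_default=536870912

    # Start the debianized collector from the tmpfs working directory
    # (binary path is an assumption).
    /usr/bin/collector &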
[20:16:21] I may have a wider conversation about this to see if I can just do it in a less dumb way :D [20:16:44] (PS4) Milimetric: Add wikimetrics api and data converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 [20:23:32] (PS5) Milimetric: Add wikimetrics api and data converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 [21:50:13] Analytics / EventLogging: database consumer could batch inserts (sometimes) - https://bugzilla.wikimedia.org/67450#c1 (nuria) More detailed information on why this item is important when it comes to making EL data public is available in this e-mail thread: https://lists.wikimedia.org/pipermail/analyti... [22:10:01] backeth [22:52:16] Analytics / Wikimetrics: Labs instances rely on unpuppetized firewall setup to connect to databases - https://bugzilla.wikimedia.org/69042#c5 (christian) Automatic loading of iptables settings is getting implemented in https://gerrit.wikimedia.org/r/#/c/156599/ Once that has been merged, the issue... [23:12:39] (PS1) Milimetric: Add visualizer to coordinate selectors and graphs [analytics/dashiki] - https://gerrit.wikimedia.org/r/156722 [23:12:51] ^^ pro-style observable work :D [23:12:54] nitey nite [23:14:36] (PS6) Nuria: Add wikimetrics api and data converter [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 (owner: Milimetric) [23:15:18] (CR) Nuria: Add wikimetrics api and data converter (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/156453 (owner: Milimetric) [23:57:12] (PS1) BearND: Add edit funnel sql for apps [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/156735 [23:57:14] (PS1) BearND: Add edit funnel reports for apps to dashboards [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/156736