[05:29:00] ori: re hue is really slow - was it only a one time thing or just something that happens every time?
[05:29:47] we moved the DB from sqlite to mysql and the situation should have improved, but there may be some other stuff to do
[05:55:02] reading http://gethue.com/performance-tuning/
[06:08:27] maybe we could add moar threads, atm we have 40 (and we could go up to 50 from what I am reading)
[06:26:08] elukey: https://grafana.wikimedia.org/dashboard/db/t129963
[06:26:51] graphite has some fancy, advanced functions for grouping and extracting metrics, but I decided to not get too fancy and just accept some measure of copypasta
[06:28:36] is mc1009 really the only host we have with 1.4.25?
[06:28:40] in eqiad
[06:35:55] woa!
[06:36:37] ori: yes mc1009 is the only one.. mc1007 is running with gf 1.15, only difference with the others
[06:36:58] ah right
[06:37:02] I think we should experiment with more hosts
[06:37:22] newbie question for you: how is each data point calculated with statsd?
[06:37:33] or better, with graphite
[06:37:58] that's a long complicated question :P https://wikitech.wikimedia.org/wiki/Graphite has some notes
[06:37:59] just wanted to know where that data comes from, I guess that statsd grabs "stats" from memcached periodically right?
[06:38:23] ahhh ok! So I am going to RTFM today :)
[06:38:26] no, diamond does
[06:38:48] statsd doesn't go out and collect metrics; it just listens on a socket for metrics formatted using a very simple line protocol
[06:39:01] it accumulates them and flushes summary metrics every minute
[06:39:55] ahh okok, I always confuse them, sorry
[06:40:20] anyhow, one test that I'd like to do is to restart mc1009 for the weekend
[06:40:25] it looks like we configure diamond to write to graphite directly
[06:40:43] to see with the new metrics how the hit ratio behaves
[06:41:03] yeah, sounds good
[06:41:08] i actually have to run
[06:41:24] hue was slow today, i don't think it's chronic, but i don't use it very often
[06:41:25] sure, thanks for the dashboard! I'll try to merge stuff in the official one
[06:41:39] o/
[06:41:45] o/
[08:28:34] Hi elukey
[08:29:14] joal: o/
[08:30:07] compactions look completed (almost)
[08:30:33] elukey: Completed for the hosts we are interested in :)
[08:31:40] joal: unrelated question - do I need special permissions to access webrequest_raw via beeline?
[08:31:56] elukey: you don't need permissions, you need a jar :)
[08:32:17] elukey: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#JsonSerDe_Errors
[08:32:26] nono not that error
[08:32:28] :PP
[08:32:32] arrfff :)
[08:32:38] Permission denied: user=elukey, access=EXECUTE, inode="/wmf/data/raw/webrequest/webrequest_maps/hourly/2016/05":hdfs:hadoop:drwxr-x---
[08:32:48] * joal 's divination power is not working today
[08:32:52] ahhahaha
[08:33:04] nono Jo I am coming up with more stupid questions every day
[08:33:13] your divination is constantly improving
[08:33:14] ahhahaha
[08:33:18] :)
[08:33:34] anyway, that does look indeed like a permission thing ...
[08:33:47] I was trying something like SELECT * FROM webrequest WHERE year=2016 AND month=5 AND day=23 AND hour=0 LIMIT 5;
[08:33:53] to try it out
[08:35:15] one thing, don't forget to specify the webrequest_source when querying webrequest (either raw or refined)
[08:35:37] I put use wmf_raw;
[08:35:41] is it enough?
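A quick way to inspect the HDFS permissions behind that "Permission denied" error is sketched below (not from the log; the path is copied from the error message above, and running it from an analytics client node such as stat1002 is an assumption):

```bash
# List the directory from the error message and check owner/group/mode.
hdfs dfs -ls /wmf/data/raw/webrequest/webrequest_maps/hourly/2016/05

# If the mode is drwxr-x--- with owner hdfs and group hadoop, only hdfs (or a
# member of hadoop) can traverse it; impersonating hdfs confirms that.
sudo -u hdfs hdfs dfs -ls /wmf/data/raw/webrequest/webrequest_maps/hourly/2016/05
```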
[08:36:20] nope nope : Not the DB, the partition head --> We partition by date, but before that by webrequest_source (text, misc, etc)
[08:36:58] ahhhhh okok sorry
[08:37:46] np :)
[08:37:48] anyhow, sorry for the diversion
[08:37:51] So, there is weirdness:
[08:38:41] in /wmf/data/raw/webrequest/webrequest_misc/hourly/2016/, permissions changed on 2016-03-23
[08:39:07] From 'drwxr-xr-x - hdfs hadoop', it goes to 'drwxr-x--- - hdfs hadoop'
[08:39:28] and this is on hdfs right? I was trying to find the command :P
[08:39:40] I think it's to cover for PII extraction if somebody sneaks in
[08:40:34] And we said: Normally people use refined webrequest, so there is a special group for that one, but nobody but us (almost) uses raw, so let's remove readability for all
[08:40:45] I think that's what we said, but we should ask andrew
[08:40:59] And the solution is: sudo -u hdfs beeline :)
[08:41:18] ah yes but I wanted to ask before using the hammer :D
[08:42:18] SELECT * FROM wmf_raw.webrequest WHERE webrequest_source = 'misc' AND year=2016 AND month=5 AND day=23 AND hour=0 LIMIT 5;
[08:42:27] Has worked for me impersonating hdfs
[08:42:48] elukey: Let me know if you need/want help :)
[08:43:22] yeah it works now
[09:42:11] m.hostname missing_sequence_runs
[09:42:12] cp3007.esams.wmnet 19
[09:42:12] cp3009.esams.wmnet 16
[09:42:12] cp3008.esams.wmnet 16
[09:42:12] cp1051.eqiad.wmnet 7
[09:42:38] something works! \o/ (copy/pasta from ottomata but still :P)
[09:43:10] elukey: would you have some time for me to make changes on new-aqs?
[09:43:21] joal: sure!
[09:43:25] cool :)
[09:43:28] batcave?
[09:43:45] elukey: actually, batcave in 5mins ;)
[09:43:49] okkk
[09:43:54] ping me when you are ready :)
[09:44:33] ready !
[09:44:37] Was not 5 mins :)
[09:47:38] elukey: --^
[09:47:52] joining :)
[11:00:48] elukey: kafka1013 is dying ?
[11:01:34] yeah super weird
[11:01:58] conntrack got maxed out
[11:02:09] yup, saw a huge peak
[11:02:23] the broker log size looks weird
[11:02:36] elukey: yes, was bigger than others
[11:03:13] elukey: Why are there peaks on conntrack?
[11:03:34] so from conntrack -L those are connections tracked from mw hosts
[11:03:40] that are now in TIME_WAIT
[11:03:49] we keep them for 65 seconds (the timeout of conntrack)
[11:03:57] if we exceed the max, we drop packets
[11:04:03] as it happened for kafka1013
[11:04:13] but also for the others
[11:05:49] number going from 100k to more than 220k in minutes is weird
[11:06:21] elukey: in contrast to the other kafka* hosts 1013 has the default of 256k (instead of the 512k that was set earlier)
[11:06:31] since it wasn't reset after the reboot
[11:06:35] when we upgraded to 4.4
[11:06:55] ah snap you're right
[11:06:59] I didn't notice it
[11:07:06] I checked only the timeout
[11:07:13] I'm fixing it
[11:07:33] thanks!
[11:08:39] I'll also prepare a patch to bump these for the kafka brokers
[11:08:52] they all went up to 300k
[11:09:03] where could I find the stats of conntrack usage again, was that in grafana somewhere?
[11:09:18] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=34&fullscreen
[11:09:30] with the variance we're seeing in kafka brokers (depending on usage patterns) 512k is the safer default
[11:10:01] yeah, that's what I suspected, it spiked for all of them
[11:10:52] moritzm, elukey: an explanation of why it bumps like that?
[11:11:34] no idea, kafka connections went up in general, so probably some internal kafka bug
[11:11:48] not familiar enough with kafka to actually make a judgement call...
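The check and runtime bump being discussed look roughly like the sketch below (the persistent value belongs in a sysctl.d/puppet change, which is what the mentioned patch is for):

```bash
# On a broker: compare current conntrack usage against the configured ceiling.
sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Bring kafka1013 in line with the other brokers at runtime (512k entries);
# without a matching sysctl.d/puppet change this is lost on the next reboot.
sudo sysctl -w net.netfilter.nf_conntrack_max=524288
```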
[11:11:54] from conntrack it seems that we were tracking mw connections, any chance that we saw a bump somewhere? [11:17:30] maybe a side effect of the collation scripts currently running? [11:18:27] sorry pasted in the wrong channel; https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-6h&to=now (analytics-eqiad) [11:19:07] the one that spiked seems to be MediaWiki Api action [11:19:33] followed by upload and text a bit [11:19:49] moritzm: not sure if it could be related, might be [11:21:54] !log executed kafka preferred-replica-election on kafka1013 [11:21:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [11:23:12] grafana is not tracking the correct status atm, I can see from kafka topics --describe that 1013 handles correctly a lot of partitions [11:23:15] mmmmmm [11:26:05] [27 May 2016 11:25:35] [ServerScheduler_Worker-5] 252960958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error [11:26:08] java.nio.BufferOverflowException [11:26:08] jmxtrans [11:26:09] mmmmmm [11:26:19] !log restarted jmxtrans on kafka1013 [11:26:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [11:26:50] ah yeah and now we have metrics again [11:28:06] Yay, I can see dataz! Thanks elukey :) [11:28:09] !log restarted jmxtrans on kafka10* hosts [11:28:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [11:28:18] weeeeeeird [11:31:57] [27 May 2016 11:25:35] [ServerScheduler_Worker-5] 252960958 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error java.nio.BufferOverflowException [11:32:06] ah snap not here sorry [11:32:13] https://gerrit.wikimedia.org/r/291211 [11:33:00] elukey: you're lucky, gehel is one of the upstream authors of jmxtrans :-) [11:33:18] I knooowww [11:33:23] I am chatting with him now :P [11:38:32] joal: all right emergency resolved, all good :) [11:38:42] awesome :) [11:38:47] Well handled elukey :) [11:38:48] going to open a phab task to track jmxtrans issues [11:38:52] sure [11:39:00] as always moritzm solved the problem :D [11:39:21] joal: already moved those files? [11:39:34] yes from etherpad [11:39:54] shall we restart cassandra? [11:40:06] indeed, cassandra start needed, then check data, then aqs start [11:41:20] all right, proceeding [11:41:25] (CR) Joal: [C: 1 V: 1] "Looks good to me, test in dry-run mode." [analytics/refinery] - https://gerrit.wikimedia.org/r/290639 (https://phabricator.wikimedia.org/T130123) (owner: Madhuvishy) [11:45:17] joal: cassandra is up [11:45:35] ok elukey, checking [11:46:35] elukey: nodetool-a cfstats on aqs1004 looks good (data in correct keysapce) [11:46:44] goooooooood [11:46:51] elukey: restart aqs? [11:46:57] elukey: Please :) [11:47:23] already done :) [11:47:37] hehehe [11:48:37] elukey: right, got an interesting error when testing [11:48:49] elukey: tried to run /srv/deployment/analytics/aqs/deploy/test/test_local_aqs_urls.sh [11:49:01] elukey: you can try it and you'll see [11:49:31] joal, elukey, team, hi! [11:49:37] Hello mforns ! [11:49:43] joal: mmmm maybe cassandra is still bootstrapping? [11:49:47] hello mforns ! 
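One way to see that directly on the broker is to group the tracked entries by source, something like this sketch (needs root; the awk field index assumes the usual conntrack -L tcp output layout):

```bash
# Top sources of TIME_WAIT entries in the conntrack table; mw* app servers
# showing up at the top would match the observation above.
sudo conntrack -L 2>/dev/null \
  | grep TIME_WAIT \
  | awk '{print $5}' \
  | sort | uniq -c | sort -rn | head -20
```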
[11:49:56] elukey: possible, but I don't think so [11:50:02] looks more of an aqs error [11:50:07] (like port is no good [11:50:24] joal, when you want and have time, I can update you on the data juggling [11:50:34] awesome mforns :) [11:50:39] let me know [11:50:42] give me a few minutes, then we'll do [11:51:18] no 7232 port opened :P [11:51:24] aqs error [11:51:37] elukey: looked in conf, should be 7232 [11:53:07] joal: now that I think of, we never tried if aqs was working [11:53:19] I think that we are missing some puppet config [11:53:43] elukey: ? [11:54:27] joal: did you try to run come curl commands before this morning on 100[456]? [11:54:41] elukey: no, I did not [11:55:16] yeah this is my point.. probably it wasn't working even before this morning :P [11:55:29] elukey: possible [11:55:49] elukey: maybe we can start launching aqs by hand and prevent logstash logging? [11:55:57] to watch what happens [11:55:58] ? [11:57:18] joal: I need to double check why it is not running, it might be something related to a line missing in puppet [11:57:39] elukey: ok, but remember we don't want to reenable puppet becasue of compaction change [11:58:14] joal: wouldn't be reverted only with a deploy? [11:58:25] elukey: I don't know [11:58:50] as a wise man would say [11:58:53] elukey: so far, compaction ok :) [11:58:54] marrfff [12:00:10] joal: mind if I step away to have lunch? I'll restart working on in in ~30 mins [12:00:46] elukey: np, need to take a break as well, we'll check that after [12:01:22] elukey: Just one thing: cassandra seems full (can access data with queries :) [12:01:26] elukey: SUCCESS ! [12:01:28] :) [12:01:31] gooooood [12:01:32] Have a good luncvh elukey [12:01:42] mforns: let's spend some time in da cave? [12:02:03] mforns: I feel like platon when saying that: ) [12:04:39] mforns: not here currently? [12:05:53] joal, back! [12:06:00] to the batcave :] [12:06:00] cool :) [12:20:06] * elukey back [12:30:59] * joal is AFK for a while [12:33:23] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333519 (elukey) [13:23:33] mroning!! [13:34:14] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333775 (Ottomata) We didn't upgrade to a newer JMXtrans because of a verbose logging bug. Buuut! It looks like it has been fixed? https://github.com/jmxtrans/jmxtrans/issu... [13:34:36] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333779 (Ottomata) > Why do we need to push to statsd rather than directly to graphite since jmxtrans does buffer for us? Good question! Perhaps we don't! [13:45:13] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333507 (fgiunchedi) the typical (only?) reasons for pushing to statsd is for aggregation across machines sending the metrics, or aggregation for a particular type of metric... [13:47:05] joal: mooorning [13:47:14] Analytics-Kanban, Patch-For-Review: Puppetize druid - https://phabricator.wikimedia.org/T131974#2333845 (Ottomata) I'm calling this done! There are going to be a lot of smaller follow up tasks, especially for monitoring. 
[13:49:13] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2333849 (elukey) Tried for the first time to query Hadoop data via Hive, so this info will need to be validated, but I run a script to find how many holes... [13:52:16] ottomata1: I'm waiting for joal but can't wait to load pageviews up :D [13:54:25] :) [13:54:25] cool! [13:54:32] excited for yall to try it to [13:54:36] want to watch and see how it works [13:54:45] i've got an apt. at 10:15, so i'll be out for a bit [13:54:49] back by standup for sure [13:59:10] * elukey afk for 30 mins! [14:50:07] * elukey back! [15:05:58] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334115 (elukey) Modified a bit the script to print host and timestamp related to the sequence number right before the hole. Here some snippets of the res... [15:18:10] joal: lemme know when you're back, I want to pick up the pageview loading stuff if you're busy [15:19:35] (i'm back!) [15:19:41] ottomata: o/ [15:20:13] if you have time later on can you tell if https://phabricator.wikimedia.org/T136314#2333849 makes any sense? [15:20:24] I am not sure that my queries are good [15:30:15] elukey: looking [15:32:13] the results are weird and I suspect that my Hive skills are none [15:32:26] even if I "took inspiration" from your queries [15:32:31] hmm, results do seem to be associated around the hour, ja? [15:32:40] haha they are not quite "mine", i think dan and christian wrote the originals [15:32:45] i adapted a little [15:32:48] for example, I can see "drops" for all the hours but oozie pings us once in a while [15:32:49] and now you have carried on the tradition :) [15:32:57] ahhhh okok :P [15:32:57] for all hours? [15:33:07] all the one that I tried, yes [15:33:10] hm [15:33:19] gonna try some queries [15:33:28] yeah thanks [15:34:18] elukey: do all misc hosts print out each time there is a hole? [15:34:42] nope only a subset [15:34:45] hm [15:34:49] but not a consistent subset? [15:34:50] hm [15:34:55] a consisten one [15:35:00] *consistent [15:35:06] and around the hour [15:35:36] but I didn't check the original queries that oozie uses [15:36:13] that's ok, oozie is just gonna report on percent missing [15:36:46] sum (size of each host's hole ) / sum(expected count of each host) [15:36:48] somethign like that [15:36:59] elukey: a consistent one? [15:37:09] meaning is it is always the same hosts that have holes? [15:37:20] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334176 (Nuria) > The loss seems to happen around the hour, but I don't have a good idea about the why (logrotate afaik happens daily). You probably have... [15:37:27] elukey: so, a cause of false positives that we've seen before [15:37:31] hmmm [15:37:33] is [15:37:44] the timestamp and sequence numbers are not always in order [15:38:01] i think the seq is generated as soon as varnishk sees the start of a req [15:38:08] and the timestamp is the time of response [15:38:12] so [15:38:13] sometimes [15:38:23] especially around hour boundries [15:38:31] sequence numbers can be bucketed into hours out of order [15:38:39] ahhhhhhhhh [15:38:43] that would make sense! 
[15:38:52] so, to verify this, see if you can find some of the missing seqs in the next hour [15:39:17] cp1061.eqiad.wmnet 2016-05-27T12:59:59 2514541 [15:39:17] might be a good one to look for [15:39:26] oh you don't know the size of that hole [15:40:00] maybe 2514542 is in hour 13 [15:40:03] gonna look too :) [15:43:29] but why only with Varnish 4 [15:43:31] ? what happened to perms on raw??? [15:43:51] ahhh joal checked this morning, you need to sudo to hdfs [15:43:57] they got restricted [15:44:15] 10:37 So, there is weirdness: [15:44:15] 10:38 in /wmf/data/raw/webrequest/webrequest_misc/hourly/2016/, permissions changed the 2016-03-23 [15:44:18] yeah but, uhhh [15:44:18] 10:39 From 'drwxr-xr-x - hdfs hadoop', it goes to 'drwxr-x--- - hdfs hadoop' [15:44:25] they should be like refined data [15:44:27] hdfs analytics-privatedata-users [15:44:31] 750 [15:44:42] I am only reporting what I know :P [15:45:08] ha [15:45:12] weird [15:45:44] oook, i'm going to change them all, new files are supposed to inherit the parent directories group and ownership [15:46:45] ottomata: so had an interesting issue come up lately that I wanted to drop a note for you on. the subject of managing a group that has human and service users. i.e. admin was never meant to do it. We basically did the following, had the humans defined and setup via admin module, had the service users setup and dfined via defined type in puppet, then using an instance of admin::groupmembers directly passed in a [15:46:45] array of all members for the //mixed group// with a metaparameter to depend on the setup of the service user itself [15:46:59] so anyways, it's a solution that may actually be...ok and not even a hack depending on your pov [15:47:13] ha, uhhh [15:47:16] so wait [15:47:26] instance of admin::groupmembers [15:47:30] don't sure i understand that [15:47:42] so [15:47:54] would admin::groupmembers have to be kept in sync with admin.yaml? [15:48:10] not sure* [15:48:34] technically I guess the edge case is [15:48:42] you pass in a user in the array that is no longer defined in admin yaml [15:48:49] we could have a check for that I guess [15:48:54] but atm yeah that's an edge case [15:49:16] hm, still not sure i understand, link to example? [15:49:28] heh sure I'll try to find the change brandon put up [15:53:14] milimetric: o/ [15:53:39] hey joal, early batcave? [15:53:51] already there ;) [15:54:18] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334236 (Ottomata) Oh ho ho, check this out. Looking at > `cp1061.eqiad.wmnet 2016-05-27T12:59:59 2514541` ``` ADD JAR /usr/lib/hive-hcatalog... [15:54:29] elukey: it is bad sequence bucketing indeed! [15:54:56] hdfs dfs -text /user/joal/pv_druid/2016-01-01/part-00031.gz | less [15:55:07] elukey: dunno why it would be just varnish4 [15:55:40] elukey: by any chance, have you had a look at aqs? [15:55:41] joal: are making the ts name? [15:55:42] "ts": "2016-01-01T00:00:00Z", [15:55:46] if so, could you use dt? [15:55:54] joal: I am a bad ops, didn't get time :( [15:56:33] elukey: please don't say that, let me do it ;) [15:57:58] ottomata: sure can do, does it change a lot for druid? [15:58:37] joal: could be a restbase thing, maybe if you don't set LVS? Not sure, I don't see node bond to any port.. [15:58:51] mwarf .... [15:59:06] Will you have a few minutes after standup to try a manual startup with logs ? 
[15:59:09] elukey: --^ [15:59:29] joal: no, but dt is what we have standardized on in eventbus for ISO 8601 fields [15:59:35] ts would be for an integer timestamp [15:59:40] riiiiiiight [15:59:56] ottomata: this was one-off for pageviews, but yes, I'll update my code :) [16:00:04] cool [16:00:13] joal: sure! [16:00:23] ottomata, elukey : standdupp [16:00:45] yepppp sorry chrome hates me [16:00:46] as always [16:30:29] ottomata: can I get added to druid100*? It asks me for a password when I ssh in [16:30:35] or should I just curl to it [16:30:51] oh yeah, that'd be fine if the port's open [16:33:09] https://plus.google.com/hangouts/_/wikimedia.org/a-batcave-2 [16:42:41] Hey, is there someone working on vital-signs available here? [16:48:21] ottomata: poke [16:48:32] ooh, jonas_agx hi! [16:48:50] what's up, I'm Dan, was telling you to come talk to us about vital signs [16:48:53] you said you had some questions [16:52:02] * milimetric is suspicious of his own visibility ... [16:54:11] milimetric: hi [16:54:15] hi! [16:54:18] talk in cave? [16:58:26] hi milimetric ! It'd be nice to click on the event in the timeserie [16:58:37] *events [17:00:27] Who is writing vital-signs? I'd like to contribute for the project if possible [17:01:30] I've created a project quite similar ages ago for pt-wiki -- also including relevant events for the community [17:09:03] jonas_agx: sorry, in the middle of something, I'll ping you very soon [17:14:11] a-team logging off! byeeeee o/ [17:14:19] elukey, bye! [17:16:06] ottomata, elukey: about the missing sequence numbers around the hour ... if the scripts reported missing messages at the change of the hour, I typically just ran the scripts for those two hours. If the problem vanished (no dupes, no missing) then it was just the issue [17:16:10] that ottomata described. [17:16:26] if the issue was still there, the dupes/missing sequence numbers were real. [17:16:55] qchris_: thanks! :) ja it seems it happens a little more frequently with varnish4 than varnish3 [17:17:16] Oh :-( [17:17:51] I wanted to automatize those boundary checks at some point ... but time's limited. [17:18:15] jonas_agx: what is what you would like to do? [17:18:23] jonas_agx: I did not quite understand [17:19:24] hi nuria_ the text for "events" through the timeserie has links but they are not clickable [17:20:24] jonas_agx: trying to look but not seeing it.. what text for events? [17:20:48] jonas_agx: ah , the annotations in the botom axis? [17:21:04] yes nuria_ sorry [17:21:44] jonas_agx: k, got it, it is a tolltip that displays but sure, it can easily be converted into a small overlay. you will see text if you mouseover [17:22:02] jonas_agx: do you have gerrit credentials? [17:22:21] yes I do nuria_ [17:22:48] jonas_agx: then project is on analytics/dashiki [17:23:06] jonas_agx: it's a client side app so you can get running in about 30 secs [17:23:26] jonas_agx: github mirror: https://github.com/wikimedia/analytics-dashiki [17:23:46] jonas_agx: client side as in "there is no server" just js/html/css [17:24:16] jonas_agx: https://wikitech.wikimedia.org/wiki/Analytics/Dashiki [17:24:18] nuria_, Thanks. Then the timeseries are just static files read somewhere? [17:24:41] jonas_agx: yes, the annotations are on meta [17:24:49] jonas_agx: they are just wikis on json format [17:25:26] nuria_, Great! 
Thanks, that's what I was searching for [17:25:30] jonas_agx: FYI that we will be moving this tool to a prod (not labs) domain soon but that should be transparent as the labs url will redirect [17:26:30] jonas_agx: k, let us know if you need help [17:26:52] jonas_agx: we use semantic-2 for css [17:28:19] jonas_agx: awesome, welcome onboard dashiki development, let us know here if you have any trouble, but we're happy to review patches [17:29:59] thanks nuria_ and milimetric -- patches :) [17:34:02] jonas_agx: for patches in gerrit, just add me to the reviewers (milimetric there as well) [17:36:29] milimetric: joal, for reference https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Ports#Druid [17:39:59] a-team: btw, Monday's a holiday [17:40:01] in the US [17:45:10] milimetric: right, i will be working [17:46:21] Analytics-Kanban: Announce analytics.wikimedia.org - https://phabricator.wikimedia.org/T136426#2334658 (Nuria) [17:46:45] (PS1) Nuria: Rename 'readers' dashboard to vital-signs [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/291284 (https://phabricator.wikimedia.org/T136426) [17:50:13] a-team, I will be working, too [17:51:01] i;m working too [17:54:48] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334727 (Ottomata) Ah, I was incorrect in my previous comment. The dt is the request timestamp, and the sequence number is not generated until the respon... [18:51:14] I won't be working, I'm going hiking and/or trying to get over this cold I just realized I have [19:31:22] milimetric: here ? [19:31:43] hi joal yes [19:31:47] heya :) [19:32:03] Wondered if you add kept the indexing task id for the currently running indexing [19:32:06] 82% I see [19:32:11] And if you could ask for status :) [19:32:13] yes, sec [19:32:15] ok :) [19:32:38] {"task":"index_hadoop_pageviews-hourly_2016-05-27T17:24:53.355Z"} [19:32:52] but the status from druid isn't super useful [19:33:00] arf, really ? [19:33:02] it'll just say running the more interesting one is the job in hue [19:33:15] ok makes sense [19:33:18] yep, if you have the tunnel: [19:33:18] http://localhost:8081/druid/indexer/v1/task/index_hadoop_pageviews-hourly_2016-05-27T17:24:53.355Z/status [19:33:29] in hue, it's in the job browser, if you look for user druid [19:33:40] milimetric: I've been monitoriung the thing: there have two jobs: one small-ish first, this big one after [19:34:15] yep, about 2.5 hours so far [19:34:22] yes [19:34:25] that's a good amount of time [19:34:34] It is, but we also load 3 month :) [19:34:42] yes, like one or so hours per month [19:34:56] it kind of makes sense if it took 30 minutes to dump the raw data [19:35:02] 1 hour to read and index it seems ok [19:35:11] which represent yup [19:35:18] oops, sorry [19:35:58] joal: but you can go relax, I can write to the analytics-internal when it's all done [19:36:13] milimetric: I WANNA PLAYYYYYY ! [19:36:30] milimetric: I've been playing with pivot on one day, want MOAAR ! [19:37:05] :) [19:38:11] heheh [19:38:13] :) [19:51:07] hey dudes (milimetric) [19:51:17] ssh -N stat1002.eqiad.wmnet -L 9091:stat1002.eqiad.wmnet:9091 [19:51:22] http://localhost:9091/ [19:52:43] admin/admin [19:53:06] hmm actually, hang on [19:53:09] need to fix something [19:54:47] airbnb data exploration platform that connects to Druid what?! 
[19:54:49] :P [19:54:54] :) [19:55:15] so sneaky, getting us all these goodies [19:55:36] yeah wasn't working though, re iinitialiing it, think i broke it [19:56:41] good find, gonna keep reading the docs [19:58:44] Hallo [19:58:56] hi aharoni [19:59:22] does it log all the clicks? [20:00:09] hm? does what log what clicks [20:00:12] grrrr, it isn't working GRRRR [20:00:15] at least not the exampels [20:01:07] You don't seem to have access to this datasource [20:01:07] grrr [20:11:39] grr not sure what is wrong with that [20:11:40] :/ [20:16:57] THERE It goes [20:22:37] milimetric: it works now [20:22:40] the example dashboards work [20:22:46] not sure what is up with the druid datasources [20:22:47] they say [20:22:49] yep, I was trying to edit the caravel datasource [20:23:03] Please define at least one metric for your table [20:23:03] I'm like setting it up and it seems to have some required fields but the only docs I found are: https://github.com/airbnb/caravel/blob/master/docs/druid.rst [20:23:15] hm [20:23:24] yeah, I defined a metric, but the metric has a required "type" which I'm not sure what it wants there [20:23:34] oooh! indexing finished :D [20:23:38] hi team, back [20:23:39] gonna test that for a sec [20:23:41] hey mforns [20:23:41] in caravel? or in druid datasource/ [20:23:42] hehe [20:23:43] ok [20:23:48] playing with pivot instead... [20:23:51] heyyy mforns! [20:24:04] ssh -N stat1002.eqiad.wmnet -L 9090:stat1002.eqiad.wmnet:9090 http://localhost:9090 for pivot [20:24:24] yeah, I won't play with caravel too much now, but it looks great [20:27:01] hm milimetric am I doing something wrong? i'm only looking in pivot [20:27:06] you loaded 3 months, ja? [20:27:17] yes, but it looks like maybe it's still processing the segments it just indexed [20:27:20] what do the logs say? [20:27:24] the indexing task says it's done [20:27:35] and I see some shards in the console [20:27:40] http://localhost:8081/#/datasources/pageviews-hourly [20:27:40] oh ja they are busy [20:28:12] oh yeah, when I reload, it looks like more shards show up [20:28:15] ja [20:28:26] we shoudl be careful here, since we are using the prod zookeeper [20:28:43] and druid seems to use zookeeper a lot, especially for data loading coordination [20:28:43] yeah, and it's weird, each day shows 2GB of data or so... [20:28:44] hm... [20:29:16] we could set up a separate zookeeper on each of the druid nodes [20:29:27] I mean separate cluster, on the 3 druid nodes [20:29:47] that way that cluster is more self-reliant and doesn't mess with kafka [20:30:15] ja thought about that too [20:30:33] there are some pretty annoying cdh vs debian zookeeper package issues though [20:30:37] work aroundable [20:30:41] hm, ok [20:30:46] any way to see if this is killing zookeeper? [20:30:53] am looking here [20:30:54] https://grafana-admin.wikimedia.org/dashboard/db/server-board [20:30:56] for conf1001 [20:31:03] there are more segments ready now, I just queried 3/11 -> 3/13 and I got data [20:31:05] i see more disk usage [20:31:08] but not too bad [20:31:17] load jumped some but still not huge [20:31:37] took about 6 seconds to get top 25 projects for the month [20:31:41] (March) [20:32:36] 16.7 Billion pageviews :) not bad [20:33:26] ehe, this thing needs some cache [20:33:49] hm, druid has some local query cachces [20:34:00] wonder if we can hook up to prod memcaches.... [20:34:46] what needs cache? 
[20:35:06] but this would only be internal for now [20:35:38] ja, a global query cache [20:35:45] i think druid has that option [20:35:47] we just aren't using it [20:35:54] yep [20:36:10] pivot is fun, seems to have grown up a lot since I last used it [20:44:16] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2335241 (Ottomata) Talked with @aaron about this in IRC. Apparently I am not misunderstanding `$oldBits` and `$newBits`... [20:44:50] hmm milimetric for some reason its stopped loading everywhere except druid1001 [20:45:29] maybe the load/procesing segment is a single task? dunno [20:45:33] oh [20:45:34] no, I still see the segments coming in [20:45:38] but they are loading into other nodes [20:45:38] it's doing Jan 20 now [20:45:44] i guess the task is just running on that one host [20:45:54] dunno why it only has one loading task [20:45:55] yeah, I guess each host would have 2/3s of the segments [20:46:16] AH [20:46:17] i take it back [20:46:20] logs just rotated :p [20:46:28] so my tail wasn't printing them all out [20:46:30] I see [20:46:38] this is where I've been monitoring: http://localhost:8081/#/datasources/pageviews-hourly [20:46:46] coool [20:46:49] it's pretty informative, that timeseries graph [20:47:02] because it shows the size druid knows about of each segment [20:47:08] huh intresting, and its backwards in time? [20:47:09] so you can see if any segments loaded partially or whatever [20:47:11] yes [20:47:20] over the range it knows it should have [20:47:33] each day is different because basically the values zip differently [20:47:41] but no day should be too far from the avg [20:47:59] which looks like 2G or so [20:48:02] wow that's a lot though... hm [20:48:42] 700G or so per year... mmm maybe that's ok actually [20:48:49] but without title... hm [20:49:04] wonder why it was able to squish to only 64MB for the first day we loaded [20:53:31] ottomata: ok, all done [20:53:42] yeah, I think this is definitely faster than pageview_hourly, by like... a lot :) [20:53:55] gonna send a message so people can play [20:59:47] COOOL [21:06:33] ahh cool milimetric [21:06:37] got caravel to work for druid data [21:06:50] oh sweet [21:06:53] the metrics field starts filled in with a bad thing [21:06:55] you gotta x it out [21:06:59] and just select count(*) [21:07:20] ah cool, I'll check that out and mention it in my email [21:08:11] ok gotta run [21:08:14] see yaaaa [21:08:16] have a good weekend! [21:08:17] this is cool! [21:17:58] Analytics-General-or-Unknown: http://reportcard.wikimedia.org/ - redirect and delete old stuff - https://phabricator.wikimedia.org/T71625#2335352 (Dzahn) The old configs are meanwhile gone from stat1001 (pretty sure @elukey cleaned it up during T76348) Looks like this is all gone now. and the old URLs are... 
[21:18:19] Analytics-General-or-Unknown: http://reportcard.wikimedia.org/ - redirect and delete old stuff - https://phabricator.wikimedia.org/T71625#2335358 (Dzahn) Open>Resolved a:Dzahn [21:18:55] milimetric, I will finish the user code and then, if I have time look into the php thing [22:07:15] (PS1) Amire80: Add a script for processing interlanguage links stats [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/291358 [22:15:10] (PS2) Amire80: Add a script for processing interlanguage links stats [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/291358 (https://phabricator.wikimedia.org/T135584) [22:16:08] (PS3) Amire80: Add a script for post-processing interlanguage links stats [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/291358 (https://phabricator.wikimedia.org/T135584)