[07:21:53] Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2202237 (elukey)
[07:22:18] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[07:22:31] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[07:22:35] good morning :)
[07:22:55] auto resolved, but this is the third analytics host showing high temp
[07:23:00] updated the phab task
[07:59:09] elukey: Hi !
[08:05:03] joal: helllooo
[08:22:03] PROBLEM - Hadoop NodeManager on analytics1054 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[08:24:43] 2016-04-13 08:15:06,466 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
[08:24:46] java.lang.OutOfMemoryError: GC overhead limit exceeded
[08:24:47] joal --^
[08:26:28] weird elukey :(
[08:27:22] and it wasn't restarted automagically, could be something weird in the upstart config
[08:28:23] mwarf
[08:40:16] elukey: I restarted the failed refine job
[08:41:03] from hue?
[08:41:07] yes
[08:41:14] * elukey finally got one right
[08:41:20] :)
[08:41:48] was it related to the 1054 failure or totally unrelated? (I saw that was coming from 1015)
[08:42:17] elukey: I have not double checked
[08:49:44] elukey: job failed at about the same time as namenode error - could well be related
[08:51:29] super weird, theoretically the yarn daemon should be restarted straight away
[08:52:08] :(
[08:52:41] I mean, "the correct thing would be", meanwhile our config might not be doing it.. I'll open a phab task
[08:52:52] cool elukey
[08:55:01] 10:52 PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:06] one minute downtime for aqs
[08:55:12] I mean, aqs1002
[08:55:18] this is the second time that I see the problem
[08:58:23] https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase --> tons of errors like the ones we were seeing :(
[09:02:19] ah joal now I understand what you were saying
[09:02:20] READ messages were dropped in last 5000 ms: 5 for internal timeout and 0 for cross node timeout
[09:02:26] from https://logstash.wikimedia.org/#/dashboard/elasticsearch/analytics-cassandra
[09:07:30] elukey: batcave?
[09:09:22] elukey: aqs1002 seems back to normal
[09:10:00] (PS1) Addshore: Fix java package names [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283136
[09:10:27] (CR) Addshore: [C: 2 V: 2] Fix java package names [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283136 (owner: Addshore)
[09:11:02] joal: not needed don't worry (but thanks!), also aqs1003 experienced problems
[09:11:20] all related to cassandra timeouts as you were saying, but now I got why and how to find it :)
[09:11:40] elukey: huge peak of requests
[09:12:44] yeah.. :(
[09:13:16] (well 25 req/second is not that huge)
[09:13:19] :P
[09:14:54] elukey: 25reqs/s is big when it spans 2TB of dfata on rotating disks;)
[09:15:09] good point :D
[09:16:38] moar SSDs!
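For context on the AQS endpoint timeouts above: each per-article pageview request covers a date range that Cassandra has to assemble from data spread over rotating disks, which is why even 25 req/s hurts. Below is a minimal probe sketch against the public per-article REST endpoint; the article and date range are illustrative, and the internal host/port the Icinga check actually hits may differ.

    # Sketch: time one per-article pageviews request, Icinga-style 10s timeout.
    import time
    import requests  # assumes python-requests is available

    URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/20160401/20160410")

    start = time.time()
    resp = requests.get(URL, timeout=10)
    print(resp.status_code, "in %.2fs" % (time.time() - start))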
[09:19:09] elukey: indeed, waiting for SSDs eagerly
[09:44:55] (PS1) Addshore: Add pre & post processing to Processor interface [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283143
[09:45:46] (CR) Addshore: [C: 2 V: 2] Add pre & post processing to Processor interface [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283143 (owner: Addshore)
[10:04:40] (PS1) Addshore: Use double instead of long in MetricProcessor [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283146
[10:05:55] joal: analytics 1043 and 1049 showed the same Yarn failure :(
[10:06:08] (PS1) Addshore: New build [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283148
[10:06:21] elukey: maaaaan
[10:06:21] (CR) Addshore: [C: 2 V: 2] Use double instead of long in MetricProcessor [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283146 (owner: Addshore)
[10:06:24] This is ops day
[10:06:24] (CR) Addshore: [C: 2 V: 2] New build [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283148 (owner: Addshore)
[10:09:37] (PS1) Addshore: Fix finding processor classes [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283149
[10:09:39] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202590 (elukey)
[10:10:11] opened a phab task --^
[10:10:14] (PS1) Addshore: Fix finding processor classes [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283150
[10:10:27] (CR) Addshore: [C: 2 V: 2] Fix finding processor classes [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283149 (owner: Addshore)
[10:10:30] (CR) Addshore: [C: 2 V: 2] Fix finding processor classes [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283150 (owner: Addshore)
[10:13:21] aaaaannnnd we have a correspondent oozie error :)
[10:14:17] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2202603 (elukey)
[10:17:16] Analytics, Operations, Traffic, Patch-For-Review, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202604 (ema) One more: Feb 11 09:17:47 cp4010 varnishstatsd[2820]: Traceback (most recent call la...
[10:24:10] Analytics-Kanban: Puppet on stat1003 keeps failing for git errors - https://phabricator.wikimedia.org/T132445#2202621 (elukey) p:Unbreak!>Normal
[10:24:47] Analytics, Operations: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2202623 (elukey) p:Triage>Low
[10:27:34] Analytics-Kanban, Operations, Traffic, Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (elukey) Installed on maps hosts by @ema, we will rollout the new version everywhere along wiht the Varnish 4 upgrade.
[10:42:06] Analytics, Operations, Traffic, Patch-For-Review, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2202664 (ema) And this one: Mar 18 12:41:35 cp4010 varnishstatsd[10396]: Traceback (most recent ca...
[11:16:58] 13:16 PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[11:17:04] this starts to be a big problem
[11:17:25] raaaaah
[11:18:55] elukey: any idea from where it could come from ?
[11:20:09] joal: didn't get the time to investigate, I'll do it right after lunch.. I'll probably need to have a chat with you about how to track it down
[11:20:32] elukey: I actually have no idea currently
[11:20:39] Let's talk when you get back
[11:21:29] So the nodemanager is running with -Xmx1000m
[11:22:01] $maybe something more is needed for $reason
[11:22:16] rmh
[11:23:45] * elukey imagines Joseph's face mumbling
[11:26:19] :)
[11:53:07] (PS1) Addshore: Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154
[11:55:11] (PS2) Addshore: Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154
[12:05:12] (PS1) Addshore: New CLI arg processing [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283157
[12:19:44] * elukey loves http://animagraffs.com/
[12:34:45] * joal thanks elukey for such a good link !
[12:46:14] Analytics-Tech-community-metrics: Tech community KPIs for the WMF metrics meeting - https://phabricator.wikimedia.org/T107562#2202939 (Aklapper)
[12:46:17] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2202937 (Aklapper) Open>Resolved Please continue discussing bring...
[12:46:39] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2202953 (Aklapper)
[12:47:46] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016): Identify Wikimedia's most important/used info panels in korma.wmflabs.org - https://phabricator.wikimedia.org/T132421#2202956 (Aklapper) p:Triage>High
[12:50:03] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args
[12:59:28] so it might be related to jobs causing OOM errors, but not sure why the namenode itself shutsdown
[13:08:45] elukey: jobs failing should not impact nodemanager
[13:12:03] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203044 (elukey) From https://grafana.wikimedia.org/dashboard/db/server-board I don't see major memory problems at host level, a lot of GBs are simply cac...
[13:12:57] joal: yeah.. the weird thing is that from the logs the nodemanager shutsdown right after receiving the OOM, not sure if from a mapper or caused by itself
[13:13:26] elukey: my guess is that it is an internal error to the nodemanager
[13:22:05] elukey: looking at ganglia, it seems the cluster have been overloaded between 11:45 and 12:00
[13:22:09] UTC
[13:25:05] GOOD MORNING
[13:25:30] Hi ottomata !
[13:25:46] joal: https://phabricator.wikimedia.org/T102954 :P
[13:25:49] hello ottomata !!
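A note on the "-Xmx1000m" observation above: one quick way to confirm what heap the NodeManager on a worker is actually running with is to pull the -Xmx flag out of the live java process. A rough sketch, to be run locally on a worker node; the process matching is approximate and not the exact command elukey used.

    # Sketch: report the -Xmx flag of the running YARN NodeManager process.
    import re
    import subprocess

    ps = subprocess.check_output(["ps", "-eo", "args"]).decode("utf-8", "replace")
    for line in ps.splitlines():
        if "org.apache.hadoop.yarn.server.nodemanager.NodeManager" in line:
            m = re.search(r"-Xmx(\S+)", line)
            print("NodeManager heap:", m.group(1) if m else "no -Xmx flag found")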
[13:26:16] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203092 (elukey) Another ticket was opened for the same thing https://phabricator.wikimedia.org/T102954
[13:26:50] Analytics-Kanban, Operations: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2203094 (elukey) Open>Resolved p:Triage>Normal
[13:27:23] Analytics-Cluster: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#1378495 (elukey) Opened https://phabricator.wikimedia.org/T132559 before checking phab, today multiple hosts showed this behavior and I had to restart Yarn manually
[13:28:13] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2203101 (Qgil)
[13:32:02] Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203112 (elukey) ``` elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq -c' analytics1041.eqiad.wmnet: analytics1...
[13:33:02] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203113 (elukey) p:Triage>High
[13:34:33] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203116 (Ottomata) This clearly doesn't happen often. In lieu of not knowing what causes this, it wouldn't hurt to increase NodeManager memory a little, eh?
[13:38:01] hm, elukey just saw another one of those, eh?
[13:40:59] (CR) Addshore: [C: 2 V: 2] Real CLI option parsing [analytics/wmde/toolkit-analyzer] - https://gerrit.wikimedia.org/r/283154 (owner: Addshore)
[13:41:02] (CR) Addshore: [C: 2 V: 2] New CLI arg processing [analytics/wmde/toolkit-analyzer-build] - https://gerrit.wikimedia.org/r/283157 (owner: Addshore)
[13:41:46] ottomata: the weird thing is that (as you were noticing in your task long time ago) I can't figure out where the OOM comes from and why the node manager shuts down
[13:42:47] hm logs on 1056 maybe look al ittle different?
[13:42:54] or, maybe i just didn't look at them all on the other nodes well
[13:43:05] The directory item limit of /var/log/hadoop-yarn/apps/hdfs/logs is exceeded: limit=1048576 items=1048576
[13:43:26] that's > 10 mins before the alert
[13:44:30] mmmmmm
[13:45:14] shall we increase the -Xmx option for the yarn namenode to patch the problem?
[13:45:15] elukey: i'm thinking there must be some particular job behavior that causes this
[13:45:25] i'm suspicious of this
[13:45:26] http://localhost:8088/cluster/app/application_1458129362008_86444
[13:45:35] since it is seems to be mentioned near the problem in teh logs...
[13:45:43] probably.. but I wouldn't have expected the nodemanager to go down
[13:45:44] but, that could just be an artifact
[13:45:52] ja me neither, but maybe some weird bug
[13:45:57] ottomata: yurik has a hell lot of jobs lately
[13:46:15] ja i'd like to know what this is
[13:46:24] ottomata: another thing to mention about logs: we log yarn at debug level!!!
[13:46:24] 1009 reduces
[13:46:34] oh?
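The "directory item limit ... exceeded" message above is the HDFS NameNode's per-directory child limit (dfs.namenode.fs-limits.max-directory-items, whose default is the 1048576 shown). A quick way to see how close the aggregated YARN log directory is to that limit is sketched below; the path comes from the error message itself and listing a directory this large is slow, so this is only a one-off check.

    # Sketch: count direct children of the aggregated YARN log dir in HDFS
    # and compare against the NameNode's max-directory-items limit.
    import subprocess

    LOG_DIR = "/var/log/hadoop-yarn/apps/hdfs/logs"  # path from the error message
    LIMIT = 1048576                                  # dfs.namenode.fs-limits.max-directory-items

    out = subprocess.check_output(["hdfs", "dfs", "-ls", LOG_DIR]).decode()
    # 'hdfs dfs -ls' prints a "Found N items" header, then one line per child.
    items = max(len(out.splitlines()) - 1, 0)
    print("%d / %d items (%.1f%% of limit)" % (items, LIMIT, 100.0 * items / LIMIT))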
[13:46:40] I see INFO
[13:46:41] ottomata: 6 concurrent jobs (~2500 maps and 1009 reduces)
[13:46:47] ah
[13:47:16] ottomata: Ah... When doing cassandra testing, couldn't get rid of debug
[13:48:27] joal, elukey: what did we decide regarding an upgrade to Cassandra 2.2.5? or maybe, what's next there?
[13:49:11] urandom: Arrg, forgot to ask the wider team :S
[13:49:18] urandom: I deeply apologize
[13:49:30] joal: no worries :)
[13:49:40] urandom: I'm making a note and you'll have an answer later on today, after our standup
[13:49:45] Analytics-Cluster, Analytics-Kanban: Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors - https://phabricator.wikimedia.org/T102954#2203156 (Ottomata) Seeing some interesting logs on an56 which had a NodeManager just die. Not sure if related: ``` Caused by: org.apache.hadoop.ipc.RemoteExce...
[13:49:54] auh, then my timing is excellent then.
[13:49:59] joal: oh you mean for the yarn job logs itself?
[13:50:03] the ones that are stored in hdfs after job completion?
[13:50:11] ottomata: correct sir
[13:50:11] not the yarn daemon process logs
[13:50:13] aye
[13:50:14] hm ok
[13:51:16] urandom: atm we are seeing tons of errors from cassandra timeouts (causing also problems to aqs restbase) and I'd prefer to get it resolved (waiting for SDDs) before proceeding with an important upgrade. This is of course something said from somebody with little context in AQS, so not sure if it makes sense :)
[13:52:33] elukey: very good point
[13:52:39] elukey: yea let's up the yarn heapsize
[13:53:01] elukey: ok, i understand
[13:53:03] the way it is puppetized, we'll increase the resourcemanager heap size too, but that won't hurt at all
[13:53:15] plenty of mem on 01 and 02 for that
[13:53:20] ->2048?
[13:53:33] elukey: do you have an eta on that? are you replacing the nodes, or replacing the disks in the existing nodes?
[13:54:28] ottomata: ack! 2048 looks good
[13:54:51] joal: do you have more info about the SSDs? I know we discussed it but I don't recall the ETA :(
[13:55:16] elukey, urandom : ottomata is the one knowing most here
[13:56:41] https://phabricator.wikimedia.org/T124947
[13:56:58] https://phabricator.wikimedia.org/T132067 (not sure if you can see that one)
[13:57:38] I think they are ordered...
[13:57:42] ya
[13:57:46] we are replacing the nodes with SSDs
[13:57:52] ETA has been asked 8 days ago ...
[13:58:01] yeah :(
[13:58:06] joal i think that q is stale, see the linked tickets
[13:58:07] can we get an eta on the eta?
[13:58:15] * urandom jokes
[13:58:20] bumping
[13:58:20] :D
[13:58:34] Thanks a lot ottomata (I don't have access to the second ticket)
[13:58:44] Analytics, Operations, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203165 (Ottomata) @Robh, over in T132067 it looks like these nodes were ordered, is this correct? If so, any idea on ETA?
[13:58:46] are you planning to first bootstrap the new nodes, and then decommission the existing ones?
[13:59:19] hey joal elukey, since we've missed it for a couple of weeks, want to do analytics ops sync up today?
[13:59:24] maybe 30 mins before standup?
[13:59:26] urandom: TBD with elukey and ottomata, my guess would be to replace old by new one by one
[13:59:33] ottomata: sure !
[13:59:38] ottomata: +1
[13:59:43] joal: it would be better if you could first bootstrap in the new ones
[13:59:59] urandom: bootstrap = copy data?
[14:00:17] bootstrap == join a node to the cluster (and copy over data)
[14:00:28] urandom: I would have left cassandra deal with data replication, but it might not be best
[14:00:40] urandom: Makes sense
[14:00:41] joal: not sure what you mean
[14:01:24] urandom: add the node to the cluster as a new node (4 nodes in cluster, possible new virtual DC to facilitate copy), copy dat using replication, then replace
[14:01:39] and not remove/replace and assume copy would work
[14:01:51] makes sense (to me) urandom
[14:01:59] i would stand up the new nodes, and bootstrap, bootstrap, bootstrap
[14:02:08] urandom: ok
[14:02:09] then decommission, decommission, decommission
[14:02:50] urandom is a sequential island in today's async world
[14:02:50] you'll get one stream for each source/dest pair, and often the throughput of that stream is cpu-bound on a single thread
[14:02:50] :D
[14:03:02] i.e. if you're using compression
[14:03:18] by having more nodes, you'll be adding stream concurrency
[14:03:25] and it'll move faster
[14:03:28] urandom: ok
[14:03:47] until you get to the last decommission of course, which will move at the same speed
[14:03:56] urandom: what about DTCS compression, how does treplication handle the thing ?
[14:04:13] joal: not sure what you mean
[14:04:28] urandom: about orders, does replication keep it ?
[14:04:33] yes
[14:04:40] (i guess so, but prefer to ask)
[14:04:43] yeah ok
[14:04:59] yeah, data that is streamed over gets pushed back through the compaction pipeline
[14:05:45] urandom: is there a way to manage the order of data streamed ?
[14:06:06] joal: nope
[14:06:09] urandom: Since we have backfilled data, there are some misordering in the current DTCS
[14:06:13] urandom: ok :)
[14:06:35] it's entirely possible DTCS isn't working the way you expect anyway
[14:06:40] it's not for us
[14:07:00] https://people.wikimedia.org/~eevans/20140401_to_20140404.svg
[14:07:08] mobrovac: ^^^ dunno if you saw this
[14:08:25] joal: when i get the time, i plan to look into an alternative (TimeWindowCompactionStrategy), that is probably a better fit
[14:08:58] urandom: ok, let me know when you start, I'd love to know more
[14:10:26] that's some strange compaction indeed
[14:10:52] mobrovac: tl;dr there is only one data tier, encompassing everything, evar
[14:11:03] mobrovac: or put another way, we're using STCS
[14:11:15] it seems so, yeah
[14:11:29] wasn't dtcs supposed to bound them?
[14:11:36] yeah
[14:11:44] it's probably read repair
[14:12:25] ok, but it should abide the same rules, right?
[14:12:44] what rules?
[14:13:08] compaction rules
[14:13:25] in my mind, dtcs should say "ok, that's enough, move on"
[14:13:32] a read repair that dredged up old data to repair a node would introduce that old write into a new sstable
[14:13:59] creating an sstable with a min timestamp from far in the past, and max timestamp of basically 'now'
[14:14:17] that would be its window, and would make it a candidate to merge with similar sstables
[14:14:19] ottomata: I was about to ask you a question for the heap size but I saw the code reviews :)
[14:14:26] that's what you can see happening in that graph
[14:14:30] ah sorry
[14:14:30] ja
[14:14:37] ah
[14:14:42] shoulda told you i was on it
[14:14:42] very nice that we have everything in hiera!
[14:14:45] ja!
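To make the read-repair effect urandom describes above concrete: DTCS groups SSTables into windows by their min/max write timestamps, so a single repaired old write landing in a freshly flushed SSTable gives that table a window stretching from the old timestamp to "now", and it then overlaps every other window. A toy illustration with made-up timestamps, not the real compaction code.

    # Toy model: each sstable is (min_write_ts, max_write_ts). Overlapping
    # windows are merge candidates, so one "wide" table drags old and new
    # data back into the same compaction.
    from datetime import datetime, timedelta

    now = datetime(2016, 4, 13)
    day = timedelta(days=1)

    # Healthy DTCS-ish layout: disjoint, time-ordered windows (illustrative values).
    sstables = [(now - 30 * day, now - 20 * day),
                (now - 20 * day, now - 10 * day),
                (now - 10 * day, now)]

    # A read repair writes a year-old cell into a brand-new sstable:
    repaired = (now - 365 * day, now)

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    print([overlaps(repaired, s) for s in sstables])  # -> [True, True, True]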
[14:15:41] mobrovac: if you look at the green nodes at the very top of the graph, notice all the ones what have a timestamp interval of almost a year
[14:16:14] those are candidates for compaction with brand new tables, *and* very old tables, they effectively flatten the entire structure
[14:16:50] (the date-tiered structure, i mean)
[14:17:05] yeah, they get swallowed basically
[14:18:47] milimetric: here ?
[14:18:57] yep
[14:19:00] what's up
[14:19:14] quick review of pageview project doc
[14:19:17] milimetric: --^
[14:19:28] what about: If you want to filter by project, use the domain of any Wikimedia project, for example 'en.wikipedia.org', 'www.mediawiki.org' or 'commons.wikimedia.org'. If you are interested in all pageviews regardless of project, use all-projects.
[14:19:44] joal: perfect
[14:19:48] cool :)
[14:19:57] Update, then PR !
[14:21:17] ottomata: also, if you have time https://gerrit.wikimedia.org/r/#/c/282936/
[14:21:23] joal, elukey: https://phabricator.wikimedia.org/P2892
[14:21:25] so I'll start the puppet agent :)
[14:21:40] sstables/read is probably what is killing your read latency
[14:21:53] urandom: woa!
[14:21:55] on rotational disks, that is a lot
[14:22:03] yay thank you elukey!
[14:22:06] +2
[14:22:17] urandom: yeah, even without exact numbers, we knww it was that
[14:22:29] elukey: you think we should restart all NMs 1 by 1, or just wait until they need restart to apply that?
[14:22:32] joal: that is compaction doing a poor job
[14:22:48] urandom: there are almost 2TB of data to get accessed randomly ...
[14:23:00] * mobrovac wonders if there exists a cass prod install that's not on SSDs
[14:23:07] urandom: aggreed, it could do a better job
[14:23:08] joal: yeah, compaction is meant to make that much less random
[14:23:10] apart AQS, ofc
[14:23:10] :P
[14:23:29] * joal turns his back to mobrovac
[14:23:29] mobrovac: tons
[14:23:32] :D
[14:23:36] haha
[14:24:20] urandom: backfill has probably broken the DCTS I imagine ...
[14:24:22] ottomata: mmmm I thought that puppet would have done it for us, but probably no.. so I'd like to restart them now since a lot of hosts alarmed today :(
[14:24:48] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2203239 (Milimetric) I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging.
[14:25:13] no, it won't do it
[14:25:23] we don't let puppet refresh those
[14:25:32] ok elukey i'll do them one by one
[14:25:44] def works
[14:25:45] -Xmx2048m
[14:25:46] joal: can you paste the output of this somewhere? -- /home/eevans/compaction/print_sstables_info /var/lib/cassandra/data/local_group_default_T_pageviews_per_article_flat/data
[14:25:48] on nodemanager proc
[14:25:49] \o/
[14:25:52] * urandom is without root
[14:25:57] ottomata: if you want we can split
[14:26:30] urandom: I'm not root either
[14:26:32] elukey: naw s'ok, i'll just script it with a 3 minute sleep in between
[14:26:43] elukey: can you answer urandom request?
[14:28:20] actually...
[14:28:24] i can sudo -u cassandra
[14:28:39] elukey: hm, we used to have salt role grains we could easily target, i don't htink they work anymore
[14:28:44] i might make some grains now...:)
[14:28:50] * elukey likes grains
[14:28:56] hmm maybe
[14:29:14] urandom: I can run the command but should it be run on aqs100x?
[14:29:31] Analytics, RESTBase, Services, User-mobrovac: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#2203247 (Milimetric) I think there are no real blockers to doing this other than the most likely painful bikeshed. So we can...
[14:30:14] elukey: nevermind, i got it: https://phabricator.wikimedia.org/P2893
[14:30:33] elukey, joal: that is really awful btw
[14:30:52] urandom: I could have guessed that :(
[14:31:26] there is a 581G file in there!
[14:31:39] urandom: Yayyyyyy :)
[14:31:45] Hurray for big fillllles
[14:32:12] I am very ignorant but what causes an sstable to grow like that?
[14:33:14] elukey: DTCS is supposed to create windows by (write) time, the idea being, that once the tables in a window reach a certain age, they'd stop being compacted with one another
[14:33:19] (or very rarely)
[14:33:22] elukey: compaction, which is cassandra's process of consolidating and ordering data
[14:33:34] old tables would get merged with old tables, new tables with new tables, etc
[14:34:22] elukey: here, it looks like out of order writes are doing to aqs, what they're doing to restbase, flattening out data tiers so that all data gets merged with all data
[14:34:35] Analytics-Kanban: Fix number formatting in charts - https://phabricator.wikimedia.org/T132579#2203281 (Milimetric)
[14:34:47] though, you can almost discern two tiers here, like the out-of-order writes are quite as routine as they are for us
[14:34:55] aren't, that is
[14:35:32] adding SSDs is a pretty brute-force approach
[14:36:12] urandom: Is there a way to tell cassandra about a timestamp info of of the it using the write one ?
[14:36:39] for cassandra to compact time-related data together whatever time it is written
[14:38:01] joal: not sure what you mean, do you mean can you tell compaction to use an alternative time?
[14:38:18] !log restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size
[14:38:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[14:38:23] urandom: correct
[14:38:39] joal: nope, it's write time, everything else would be schema-specific
[14:39:01] joal: you can supply write time though
[14:39:02] urandom: ok
[14:39:07] you can specify it when making a write
[14:39:10] urandom: really?
[14:39:17] didn't know that
[14:39:28] yeah, but that's probably not going to help here
[14:39:41] ok
[14:40:01] urandom: the shame is that our data is very evenly partitionned based on time (daily)
[14:40:22] It could be easy to compact, let's say, monthly, and keep it like that
[14:40:25] DTCS makes the assumption that you access your data in order, and that that order matches the order it was written in
[14:40:38] joal: yeah, you need TWCS
[14:40:46] urandom: keep asking trivial questions - the motivation for the DTCS windows should be that new values will be found in more recent sstables or even in the memtable
[14:41:01] then you could game write-time when backfilling to preserve locality
[14:41:22] urandom: yay, that would sense indeed
[14:41:43] elukey: yes, as it merges smaller tables to create larger once, it does so ordered by writetime
[14:41:49] urandom: What about existing data - only way would be to reinsert, right ?
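What "you can supply write time" refers to: CQL lets the client set the cell write timestamp explicitly with USING TIMESTAMP, so a backfill job could stamp rows with the time the data is about rather than the wall-clock time of the insert, which is the "game write-time when backfilling to preserve locality" idea above. A hedged sketch with the DataStax Python driver; the host, keyspace, table and columns are made-up placeholders, not the real AQS schema.

    # Sketch: backfill with an explicit write timestamp (microseconds since epoch),
    # so compaction's notion of "write time" matches the data's logical time.
    import calendar
    from datetime import datetime
    from cassandra.cluster import Cluster  # assumes the DataStax python driver

    session = Cluster(["aqs-host.example"]).connect("example_keyspace")

    logical_day = datetime(2015, 7, 1)
    write_ts_us = calendar.timegm(logical_day.timetuple()) * 1000000

    session.execute(
        "INSERT INTO pageviews_example (article, day, views) "
        "VALUES (%s, %s, %s) USING TIMESTAMP " + str(write_ts_us),
        ("Example_article", logical_day, 1234),
    )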
[14:42:28] elukey: if your data is ordered by time, and if you are writing in time-order, then that will make queries answerable in the fewest possible tables, because the data is contiguous
[14:42:52] elukey: newer results in new tables, older results in older tables
[14:43:06] joal: no, you can recompact
[14:43:30] urandom: but recompact would be based on write time :S
[14:43:30] joal: assuming a move to TWCS, where recompacting would separate into time-based windows
[14:44:05] joal: in DTCS, the damage is done
[14:45:20] joal, elukey: the other thing you could do here is host more Cassandra nodes
[14:45:50] more nodes, with a lower node density would bring down the sstables/read too
[14:46:13] that probably means running multiple instances of Cassandra (which is what we do on the RESTBase cluster)
[14:48:44] ah nice...
[14:49:00] Analytics-Cluster, Operations, Traffic, HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203363 (BBlack) I don't see any mixed content in simple checks (and we checked/fixed that in a much earlier ticket: (T93702). Since this site is clearly for human c...
[14:56:28] hey joal, cmjohnson swapped the faulty disk on aqs1001
[14:56:48] we haven't been able to figure out how to make the OS recognize the hot swapped device
[14:56:55] when we've done this before we've rebooted nodes
[14:57:01] so, i think we need to reboot aqs1001
[14:57:12] ottomata: no problem for me
[14:57:17] waaaiiitttt
[14:57:27] will coordinate!
[14:57:31] elukey: unless you know how to make it show up!
[14:57:38] we need /dev/sdh
[14:57:40] back
[14:57:42] ottomata: have you killed/restarted namenodes or something ? there are lot of jobs errors :(
[14:57:48] the last time we stopped cassandra on aqs1001 the pageview API stoppeD!
[14:57:53] ottomata, joal --^
[14:57:54] joal: i am restarting one nodemanager every 3 mins
[14:58:18] ottomata: ok, jobs have failed (see alerts)
[14:58:22] I think we fixed the quorum problem, but we need to test it :D
[14:58:42] so please before rebooting just stop restbase and cassandra with a nodetool drain before
[14:58:47] joal: i see one refine job
[14:58:48] fail?
[14:58:49] ja?
[14:58:49] and see what happens
[14:59:11] haha, elukey i'm not rebooting it yet!
[14:59:14] coordinating with yall
[14:59:34] k ottomata
[14:59:51] ottomata: on hadoop, 3 load and 1 refine failed
[15:00:02] !log restarting failed jobs
[15:00:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[15:00:11] ottomata: I knooowww just wanted to tell you
[15:00:54] hehe :)
[15:02:26] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2203442 (elukey) stat1004 is ready for a test! @Ottomata would you be the first one to try it? :P
[15:02:30] joal: hmm, ok, sorry about that
[15:02:44] ottomata: ok, restart, done :)
[15:02:49] there are still more restarting, this shouldn't cause failed jobs though...although
[15:02:55] i see some of the restarts say
[15:03:02] nodemanager did not stop gracefully after 5 seconds: killing with kill -9
[15:03:17] ottomata: might be the thing
[15:03:39] elukey: stat1004 works for me!
[15:03:53] ottomata: happened to me too
[15:04:00] elukey: is there a ticket open for these timeouts/high latency?
[15:04:15] urandom: good point, not sure.. joal?
[15:04:56] urandom: I'll try to study a bit cassandra compaction and then I'll re-read what you wrote, it makes sense but there are some question marks
[15:04:59] :D
[15:05:21] urandom: https://phabricator.wikimedia.org/T124314
[15:05:31] But it's not yet in our task list
[15:06:00] elukey: sorry, i'm never sure how much detail to go into in a venue like this; it's a rather broad subject
[15:06:11] elukey: i wonder if the ones that aren't restarted nicely are ones that would soon OOM
[15:06:56] ottomata: nah because it was happening for each restart that I did in the past, I wouldn't worry
[15:07:04] oh really?
[15:07:07] yep
[15:07:07] not all of them do that though
[15:07:37] I also tried with a higher timeout in salt, but those 5 seconds are somewhere in the stop script probably (never checked)
[15:07:51] urandom: nope I have to thank you, leaning tons of things :)
[15:10:06] ah elukey the stats user commits to gerrit via http auth
[15:10:07] not ssh
[15:10:34] hmmm geowiki pull?
[15:10:34] hmm
[15:10:37] need to look into this
[15:10:45] ahh so many things going on at once this morn! :)
[15:11:12] ottomata: yeah I patched it on the fly to make things working again!
[15:11:26] joal: what are typical AQS queries like? what kind of time range do they cover?
[15:11:40] Analytics-Cluster, Operations, Traffic, HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203508 (Ottomata) +1 should be fine to do.
[15:12:48] urandom: we have not looked into those stats much, so can't say
[15:16:04] joal: that would influence the sstables/read too, if the data spanned very large periods of time, then it follows it will be spread across more files
[15:16:09] urandom: I can do an analysis onto that
[15:17:10] Analytics, Operations, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2203525 (Eevans)
[15:18:18] joal: keep me in the loop, super interested :)
[15:20:46] ottomata: have you restarted analytics1055 by any chance?
[15:22:14] Analytics-Kanban: backfill pageviews for iOs app - https://phabricator.wikimedia.org/T132589#2203538 (JAllemandou)
[15:23:27] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203553 (JAllemandou)
[15:23:48] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203565 (JAllemandou) a:JAllemandou
[15:27:17] o/ joal
[15:27:22] I just saw https://github.com/wikimedia-research/research-cluster/pull/5
[15:27:24] \o/
[15:27:27] \o halfack :)
[15:27:33] -c
[15:27:34] :D
[15:27:42] halfak: currently applying it to frwiki 20160305
[15:27:47] Great!
[15:28:14] halfak: will let you know if successful tomorrow or the day after (depending of how fast the thing runs)
[15:28:41] mwarf, sorry halfak (i do the c one regularly ...)(
[15:29:10] joal, cool. Sounds great.
[15:35:46] Analytics-Cluster, Operations, Traffic, HTTPS, Patch-For-Review: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2203641 (BBlack) Open>Resolved a:BBlack
[15:46:15] yurik: Hi
[15:46:16] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2203677 (Ottomata)
[15:47:08] yurik: the jobs you're launched today gave pain to the cluster
[15:47:29] yurik: one by one instead of everything at once would be better
[15:49:56] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2203682 (Nuria)
[15:50:34] (CR) Milimetric: Change float to Decimal in dynamic_pivot.py (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/282728 (owner: Mforns)
[15:50:36] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203684 (Nuria)
[15:50:51] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (Nuria)
[15:50:54] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203696 (Nuria)
[15:51:26] Analytics: Move datasets.wikimedia.org to analytics.wikimedia.org/datasets - https://phabricator.wikimedia.org/T132594#2203684 (Nuria) Move datasets.wikimedia.org to analytics.wikimedia.org/datasets
[15:53:22] Analytics: Get piwik stats for dashiki - https://phabricator.wikimedia.org/T126247#2203708 (Nuria) Open>Resolved
[16:00:10] Stuck in traffic again a-team, running few minutes late to standup
[16:02:18] madhuvishy: k, please send e-scrum if you think you cannot make it
[16:03:00] Analytics-Cluster, Analytics-Kanban: Experiment with new Kafka versions and verify that they work with existing clients - https://phabricator.wikimedia.org/T132595#2203733 (Ottomata)
[16:03:20] nuria: I'll make it I think
[16:05:37] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2203754 (Ottomata)
[16:05:39] Analytics-Cluster, Operations, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2203753 (Ottomata)
[16:06:10] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2060494 (Ottomata) Let's do this before we move Hive/Oozie to analytics1003 in T130840
[16:06:12] Analytics-Kanban: Upgrade scripts to facilitate wiki data loading / treatment on hadoop - https://phabricator.wikimedia.org/T132590#2203760 (JAllemandou) https://github.com/wikimedia-research/research-cluster/pull/5
[16:07:27] Analytics, DNS, Operations, Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (BBlack) If we're doing this in production, the frontend should probably be through cache_misc. I'm not sure what the backend looks like at all role/software-wise...
[16:10:21] Analytics: Better response times Pageview API - https://phabricator.wikimedia.org/T124314#2203782 (Eevans) I briefly looked at the Cassandra cluster, and the typical column family read latency (this just the time Cassandra records for local storage-system reads) is quite high. {F3869428} {F3869443} At lea...
[16:11:26] Analytics-Kanban: Better response times Pageview API - https://phabricator.wikimedia.org/T124314#2203783 (JAllemandou) a:JAllemandou
[16:35:34] !log rebuilding raid1 array on aqs1001 after hot swapping sdh
[16:35:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[16:52:51] urandom: I'm going to invest more time on compaction, trying to help you
[16:53:26] urandom: will ping you again tomorrow to discuss around the things I'll have understand
[16:54:39] joal: ok, let me know!
[16:55:20] urandom: about upgrade, nothing more than what we discussed: waiting for new nodes on our side, and since no ETA, difficult to really plan
[16:58:19] a-team, logging off for today !
[16:58:25] see you tomorrow :)
[16:58:28] nite!
[16:58:33] bye joal!
[17:21:29] logging off team! byyeeee
[17:42:27] milimetric: 20 daily visits to browser reports WOW!
[17:42:51] Analytics, Hovercards, Reading-Web-Sprint-70-Lady-and-the-Trumps, Reading-Web-Sprint-71-m: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204138 (bmansurov)
[17:43:01] milimetric: although we have not announced it yet though
[17:43:05] Analytics, Hovercards, Reading-Web-Sprint-70-Lady-and-the-Trumps, Reading-Web-Sprint-71-m: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2105269 (bmansurov) a:bmansurov
[17:45:36] Analytics-Kanban, Commons, Multimedia, Wikidata, and 2 others: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452#2204148 (Yurik) The above patch allows tabular data storage with string, numeric, and "localized"...
[18:00:43] nuria_: yeah, we better watch out, it's gonna get popular. That's what we get for building useful tools
[18:00:53] have you all seen this: http://classicprogrammerpaintings.tumblr.com/
[18:00:57] it's *amazing*
[18:02:12] “Junior programmer learns `git rebase –interactive`“ - Salvador Dalí, 1936, Oil on Canvas
[18:02:19] jajaja AWESOME
[18:03:01] SUPER AWESOME
[18:07:46] Analytics, Hovercards, Unplanned-Sprint-Work, Reading-Web-Sprint-70-Lady-and-the-Trumps: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204242 (MBinder_WMF)
[18:11:35] Analytics, Hovercards, Unplanned-Sprint-Work, Reading-Web-Sprint-70-Lady-and-the-Trumps: Capture hovercards fetches as previews in analytics - https://phabricator.wikimedia.org/T129425#2204276 (bmansurov) How should we treat the user's DoNotTrack setting in this context? I suppose the header will...
[18:15:38] yoo elukey still there?
[18:20:51] madhuvishy: looking at beeline now
[18:20:56] ottomata: coool
[18:21:06] i haven't tested the puppet patch itself
[18:21:31] ja, i think cdh/hive.yaml is the wrong place for those settings but ja
[18:21:38] oh
[18:21:46] ya i wasn't sure where to put it
[18:23:47] probably a new file uhhh role/analytics_cluster/hive/client.yaml i thiink
[18:24:06] then you can just look them up in the role by their var names
[18:24:15] probably just hive_server_host is good
[18:24:29] and, you don't need a new role class for the beel,ine script
[18:24:38] just render it from hive::client role
[18:24:51] hm actually
[18:25:05] ahh
[18:25:10] you can make the values of hive_server_host default to metastore_host
[18:25:13] so
[18:25:13] uhh
[18:25:28] hiera('hive_server_host', hiera('cdh::hive::metastore_host') maybe
[18:25:30] i think that woudl work
[18:25:31] actually
[18:25:36] no, you don't have to look it up in hiera
[18:25:58] you can grab it from the inclusion of the hive::client class directly in puppet
[18:25:59] pretty sure
[18:26:03] so in hive::client role
[18:26:04] ottomata: okay
[18:26:14] hiera('hive_server_host', $::cdh::hive::metastore_Host)
[18:26:29] looking at script...
[18:26:33] ottomata: okay
[18:26:38] i'll make the changes
[18:28:55] oh, madhuvishy btw, the templates in the role module live in modules/role/templates/...
[18:29:02] not modules/role/manifests/...
[18:29:06] aah
[18:29:08] that's why
[18:29:09] okay
[18:29:13] will move it
[18:29:17] k :)
[18:29:47] mm, since you are doing python might as well pep8 /lint this too
[18:38:37] ottomata: as in just run a check?
[18:38:54] puppet will do it, but ja you got lines > 80 chars, etc.
[18:39:59] oh
[18:40:19] right, fixing
[18:56:41] ottomata: pushed new patch
[19:05:58] madhuvishy: i might have a sneakier way of parsing these args...
[19:06:14] ottomata: ha ha what
[19:10:51] madhuvishy: https://gist.github.com/ottomata/2d4fdd5edeb838bac9666781e1b569dc
[19:11:02] oh lemme add some comments
[19:11:43] HMM oh wait
[19:11:47] might not work....
[19:12:25] ottomata:
[19:12:26] well
[19:12:43] I thought docopt won't let you accept other args
[19:12:48] that you don't specify
[19:13:21] ja i have been testing this with non opt args
[19:13:26] ones without - or --
[19:13:27] that works fine
[19:13:39] buuut ya i think this doesn't work if you want to give it more opts with --
[19:13:41] without specificying them
[19:13:42] :/
[19:13:45] yup
[19:13:49] that's why
[19:13:56] doh!
[19:13:58] k
[19:14:04] i had to stick to list of strings
[19:21:11] ottomata: you are also missing that long options have a different format than short options
[19:21:24] can't be space separated, have to be option=value
[19:21:52] aye ja
[19:22:02] indeed
[19:29:49] madhuvishy: added some comments on the script on patch set 2
[19:29:57] i may be not understanding the code, so correct me if i'm wrong
[19:37:57] ottomata: okay
[19:44:02] ottomata: o/ just seen the message
[19:44:45] elukey: hey meant to ask before standup started if you had time to do the hue in mysql task
[19:46:08] ottomata: haven't looked at it yet but I can start tomorrow
[19:46:40] ok awesome! i'll assign to you thank Youuu!
[19:47:00] Analytics, Analytics-Cluster: Use MySQL as Hue data backend store - https://phabricator.wikimedia.org/T127990#2204673 (Ottomata) a:elukey
[19:50:41] ottomata: you're right - fixed and updated patch
[19:51:20] ottomata: mmmm but where should I put the mysql db? On the same host?
[19:52:48] elukey: i think you can use it remotely...if that is possible
[19:52:57] the one that is already running hue and oozie is fine
[19:53:01] sorry
[19:53:08] the mysql db that is used for oozie and hive
[19:53:27] ahhhh okok adding a schema in there + tables
[19:53:35] all right I'll double check tomorrow :)
[19:54:08] ja, i'm not sure how to do it, but i think there are good instructions in cloudera docs
[19:54:15] yep yep
[19:54:20] wherever they talk abouyt installing hue
[19:54:35] will probably have to set it up in labs to test :/
[19:55:11] madhuvishy: this looks good!
[19:55:49] ottomata: cool! can we test it somehow before merging?
[19:55:58] i'm guessing no
[19:56:08] hm, could im running puppet compiler on it now
[19:56:12] also ottomata I acked a bit around the puppet compiler, I found a procedure to make the submodules work.. basically git submodule update --init as first step, cd modules/submodule_name dir, git fetch and checkout (that created a DETACHED HEAD commit), then git checkout -B test, then cd ../../ and finally git add modules/submodule_name
[19:56:14] i think that will be good enough for this
[19:56:16] ya okay
[19:56:35] a bit of a hack but works doing it manually
[19:56:44] elukey: with a new submodeul?
[19:56:54] nono existing one
[19:57:01] to test a new change
[19:57:05] hm, why is --init needed?
[19:57:30] I think that each time the puppet dirs are re-recreated in temp
[19:57:33] ohhh
[19:57:35] ja
[19:57:37] makes sense
[19:57:41] so I assumed to be a fresh start each time
[19:57:44] never hurts :)
[19:57:51] why -B test?
[19:58:06] and why git add?
[19:58:16] it creates a branch with the new change on top, that will be recognized as submodule change
[19:58:28] rather than relying on a DETACHED HEAD
[19:58:42] (basically only a commit referring to the last SHA as parent)
[19:59:39] madhuvishy: https://puppet-compiler.wmflabs.org/2428/stat1002.eqiad.wmnet/change.stat1002.eqiad.wmnet.err
[19:59:56] is that needed to test puppet?
[19:59:56] ottomata: ah
[19:59:59] elukey: ?
[20:00:28] yeah I know it sounds weird, but it is the only clean way I found to trick git :P
[20:00:43] madhuvishy: role/analytics_cluster/...
[20:00:52] oh
[20:00:53] right
[20:01:01] hm, elukey won't it just let you checkout the sha directly?
[20:01:09] and then run puppet?
[20:01:12] puppet doesn't care about git
[20:01:14] no?
[20:02:07] ottomata: not sure, could also be an option, I wanted to be super sure that the change was recognized in a clean state.. but I'll do some testing tomorrow!
[20:02:27] ok cool!
[20:02:40] talk with you tomorrow!!
[20:02:42] byyeeee!
[20:02:55] madhuvishy: keep fighting puppet :)
[20:03:19] ottomata: running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/2429/
[20:03:28] elukey: I will! good night :)
[20:04:02] elukey: byee
[20:05:48] ottomata: looks good i think? https://puppet-compiler.wmflabs.org/2429/stat1002.eqiad.wmnet/change.stat1002.eqiad.wmnet.err
[20:06:25] ja looks good!
[20:06:28] let's try it!
[20:06:42] ottomata: coool
[20:07:04] uhh
[20:07:05] ottomata:
[20:07:08] ja?
[20:07:17] oh
[20:07:18] nothing
[20:07:26] minor mental scare - but it's correct
[20:08:37] hah ok
[20:14:21] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204736 (ezachte) I can't login right now to check. The vast majority of that 2TB will be backups, which I thin out every half year or so. All html files in htdocs should be co...
[20:14:37] madhuvishy: running puppet
[20:14:48] excited
[20:15:02] madhuvishy: maybe try it from stat1004! :)
[20:15:07] ooh
[20:15:22] i haven't logged in to that instance yet
[20:18:57] Hmmm seems to hang?
[20:19:32] ottomata: oh i can't get into both stat instances now - something's up with my ssh - fixing
[20:19:38] bast1001 changed
[20:19:39] is all
[20:19:43] it was reinstalled
[20:19:51] yeah
[20:19:53] fixed now
[20:21:08] ottomata: hmmm
[20:21:13] why is it hanging :/
[20:21:14] hangs for you too?
[20:21:15] yeah dunno
[20:21:39] madhuvishy: you had it working on stat1002 manually before?
[20:21:46] ya
[20:21:48] hm
[20:21:53] i didn't test the very last changes though
[20:22:06] ja ut the args look good i thikn
[20:22:23] OH
[20:22:28] madhuvishy: unqualified poath i think
[20:22:36] its prob infinite looping
[20:22:41] execing /usr/local/bin/beeline
[20:22:45] since that is now in the path
[20:22:50] you need to make it exec /usr/bin/beeline
[20:22:53] instead of just 'beeline'
[20:23:33] just tested, ja that works
[20:23:37] you patch I merge?
[20:23:41] ohh
[20:23:53] of course
[20:24:02] so no need to execvp
[20:24:54] patching
[20:25:55] oh eh?
[20:25:56] execvp needed
[20:26:02] you just need to execvp /usr/bin/beeline
[20:26:04] not just 'beeline'
[20:26:08] ottomata: p is for path
[20:26:12] oh
[20:26:13] ok
[20:26:14] i put that so it'd figure it out
[20:26:18] but obviously
[20:26:23] not if same name
[20:26:23] oh ok, still execing htough ok
[20:26:28] yes yes
[20:26:32] execv
[20:28:21] ottomata: patched
[20:29:19] madhuvishy: that works? not /usr/bin/beeline for first arg to execv too?
[20:29:43] gah
[20:29:47] of course
[20:29:57] no my bad
[20:30:38] ottomata: sorry fixed. thank god for CRs
[20:33:58] :) running puppet
[20:37:28] coooll madhuvishy works!
[20:37:30] ottomata: yay it works
[20:37:31] :D
[20:44:24] HaeB: you can now do just 'beeline' and the connection string and user are set automatically
[20:44:38] it also defaults outputformat to tsv2
[20:45:09] i don't like the table thing beeline does by default otherwise
[20:46:19] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2204894 (Dzahn) If that is mostly HTML and text let me try to compress that, we should achieve a high compression ratio and maybe it's not so bad then.
[21:06:58] Analytics, Analytics-Cluster: Debianize Kafka 0.9 - https://phabricator.wikimedia.org/T132631#2204960 (Ottomata)
[21:07:05] Analytics, Analytics-Cluster: Debianize Kafka 0.9 - https://phabricator.wikimedia.org/T132631#2204975 (Ottomata)
[21:17:54] milimetric: There is a repo request for https://phabricator.wikimedia.org/T120497
[21:18:07] The developer requested 'wikistats' as name for the repo.
[21:18:17] oops, I missed that
[21:18:21] ha, yeah, that won't work :)
[21:18:30] I told him that this name already has a meaning for WMF.
[21:18:55] So he suggested:
[21:19:13] wikipagestats
[21:19:23] Which is different, but still close.
[21:19:40] Since the task reads like the A-team is somehow in the mix here ...
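Putting the pieces of the beeline-wrapper discussion together: docopt cannot pass through unknown --options, long options have to be forwarded as --option=value rather than space-separated, and the wrapper must exec the real binary by absolute path or it re-execs itself, since /usr/local/bin/beeline shadows /usr/bin/beeline on the PATH. A minimal sketch of such a wrapper; the JDBC host below is a placeholder and the real script in operations-puppet may differ in its defaults.

    #!/usr/bin/env python
    # Sketch of a /usr/local/bin/beeline wrapper: prepend default connection
    # options, forward user args untouched as an opaque list of strings, and
    # exec the real beeline by absolute path (a bare 'beeline' would resolve
    # back to this wrapper and loop forever).
    import getpass
    import os
    import sys

    HIVE_SERVER = "hive-server.example.eqiad.wmnet"  # placeholder hostname

    defaults = [
        "-u", "jdbc:hive2://{0}:10000".format(HIVE_SERVER),
        "-n", getpass.getuser(),
        "--outputformat=tsv2",   # long options must be --option=value
    ]

    os.execv("/usr/bin/beeline", ["/usr/bin/beeline"] + defaults + sys.argv[1:])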
[21:19:53] https://www.mediawiki.org/wiki/Git/New_repositories/Requests
[21:19:58] ^ there's the repo request
[21:20:29] https://www.mediawiki.org/wiki/User_talk:QChrisNonWMF#Ambigous_name_wikistats_for_Gerrit_Repo
[21:20:41] ^ is the response to my pushing back on the name "wikistats"
[21:20:59] I figured the A-team might have an opinion about where this new repo should live
[21:21:11] (e.g.: underneath 'analytics/*' ?)
[21:21:17] and maybe also what it should be called?
[21:21:27] hm :/ thanks for catching this, I would've rather Eggeda talked about it on the phab ticket so we could all see
[21:21:41] I've been following that ticket, and promised to help, but hadn't heard anything and assumed the project was dead
[21:21:52] Repo requests are still on wiki. It's still a split world.
[21:22:09] ok, no prob, I didn't know I'll watch that page now
[21:22:37] so... yeah, I'm not sure if it should live under analytics/
[21:22:40] No need to watch that page. I am doing the Repo requests. If something analytics related comes up, I'll just ping you.
[21:23:06] (It's mostly extensions etc)
[21:23:06] I don't necessarily want to sign up to maintain this new tool, and there's already another one built that is getting very popular and has pledged support from MusikAnimal
[21:23:28] Hi! anyone have an idea about how to take a quick, representative sample of event logs from hive over the course of 4-5 months?
[21:23:32] but then again, anywhere else would seem random
[21:23:54] If it ends up as labs tool, then we can just put it there.
[21:23:59] AndyRussG: tablesample? Depending on what fields you need, go against pageview_hourly instead of webrequest
[21:24:16] qchris: ok, so analytics/wikipagestats seems ok with me
[21:24:26] Ok. Cool. Then I'll create that.
[21:24:28] Thanks!
[21:24:32] milimetric: I dunno tablesample... Not web logs, EventLog event results... :)
[21:24:33] thank you for the ping
[21:24:41] oh EventLog...
[21:24:43] uh...
[21:25:32] AndyRussG: why Hive?
[21:25:51] madhuvishy: ah well no, doesn't have to be Hive. That's just where I've gotten these event results from before
[21:26:37] oh interesting - you used hive to query EL tables? that's a first :)
[21:26:45] not a first! people do it!
[21:26:47] ummm
[21:26:49] maybe not..
[21:26:49] haha
[21:26:49] ellery uses spark!
[21:26:55] madhuvishy: yes, it's sort of a first, Andy is querying the only blacklisted schema
[21:26:57] ottomata: :P
[21:27:02] right
[21:27:03] ahhh :)
[21:27:07] (ellery's getting the same data)
[21:27:17] ok, so where is that in Hive?
[21:27:19] that reason i get
[21:27:22] I don't see any tables in wmf or wmf_raw
[21:27:38] milimetric: there are no tables
[21:27:45] madhuvishy: maybe I'm remembering wrong
[21:27:48] you have to make them
[21:27:50] ah, spark is the only way then, I see
[21:27:55] no you can use hive
[21:27:57] if you make tables
[21:27:59] milimetric: no - you can use a UDF
[21:28:05] but they will be external tables on json
[21:28:06] let me find wiki
[21:28:08] spark is easier
[21:28:19] does spark have something like tablesample?
[21:28:20] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive
[21:28:29] it has sample
[21:28:34] http://spark.apache.org/docs/1.5.0/programming-guide.html
[21:28:38] sample(withReplacement, fraction, seed)
[21:28:41] but
[21:28:42] I don't actually ahve to sample, I mean, I could just run a loooong query
[21:28:42] in either case
[21:28:50] its not going to sample smartly
[21:28:57] because the data is just in json
[21:29:03] not bucketed in anyway
[21:29:08] like the refined webrequest data
[21:29:19] AndyRussG: 4-5 months worth data? hmmm doesn't seem like it'll go well for the cluster
[21:29:20] so, any sampling you are doing will still have to read all the data (pretty sure)
[21:29:26] right, but if you make a hive table partitioned by time you can at least sample over time
[21:29:27] well, eventlogging data mioght not be that bigt
[21:29:40] yes, it's not huge, but it's not tiny
[21:29:41] true
[21:29:47] which schema AndyRussG?
[21:30:12] CentralBannerHistory
[21:30:12] OO, AndyRussG you can be our first guinea pig! can you log into stat1004.eqiad.wmnet?
[21:30:18] new server, dedicated for working with hadoop stuff
[21:30:22] CentralNoticeBannerHistory
[21:30:28] so you don't have to compete with resources when folks arerunning heavy local stuff on stat1002
[21:31:08] for reference, the only blacklisted one: https://github.com/wikimedia/operations-puppet/blob/f4420ead5ea65e5ccf92e27ab5fe726cb38c0660/hieradata/common.yaml#L282
[21:31:40] 8.1 G in 2016
[21:31:46] not too bad
[21:31:58] oh true, why'd we blacklist that? :)
[21:32:11] oh well, right
[21:32:12] we prune
[21:32:14] that's just last 3 months i think
[21:32:22] true
[21:32:26] right, that's still a lot smaller than Edit
[21:32:31] so they don't really have 4 months data
[21:32:35] * milimetric brb, seems like yall got this
[21:32:35] aye, dunno, maybe it used to be bigger?
[21:32:38] anywhere
[21:32:49] aye
[21:32:53] ottomata: ya may be during fundraising times
[21:33:09] AndyRussG: so my advice is to use Hive and TABLESAMPLE, here's how: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
[21:33:15] we don't have december data now
[21:33:21] which was probably huge
[21:33:30] but maybe we kept that for this schema?
[21:33:37] milimetric: no
[21:33:39] unless ellery was done with it
[21:33:59] oh? i don't think we have special rules for not dropping EL data in Hive
[21:34:01] don't remember any special casing
[21:34:02] yeah
[21:34:08] k, so then we don't have 4-5 months
[21:35:29] oh hmm I thought it was kept a while!
[21:36:31] madhuvishy: milimetric: ottomata: didn't realize this was such a fun and extended question! I'm actually starting a meeting, but I'll see any backscroll, and hopefully we can talk more soooon! Thanks so much !!! :)
[21:41:02] AndyRussG and all: i would be interested to learn how it goes (perhaps let Analytics-l know?); would be eager to use Hive for EL, too. Back in January, mforns successfully walked me through the steps at https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive , but then we got stuck at aggregating the resulting partitions into one queryable table
[21:44:36] Analytics-Kanban, Operations, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2205087 (Ottomata) > 2.0T wikistats Ja, this is why we don't backup! Too big! stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Migh...
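A rough sketch of the Hive + TABLESAMPLE route milimetric suggests above: create an external table over the raw EventLogging JSON, add time partitions, and sample on rand(). Everything below is illustrative only; the SerDe, HDFS location and columns are placeholders rather than the exact recipe on the wikitech page, and partitions still need to be registered (e.g. ALTER TABLE ... ADD PARTITION) before querying.

    # Sketch: external Hive table over raw EventLogging JSON plus a ~1% sample,
    # run through beeline (which now carries the connection defaults).
    import subprocess

    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS andyrussg.CentralNoticeBannerHistory (
      `event` string, `dt` string, `webhost` string
    )
    PARTITIONED BY (year int, month int, day int, hour int)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/path/to/raw/eventlogging/CentralNoticeBannerHistory'
    """

    SAMPLE = """
    SELECT *
    FROM andyrussg.CentralNoticeBannerHistory
      TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand())
    WHERE year = 2016
    """

    for hql in (DDL, SAMPLE):
        subprocess.check_call(["beeline", "-e", hql])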
[21:45:56] AndyRussG: I think spark will be easier than hive
[21:46:02] less dealing with partitions
[21:46:34] ottomata: ok I've never used spark yet, but that sounds fun to work on!
[21:48:48] madhuvishy: (beeline) great! appreciating not having to look up the parameters
[21:49:38] ... and i appreciate the switch to TSV too... those ASCII art tables look nice, but can't be posted directly into a spreadsheet
[21:49:59] HaeB: :) Yes! annoying to specify everytime too
[21:50:41] AndyRussG: spark is fun! its fun to play with on the REPL
[21:50:58] ping me or madhuvishy if you are trying and want help
[21:51:13] I hope we can make all of this more fun with https://wikitech.wikimedia.org/wiki/PAWS/Internal
[21:51:15] ottomata: fantastic thx, will do fer sure
[21:51:49] madhuvishy: cool!
[21:55:26] (PS1) Milimetric: Allow optional exclusion of columns from parsing [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579)
[21:55:56] (CR) Milimetric: [C: 2 V: 2] "self-merging 'cause this is a prod bug. Feel free to review after the fact." [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579) (owner: Milimetric)
[21:56:29] (CR) Milimetric: "This goes with this diff: https://meta.wikimedia.org/w/index.php?title=Config%3ASimpleRequestBreakdowns&type=revision&diff=15524226&oldid=" [analytics/dashiki] - https://gerrit.wikimedia.org/r/283336 (https://phabricator.wikimedia.org/T132579) (owner: Milimetric)
[22:13:28] madhuvishy: oh, had not seen that you had written this up into a proposal. that would be so awesome!
[22:14:00] HaeB: I'm gonna be experimentally working on it this quarter - will see how it goes.
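And the Spark route ottomata recommends, roughly: read the raw EventLogging JSON for the schema directly, let Spark infer the structure, and take a small random sample with sample(withReplacement, fraction, seed) as in the 1.5 programming guide linked earlier. A sketch for the pyspark shell (where sqlContext is predefined); the HDFS paths are placeholders.

    # Sketch for the pyspark REPL (Spark 1.5-era API): sample ~1% of the
    # CentralNoticeBannerHistory JSON events.
    df = sqlContext.read.json(
        "hdfs:///path/to/raw/eventlogging/CentralNoticeBannerHistory/*/*")

    sampled = df.sample(withReplacement=False, fraction=0.01, seed=42)
    print(sampled.count())
    sampled.write.json("hdfs:///user/andyrussg/cnbh_sample")  # or .toPandas() if small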