[00:53:36] Analytics-Wikimetrics, MediaWiki-Vagrant: Vagrant Setup alembic config errors - https://phabricator.wikimedia.org/T99631#1295554 (bd808)
[00:55:09] Analytics-Wikimetrics, MediaWiki-Vagrant: Vagrant Setup alembic config errors - https://phabricator.wikimedia.org/T99631#1295505 (bd808) The interesting part of the Puppet trace is: ``` ==> default: Notice: /Stage[main]/Wikimetrics::Database/Exec[alembic_upgrade_head]/returns: Traceback (most recent call...
[03:04:54] Analytics-Cluster, operations: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1295682 (Gage) I followed this: https://git.wikimedia.org/blob/operations%2Fpuppet.git/2cdd08f9686b040816bd0dd8e63e712f4b084a7a/modules%2Fpackage_builder%2FREADM...
[03:05:15] Analytics-Cluster, operations: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1295683 (Gage)
[13:06:16] Good morning milimetric :)
[13:06:22] morning
[13:06:39] I don't see aaron around
[13:06:42] Aaron not being here, do we cancel the meeting or do you want us to talk ?
[13:06:44] so I figure the meeting's off
[13:06:52] he sent an email :)
[13:06:56] oh, sorry, didn't read
[13:07:00] np :)
[13:07:02] nah, let's cancel, I could use the time
[13:07:10] Cool
[13:07:15] Talk later then :)
[13:07:42] joal: if you want to talk, of course I have the time, btw
[13:07:53] So far so good :)
[14:29:52] moooornin
[14:29:54] joal: you around?
[14:30:00] Heya !
[14:30:03] I am :)
[14:30:07] helloOooO
[14:30:13] Was waiting for you to ping, not to be invasive :)
[14:30:18] hehe :)
[14:30:32] How are you doing ?
[14:30:39] thanks, slow morning for me. went to cafe, forgot my bike lock, had to run back home to get it
[14:30:41] now settled in at cafe!
[14:30:52] am doing gooood
[14:30:57] how you doiiiin?
[14:31:08] how's baby joal?
[14:31:16] Lino is great :)
[14:31:37] Smiling more every day, and shouting when hungry, like his father :D
[14:31:40] :)
[14:32:00] And he lets us sleep not too badly, which is really cool
[14:32:24] I'd advise anybody who wants a really incredible adventure to have a child :D
[14:32:29] that is great! you should bring baby lino to mexico! :)
[14:32:56] I am sure he'd love it, but my concentration would be completely gone :)
[14:33:00] ha
[14:33:10] maybe nuria can bring her little one too!
[14:33:16] SOOOooOOo um jajjajajaj
[14:33:17] Next year, depending on the place maybe
[14:33:17] this guy:
[14:33:19] https://gerrit.wikimedia.org/r/#/c/209642/
[14:33:48] do you think my worry is legitimate?
[14:33:52] the comment?
[14:33:57] Just read it
[14:34:11] So let's math a bit
[14:34:25] 2G / container by default
[14:34:35] see comment in file on line 211
[14:34:36] 64g per new machine (right ?)
[14:34:44] ya
[14:34:45] - 12G for system ?
[14:35:12] Isn't there some setting to reserve memory space for the system ?
[14:35:21] I recall something like that
[14:35:24] elsif $::memorysize_mb <= 73728 {
[14:35:24] $reserve_memory_mb = 8192
[14:35:24] }
[14:35:34] k
[14:35:34] $available_memory_mb = $::memorysize_mb - $reserve_memory_mb
[14:35:41] # Since I have chosen a static $memory_per_container of 2048 across all
[14:35:41] # node sizes, we should just choose to give NodeManagers
[14:35:41] # $available_memory_mb to work with.
[14:35:41] # This will give nodes with 48G of memory about 21 containers, and
[14:35:41] # nodes with 64G memory about 28 containers.
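For context, here is a small Python sketch of the memory math the pasted Puppet snippet performs. The 8192 MB reservation tier and the flat 2048 MB per container come from the paste; the reservation for nodes above 72 GiB and the function name are assumptions for illustration only.

```python
# Sketch of the NodeManager memory sizing discussed above (illustrative).
def nodemanager_containers(memorysize_mb, memory_per_container_mb=2048):
    if memorysize_mb <= 73728:          # <= 72 GiB tier, per the pasted elsif
        reserve_memory_mb = 8192
    else:
        reserve_memory_mb = 16384       # assumed tier for larger nodes

    available_memory_mb = memorysize_mb - reserve_memory_mb
    return available_memory_mb // memory_per_container_mb

# 48 GiB node -> 20-21 containers, 64 GiB node -> 28 containers,
# matching the "about 21" and "about 28" figures in the Puppet comment.
print(nodemanager_containers(48 * 1024))   # 20
print(nodemanager_containers(64 * 1024))   # 28
```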
[14:36:31] this is probably relevant too: https://phabricator.wikimedia.org/T90640
[14:36:33] 28 containers but 24 (virtual) procs, right?
[14:37:28] yarn_nodemanager_resource_cpu_vcores - Default: max($::processorcount - 1, 1)
[14:37:49] I think it could cause trouble
[14:39:17] trouble = slow down jobs for bad reasons (waiting for an allocated container with no CPU to finish while the rest of containers have finished)
[14:40:02] It really depends on the jobs though --> IO bottlenecked jobs, no problems
[14:40:04] well, except that mostly yarn schedules based on memory
[14:40:09] CPU bound jobs, problems
[14:40:12] bu ya
[14:40:13] but ya
[14:40:31] cpu load would likely increase a lot, hm. but wait, what is it now?
[14:40:34] let's see
[14:40:47] That's why you could have trouble: some containers scheduled, but no real CPU resource available
[14:41:07] can you share your url for cluster load ?
[14:41:16] ah, right now
[14:41:26] it is about 21 containers per node no matter what
[14:41:52] joal: this has more than just worker nodes in it, but:
[14:41:53] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false
[14:42:21] psshhhhh
[14:42:24] 1017 is out of cluster
[14:42:33] why no balancer! :)
[14:42:35] checking
[14:42:59] k
[14:43:02] what, it hasn't run since the 10th
[14:43:08] or, hasn't logged since then
[14:43:10] checking cron logs
[14:43:12] wow
[14:43:27] No unhealthy node while you weren't here though
[14:43:39] ok phew, so not that long
[14:43:42] This is my way of checking if the balancer runs
[14:43:44] maybe it's only been like that for a bit
[14:44:00] I should make an alert for unhealthy nodes
[14:44:01] that would be easy
[14:44:07] yup
[14:44:12] Would like that as well
[14:44:13] i do see it attempted to run today at least
[14:46:00] hmmhm, i think it's been running
[14:46:25] hm i think it's just not logging well
[14:47:13] ok
[14:47:20] weird :(
[14:47:57] yeah ah its output is mostly on stderr, fixing that.
[14:48:13] Anyway
[14:48:47] so ja
[14:48:55] right now, i think all nodes get about 21 containers
[14:49:20] hm, wonder if it says that somewhere
[14:49:25] total # containers in cluster
[14:49:59] whoa, joal, did you deploy pageviews_hourly job?
[14:49:59] this number is due to memory calculation, correct ?
[14:50:05] in production?
[14:50:07] yes
[14:50:28] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hadoop.pp#L187
[14:50:33] ottomata: used an oozie job in my folder, but ran it in production queue to try to get some resources
[14:50:51] ah cool
[14:50:56] just saw it and was like whoooaaoo COOOL
[14:50:57] :)
[14:51:02] ja, so
[14:51:06] This is another topic I'd like to discuss: spark fighting for resources
[14:51:08] the previous settings i was using
[14:51:08] did
[14:51:13] # number of containers = min (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
[14:51:17] it's a min
[14:51:22] and since all nodes have the same # of disks
[14:51:24] (12)
[14:51:38] floor(1.8*12) = 21.6
[14:51:40] sorry
[14:51:40] 21
[14:51:49] ok
[14:52:06] That is good for IO
[14:53:03] But assuming that non-IO bound containers will be evenly distributed, we can probably add some containers in there, based on RAM
[14:53:22] hm
[14:53:24] leading to --> your modification
[14:53:29] right ?
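To make the older heuristic concrete, here is a small Python sketch of the min() formula quoted at 14:51:13; the function and argument names are illustrative, not the actual Puppet code.

```python
import math

# Sketch of the previous container-count heuristic quoted above.
def container_count(cores, disks, available_ram_mb, min_container_size_mb=2048):
    return int(min(2 * cores,
                   math.floor(1.8 * disks),
                   available_ram_mb // min_container_size_mb))

# With 12 disks per node the 1.8 * DISKS term wins: floor(1.8 * 12) = 21,
# which is why every node ended up with about 21 containers regardless of RAM.
print(container_count(cores=12, disks=12, available_ram_mb=57344))  # 21
```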
[14:53:35] right, but the worry is allocating more than cores now
[14:53:43] yep
[14:53:51] That's not a very good idea
[14:53:52] cpu load looks high on those newer nodes now
[14:53:56] but, that's because it is a %
[14:54:05] and hyperthreading is off, so it looks like they have 12 cores
[14:54:11] assuming they each get 21 containers
[14:54:20] 21/12 > 100%
[14:54:24] (that's if YARN maxed them out)
[14:54:37] I think we should set up hyperthreading, don't you think ?
[14:55:02] i've googled this before, doing it again, don't remember which
[14:55:06] it should def be consistent
[14:55:12] yeah
[14:55:51] OO
[14:55:53] this is our hw
[14:55:54] http://www.slideshare.net/lhrc_mikeyp/ai10-optimizing-poweredgehadoopvfnl
[14:56:45] http://image.slidesharecdn.com/ai10optimizingpoweredgehadoopvfnl-130608190555-phpapp01/95/optimizing-dell-poweredge-configurations-for-hadoop-15-638.jpg?cb=1370718739 • Assume 1.5 Hadoop Tasks per physical core – Turn Hyperthreading on – This allows headroom for other processes
[14:56:45] ?
[14:56:49] sure then let's turn it on. ;)
[14:56:52] :)
[14:58:18] ok. joal. let's do this: I merge this puppet change as is, but I will reboot each nodemanager and turn on hyperthreading and fix that ticket as we go
[14:58:21] And then, if they say 1.5 containers per core, 21 seems even a bit high, no ?
[14:58:28] 12*1.5
[14:58:30] oops
[14:58:31] ha
[14:58:32] yes
[14:58:46] Let's keep it this way and see, ok ?
[14:59:03] keep it what way?
[14:59:16] when you say fix that ticket, you're talking about the jdk version one ?
[14:59:27] things:
[14:59:37] jdk upgrade, that's for security. that will be fine
[14:59:43] k
[14:59:48] i meant to do that when we did the cluster upgrade
[14:59:50] but forgot
[14:59:54] that's easy
[14:59:58] :)
[15:00:00] hyperthreading, i should just turn it on
[15:00:04] ok
[15:00:17] made that ticket a while ago but didn't want to deal with it (have to reboot all nodes into bios, etc. really tedious)
[15:00:23] About memory, I like the idea of having predefined sizes
[15:00:38] so, you think we should try my change?
[15:00:56] I think we should, but I want to confirm
[15:01:04] confirm...that things don't explode?
[15:01:37] so. actually, something we are saying isn't exactly right.
[15:01:40] mostly it is
[15:01:44] so, we are tweaking mainly 2 settings
[15:01:52] 1. amount of RAM available to NMs
[15:01:57] and
[15:02:11] the default job RAM size
[15:02:26] but not the max number of containers :)
[15:02:26] the number of containers we are talking about comes from that
[15:02:54] let's batcave for a minute
[15:02:57] so, for example, we know that impala will likely allocate containers that are smaller than the default job size
[15:03:00] ok.
[15:04:29] Can't hear you :(
[15:05:33] You there?
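Putting the hyperthreading discussion above in numbers, here is a small Python sketch using the vcore default quoted at 14:37:28; the function name and argument names are illustrative only.

```python
# Sketch of the CPU oversubscription discussed above (illustrative).
def containers_per_vcore(containers, physical_cores, hyperthreading=False):
    # yarn_nodemanager_resource_cpu_vcores defaults to
    # max($::processorcount - 1, 1); with hyperthreading on, the OS
    # reports twice as many processors.
    processorcount = physical_cores * 2 if hyperthreading else physical_cores
    vcores = max(processorcount - 1, 1)
    return containers / vcores

# 21 containers on a 12-core node with hyperthreading off: 21 over 11 vcores,
# roughly the "21/12 > 100%" worry from the chat.
print(containers_per_vcore(21, 12))                       # ~1.9
# The slide's guideline of 1.5 tasks per physical core gives 12 * 1.5 = 18.
print(12 * 1.5)                                           # 18.0
# With hyperthreading on, 21 containers over 23 vcores stays under 1.0.
print(containers_per_vcore(21, 12, hyperthreading=True))  # ~0.9
```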
[15:06:06] Analytics-Dashiki: Dashboard Directory research: Look at Hay's directory - https://phabricator.wikimedia.org/T99675#1296343 (Milimetric) NEW
[15:11:45] Analytics-Kanban, Analytics-Visualization: Integrate Dygraphs into Vital Signs {crow} [13 pts] - https://phabricator.wikimedia.org/T96339#1296357 (Milimetric) p:Normal>Triage
[15:12:11] Analytics-Kanban, Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1296360 (Milimetric) a:Milimetric>None
[15:12:40] Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1296361 (Milimetric) a:Milimetric>None
[15:13:32] Analytics-Kanban: Fix analysis that categorizes users (anonymous is bad) {lion} - https://phabricator.wikimedia.org/T92300#1296369 (Milimetric) Open>Invalid This turned out to be untrue, the users are tagged correctly as far as we can see, it's just that the rates of anonymous usage of Visual Editor was...
[15:14:45] Analytics-Dashiki, Analytics-Kanban: Add time range selection to Limn dashboards (or new Dashiki dashboards) {frog} - https://phabricator.wikimedia.org/T87603#1296381 (Milimetric) p:Low>Normal
[15:14:58] Analytics-Dashiki, Analytics-Kanban: Add time range selection to Limn dashboards (or new Dashiki dashboards) {frog} - https://phabricator.wikimedia.org/T87603#994953 (Milimetric) p:Low>Normal
[15:41:20] Analytics-EventLogging, Analytics-Kanban: Backfill EL missing data for 2015-05-06 - https://phabricator.wikimedia.org/T98729#1296460 (mforns) Open>Resolved
[15:41:36] Analytics-EventLogging, Analytics-Kanban: ContentTranslationError event logging table is not receiving new events - https://phabricator.wikimedia.org/T98842#1296463 (mforns)
[15:41:38] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Troubleshoot EL performance problems on 2015-05-06 - https://phabricator.wikimedia.org/T98588#1296462 (mforns) Open>Resolved
[15:48:08] Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1296474 (Heather) What does up for grabs mean? :)
[15:51:55] Analytics-Wikistats: Wikistats list of Wikipedias incomplete - https://phabricator.wikimedia.org/T99677#1296482 (Mdennis-WMF) NEW
[16:02:38] Analytics-Wikistats: Wikistats list of Wikipedias incomplete - https://phabricator.wikimedia.org/T99677#1296522 (Mdennis-WMF)
[16:19:07] Analytics-EventLogging, Analytics-Kanban: Code to write a new Camus consumer and store the data in two Hive tables [21 points] {oryx} - https://phabricator.wikimedia.org/T98784#1296582 (Milimetric)
[16:19:17] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Configuration for Python & Kafka in Beta labs [8 points] {oryx} - https://phabricator.wikimedia.org/T98780#1296583 (Milimetric)
[16:19:25] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Forwarder modification to produce to multiple outuputs, 1 to zmq and 1 to Kafka [8 points] {oryx} - https://phabricator.wikimedia.org/T98779#1296587 (Milimetric)
[16:24:58] ottomata: let me know when you are ready for a chat
[16:26:44] k, getting lunch...
[16:26:51] Enjoy :)
[17:04:23] joal: HIyyy, i haven't started the nodes yet, but we can chat
[17:04:27] can we just chat here?
[17:04:33] sure :)
[17:05:11] So, let's start with ELhn
[17:05:37] The comment you made on the code review about key
[17:05:44] What do you think we should do there
[17:05:47] ?
[17:07:01] hn?
[17:07:03] commennnnnt
[17:07:05] looking
[17:10:46] joal: i forget, what does the event look like if it is raw?
[17:10:50] i think i am a bit out of the loop here
[17:11:00] :)
[17:11:07] it depends where it comes from
[17:11:24] why did you need to add raw for all these writers?
[17:11:43] well, only two :)
[17:12:12] 5, no?
[17:12:14] https://gerrit.wikimedia.org/r/#/c/210701/8/server/eventlogging/handlers.py
[17:12:26] raw was added as a parameter for 5 of them
[17:12:39] sorry
[17:12:40] no 4
[17:13:10] forgot I had added them to other writers
[17:13:13] let me explain
[17:13:49] before the change, the forwarder was treating raw events and using a zmq publisher without the writer abstraction
[17:14:24] having the writer abstraction, we need to tell those writers if the events we write are raw or not
[17:14:46] And I added the raw param to stdout and log to ensure compatibility with the
[17:14:59] with the forwarder
[17:15:04] hmm, ok, i think i get it
[17:15:07] makes sense ?
[17:15:29] The writer abstraction always used to decode json
[17:15:37] why does zmq writer get json.dumps with check_circular, but not kafka
[17:15:40] now we need to send raw as well
[17:15:40] oh...that might have been me.
[17:15:46] since I wrote the kafka one
[17:15:47] hm
[17:16:04] don't know, didn't get into that level of detail for json :)
[17:16:38] hm, yea weird, not sure why that is set at all, ori must know
[17:16:40] Now about the key thing, my idea was to be able to pass a key as parameter for raw events (not defined in the event itself)
[17:16:58] But I am not sure if it is useful
[17:17:29] hmm, i see, but that also makes it consistent with other writers
[17:17:33] i think that is a good idea
[17:17:53] but, yeah
[17:18:04] you should probably not reset the key to the schema_rev if key was provided
[17:18:22] Agreed !
[17:18:30] Will change that and resubmit
[17:18:45] We good on that, or other comments ?
[17:18:51] ok cool. do we need to use a string key at all? can we do Nil key? not sure.
[17:19:02] pretty sure kafka can use null keys
[17:19:06] not sure bout python kafka
[17:19:14] yeah, that's the thing
[17:19:27] we got errors when not encoding keys as utf8
[17:19:30] so weird
[17:19:40] hm, ok, well i guess it doesn't really matter
[17:19:56] can you add some documentation?
[17:20:02] that specifies the behavior?
[17:20:04] right now it just says
[17:20:07] """Write events to Kafka, keyed by SCID."""
[17:20:22] yup, will do !
[17:20:27] make it say that if events are not raw and key is not set, the key will be schema_rev
[17:20:29] cool
[17:20:33] aside from that +1
[17:20:33] :)
[17:20:41] cool, thx :)
[17:21:48] Analytics-Wikimetrics, MediaWiki-Vagrant: Vagrant Setup alembic config errors - https://phabricator.wikimedia.org/T99631#1296778 (Milimetric) I took a look this morning and this worked out of the box for me. The error pasted above basically means the configuration file didn't get written to /etc/wikimet...
[17:21:49] Let me do that and then get back to you about spark resources :)
[17:22:43] * milimetric goin to grab lunch
[17:23:09] k
[17:26:29] ottomata: let me know if you want help monitoring stuff while you apply the change
[17:26:50] joal: i think mostly keeping an eye out to make sure the jobs all run
[17:27:02] k, will do
[17:28:11] will get dinner first :)
[17:28:15] back in a bit
[17:29:24] mmk
[17:32:56] Analytics-Cluster: Audit hyperthreading on analytics nodes. - https://phabricator.wikimedia.org/T90640#1296812 (Ottomata) I will be starting on nodes 1028 - 1041 today.
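A minimal Python sketch of the key behaviour agreed on in the review discussion above (17:16-17:20): key by SCID only when events are parsed and no explicit key is given, and encode keys as utf-8 since unencoded keys caused errors. This is illustrative only, with a made-up function name; it is not the actual eventlogging handlers.py code.

```python
import json

# Sketch: pick the Kafka key and value for an event per the review comment.
# Not the real EventLogging writer; names and structure are assumptions.
def kafka_key_and_value(event, raw=False, key=None):
    if raw:
        # Raw events pass through untouched; the caller may still give a key.
        value = event
    else:
        # Parsed events are JSON objects; only fall back to the SCID
        # (schema_revision) key when the caller did not provide one.
        if key is None:
            key = '{schema}_{revision}'.format(**event)
        value = json.dumps(event)

    # Keys must be utf-8 encoded to avoid the producer errors mentioned above.
    encoded_key = key.encode('utf-8') if key is not None else None
    encoded_value = value.encode('utf-8') if isinstance(value, str) else value
    return encoded_key, encoded_value

# Example: a parsed event with no explicit key gets keyed by schema_revision.
print(kafka_key_and_value({'schema': 'Edit', 'revision': 11448630, 'event': {}}))
```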
[17:59:47] walking home, back shortly
[18:12:07] milimetric, do you have 10 minutes to talk about dashiki build?
[18:14:16] mforns: yes, just finished eating, to the batcave!
[18:14:28] milimetric, ok