[07:28:42] hellooooo
[08:08:35] Morning :)
[08:21:30] o/
[08:40:30] o/
[08:41:08] mforns: the cleaner did all the tables up to MediaViewer_10867062, seems to be proceeding well
[08:42:10] just a reminder to everybody that today at 15:00 UTC dbstore1002 will be stopped for maintenance
[08:48:06] mforns: even more important, I was so disappointed by Star Wars
[08:48:24] elukey, xD
[08:48:37] buuut will continue ranting in private since I don't want to spoil anything :)
[08:48:40] elukey, we can discuss today before standup if you want
[08:49:16] :)
[08:50:06] joal: I think that the hadoop beta cluster is ready :)
[08:50:32] elukey: \o/ ! or /o\ depending on how you prefer to look at it :D
[08:52:45] * mforns off for lunch!
[08:52:51] cya
[09:19:09] elukey: Can I try a few things with it?
[09:19:27] joal: sure!
[09:19:46] I pointed camus to two kafka servers in analytics that andrew told me about, not sure what data is in there
[09:19:53] anyhow, you have the following
[09:20:11] hadoop-coordinator-1.eqiad.wmnet is the an1003 equivalent
[09:20:24] err sorry
[09:20:39] hadoop-coordinator-1.analytics.eqiad.wmnet
[09:21:02] ok I definitely need a coffee
[09:21:04] hadoop-coordinator-1.analytics.eqiad.wmflabs
[09:21:11] then the hadoop workers
[09:21:21] hadoop-worker-[123].analytics.eqiad.wmflabs
[09:21:26] then the masters
[09:21:35] hadoop-master-[12].analytics.eqiad.wmflabs
[09:24:12] * elukey grabs a coffee
[09:30:59] elukey: I'm going to have a look :)
[09:35:11] !log reboot kafka1014 for kernel updates
[09:35:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:42:35] Analytics, Beta-Cluster-Infrastructure, EventBus, Recommendation-API, and 3 others: What to do with deployment-sca03? - https://phabricator.wikimedia.org/T184501#3885749 (Ladsgroup)
[10:08:07] elukey: Heya
[10:08:39] elukey: how do you want me to communicate notes about my findings?
[10:08:47] !log reboot kafka-jumbo1002 for kernel updates
[10:08:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:50] joal: sure
[10:11:06] elukey: running hive or beeline I get errors :(
[10:11:19] both send: NoClassDefFoundError: org/apache/zookeeper/ZooKeeper
[10:12:30] ah interesting! Wrong version of the zookeeper packages in there, we have an outstanding patch to fix it, lemme manually change it
[10:16:16] joal: now beeline says
[10:16:16] Error: Could not open client transport with JDBC Uri: jdbc:hive2://hadoop-coordinator-1.analytics.eqiad.wmflabs:10000: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
[10:16:21] that seems better
[10:16:40] cool, continuing my test :)
[10:16:46] as an FYI, we have to manually pin the zookeeper packages via apt
[10:17:02] since some jars are contained in the cloudera version, but not in the debian on
[10:17:05] *one
[10:17:15] checking why port 10000 is not available
[10:17:19] I heard about the different zookeeper versions the other day, yes
[10:17:43] elukey: will it be problematic to have it productionized with the cloudera one?
[10:18:05] what do you mena?
[10:18:07] *mean?
[10:18:44] meaning, looks like you had to manually patch to get the cloudera one here - will it cause problems?
[10:19:57] elukey: something else: looks like the Yarn master is master-2 currently - is it expected?
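(Editor's note: the NoClassDefFoundError above is the symptom of the zookeeper packaging mismatch elukey explains next. A minimal sketch of how one might verify it on the coordinator; the package names and jar path are assumptions based on typical CDH packaging, not taken from the log.)

```bash
# Which zookeeper build would apt install, and from which repo? The CDH
# build ships jars that the plain Debian package does not.
apt-cache policy zookeeper libzookeeper-java

# Confirm the class hive/beeline failed to load is actually present on disk
# (jar path assumed from the usual CDH layout).
ls -l /usr/lib/zookeeper/zookeeper*.jar
unzip -l /usr/lib/zookeeper/zookeeper.jar | grep 'org/apache/zookeeper/ZooKeeper.class'
```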
[10:21:11] joal: afaik (will need to check) all our zookeeper packages across production are the cloudera ones, but recently something changed in our apt repo (we uploaded our own version of zookeeper) and apt-cache policy now returns different preferences
[10:21:20] so new hosts are affected
[10:21:28] but we have a patch to roll out that fixes this
[10:21:47] Ok I get it :) Thanks for the clarification elukey :)
[10:22:24] about the master, I think it is just a matter of restarting the yarn daemon to make it flip over to -1, will do now!
[10:23:06] now the main problem seems to be that the hive server is not binding port 10000
[10:24:10] elukey: noticed that as well (neither beeline nor hive actually manages to connect)
[10:25:10] elukey@hadoop-master-2:~$ sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState hadoop-master-1-analytics-eqiad-wmflabs
[10:25:13] active
[10:25:15] there you go, one issue fixed :)
[10:25:18] elukey@hadoop-master-2:~$ sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState hadoop-master-2-analytics-eqiad-wmflabs
[10:25:21] standby
[10:25:36] elukey: was not really an issue, just wondering ;)
[10:25:52] elukey: I'm currently fighting to understand why the mapreduce example failed
[10:27:24] fixed also the namenodes
[10:27:28] elukey: relaunching the job with master-1 as master
[10:27:46] elukey: when you say fixed, you mean master-1 is active, right?
[10:29:55] yep
[10:30:02] Beeline version 1.1.0-cdh5.10.0 by Apache Hive
[10:30:02] 0: jdbc:hive2://hadoop-coordinator-1.analytic>
[10:30:48] the hive server/metastore were complaining about zookeeper as well :)
[10:30:56] ok
[10:31:06] Got an error from hive: FileNotFoundException: /etc/hive/conf.analytics-hadoop-labs/hive-site.xml (Permission denied)
[10:31:16] beeline is fine, but hive complains
[10:31:29] checking
[10:31:58] so perms are -r--r----- 1 hive hdfs 6037 Jan 8 16:39 /etc/hive/conf.analytics-hadoop-labs/hive-site.xml
[10:32:52] same as on an1003
[10:32:52] elukey@analytics1003:~$ ls -l /etc/hive/conf.analytics-hadoop/hive-site.xml
[10:32:55] -r--r----- 1 hive hdfs 6063 Apr 24 2017 /etc/hive/conf.analytics-hadoop/hive-site.xml
[10:33:14] maybe the dirs
[10:33:51] mmm they look fine
[10:34:14] I get the same error running "hive" on an1003 though
[10:35:05] there might be some user/perm trick on the analytics clients
[10:35:05] wow, weird
[10:35:53] elukey: on stat1004: -r--r--r-- 1 hive hdfs 5134 Apr 11 2017 hive-site.xml
[10:36:09] yeah
[10:36:14] was about to paste it
[10:50:14] so joal we could add the analytics_cluster client role
[10:50:20] that will probably fix the issue
[10:51:33] elukey: you mean add a new node I guess?
[10:51:45] we can add it to the coordinator one
[10:51:52] can be elukey
[10:51:54] but to be more realistic we can create a new node
[10:51:58] or even add it to a worker
[10:52:37] ah joal workers are already clients
[10:52:44] Ah ok
[10:52:57] elukey: the jobs I have started from the coord are stuck
[10:53:17] I'm going to kill them and see if it makes a difference if started from a worker
[10:53:25] let's try
[10:54:14] joal: &R_SERVICE(tcp, 10000, $ANALYTICS_NETWORKS);
[10:54:18] * elukey cries in a corner
[10:54:54] hm, not sure what it means, but I wish I could help :)
[10:55:02] ferm rules!
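(Editor's note: the &R_SERVICE line elukey pasted is the production ferm rule that opens hive-server2's port 10000 only to $ANALYTICS_NETWORKS, which does not exist in labs. A hedged sketch of the checks this exchange implies, reusing the hostnames given earlier in the log:)

```bash
# On the coordinator: is hive-server2 listening on 10000 at all?
sudo ss -tlnp | grep ':10000'

# From a client: does the connection go through, or does the firewall drop it?
nc -zv hadoop-coordinator-1.analytics.eqiad.wmflabs 10000

# Inspect the rules ferm generated for that port.
sudo iptables -L -n | grep 10000
```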
[10:55:03] I'm guessing ferm rules
[10:55:05] yeah
[10:55:27] And analytics_network is not something we have in labs I guess
[10:55:55] correct, fixing puppet asap
[11:06:29] joal: like this https://gerrit.wikimedia.org/r/#/c/403128/1
[11:07:45] makes sense elukey
[11:21:17] fixing also the meta db https://gerrit.wikimedia.org/r/#/c/403131/
[11:26:52] joal: hive (default)>
[11:26:57] from a worker node
[11:27:07] whenever you have time can you re-test?
[11:27:16] quick note: the refinery is still not deployed
[11:27:22] there is an issue with that
[11:27:32] (I think due to the fact that we are not in deployment-prep)
[11:27:38] elukey: no prob, currently testing raw stuff :)
[11:28:25] elukey: the mapreduce job failed when started from worker-1 :(
[11:28:38] elukey: actually, it didn't fail, but got stuck in ACCEPTED mode
[11:36:33] joal: a bit ignorant about it, do you have any theories?
[11:36:40] theory
[11:36:44] not yet - testing hive
[11:38:34] I might have fixed the refinery
[11:39:48] (git fat will take a while)
[11:42:14] yessssss
[11:42:17] worked joal !
[11:43:19] elukey: Yay !
[11:47:21] elukey: I think we have a conf issue for the node-managers --> no memory allocated
[11:48:11] elukey: There are vcores (3) per node-manager, but no memory (as well as no default settings for containers in mapred-site.xml)
[11:53:06] ah!
[11:53:23] maybe we are missing some hiera config
[11:54:00] can you give me an example of a key missing in mapred-site.xml ?
[11:54:13] (or a missing default)
[11:54:52] like mapreduce_map_memory_mb ?
[11:55:02] you know it elukey ;)
[11:55:09] ah there you go, it defaults to undef
[11:56:15] elukey: those are defaults for mapreduce jobs - We should also have values in the node-managers (yarn-site.xml) for yarn.nodemanager.resource.memory-mb
[11:56:45] This last one defines the memory available for the node-manager to be used by containers
[11:57:08] I think mapreduce can do without defaults, but yarn can't :)
[11:57:15] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3885934 (dr0ptp4kt) Thanks, @Shilad . @Ottomata would you be game to take another run at this? @Shilad what's your availability days+time UTC this current week...
[12:02:09] joal: running puppet now, you should be able to re-test in a bit
[12:11:19] joal: can you redo the test?
[12:11:28] I can !
[12:13:06] elukey: looks like the yarn values are not set :(
[12:13:20] elukey: in yarn-site.xml, yarn.nodemanager.resource.memory-mb
[12:13:47] elukey: This one is super needed: it tells every node-manager how much memory is available for it to run containers
[12:14:05] yeah I am stupid
[12:14:09] this is the issue :D
[12:14:20] maybe not, but could be :)
[12:14:33] I forgot to add those :)
[12:14:35] And stupid you are not :)
[12:14:45] I see no hamster so far :)
[12:14:47] do you mind if I go out for lunch and then fix this?
[12:14:49] hahahah
[12:15:07] Please go for lunch :)
[12:15:16] I'll take a break as well
[12:15:18] thanks!! see you in a bit :)
[12:15:20] * elukey lunch!
[12:15:22] Ciao
[12:17:53] later team - taking a break
[13:45:53] joal: all fixed!
[13:46:03] ready for the next test when you are back
[13:51:31] !log reboot kafka-jumbo1003 for kernel updates
[13:51:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:54:08] Heya elukey !
[13:54:12] Testiiiiiiiiiing !
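(Editor's note: a sketch of how the missing memory settings discussed above could be confirmed on a worker before re-testing; /etc/hadoop/conf is assumed to be the active config path, as is usual for CDH.)

```bash
# If this prints nothing, the NodeManagers advertise no memory and every
# submitted job sits in ACCEPTED, exactly as observed above.
grep -A1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml

# Per-job defaults (the hiera values that defaulted to undef):
grep -A1 -E 'mapreduce\.(map|reduce)\.memory\.mb' /etc/hadoop/conf/mapred-site.xml
```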
[13:57:45] elukey: from the yarn UI we have no workers :(
[14:04:16] !log reboot kafka1022 for kernel updates
[14:04:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:06:45] 2018-01-04 16:26:02,838 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
[14:06:48] org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From hadoop-worker-1/10.68.20.101 to hadoop-master-1.analytics.eqiad.wmflabs:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
[14:06:53] joal: this might be related :P --^
[14:07:10] hm, can't see why
[14:07:24] :)
[14:09:18] joal: better now?
[14:10:14] elukey: We haz wokerzz !
[14:10:35] \o/
[14:11:08] elukey: and we haz a running jobz :D
[14:11:13] ohh nice!
[14:11:21] Estimated value of Pi is 3.141200
[14:11:51] elukey: What was wrong with the nodemanagers?
[14:15:43] elukey: now reviewing the camus settings on the coordinator - I have doubts about some
[14:16:15] camus: yes those are probably wrong/prod-related, we'll need to review them
[14:16:33] elukey: no parameterization, right?
[14:16:40] nodemanagers: they were not able to automatically restart after the issue above (when the masters were not ready)
[14:17:17] joal: I applied the prod role, that basically configures camus for production (modulo the kafka cluster that is a labs one)
[14:18:04] ok makes sense elukey
[14:18:51] elukey: about the kafka nodes in there, their address is like: k3-1.analytics.eqiad.wmnet -- should be .eqiad.wmflabs, right?
[14:19:06] yes it should, fixing, thanks
[14:20:58] elukey: ok, there are a handful of hard-coded things we need to change
[14:33:12] !log reboot kafka1023 for kernel updates
[14:33:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:41:48] !log reboot kafka-jumbo1005 for kernel updates
[14:41:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:48:52] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3886647 (Ottomata) I'm game to get together for another hour or so to meet and try, but I don't really have time to take this on and see it through on my own....
[15:00:38] Gone to grab Lino a-team - Will miss standup - Later !
[15:00:43] !log reboot kafka-jumbo1006 for kernel updates
[15:00:47] o/
[15:00:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:00:56] byeeee
[15:03:32] o/
[15:06:46] ottomata: o/
[15:07:04] the analytics and jumbo clusters are rebooted with new kernels, no perf regression so far
[15:07:23] great!
[15:10:19] !log reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates
[15:10:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:21] (PS3) Mforns: [WIP] Improve WikiSelector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530)
[15:16:37] elukey: q about these monitoring enabled things
[15:16:49] i had thought it was mostly harmless to apply the monitoring stuff in labs
[15:16:53] as it didn't actually do anything?
[15:16:59] or maybe i'm thinking of ferm...
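(Editor's note on the NodeManager errors above: they failed at boot because the ResourceManager on hadoop-master-1 was not up yet, and did not retry on their own. A sketch of the recovery steps, with the service name assumed from standard CDH packaging.)

```bash
# Can the worker reach the ResourceManager's resource-tracker port now?
nc -zv hadoop-master-1.analytics.eqiad.wmflabs 8031

# Restart the NodeManager and confirm the worker registers with YARN.
sudo service hadoop-yarn-nodemanager restart
yarn node -list -all
```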
[15:19:02] ottomata: sorry I didn't get what you mean :)
[15:19:10] ottomata / elukey: I tried to look into this but I'm too sick, the dashboards are all down because of some puppet bug on dashiki-staging-01 and dashiki-01 in cloud
[15:19:15] the error points to line 20 of this file: https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/simplestatic.pp
[15:19:21] (just check the puppet log on those instances)
[15:19:46] elukey: what actually happens if e.g. monitoring::graphite_threshold is declared in labs?
[15:19:57] I don't know much about how it was initially set up other than madhu did it, and it gets the configuration for the apache hosts from hiera I think
[15:20:12] ottomata: ahhhh! Nothing afaics, we don't have any alarming set up in there (but I might be wrong)
[15:20:34] for reference, this is the deploy of dashiki: https://github.com/wikimedia/analytics-dashiki/blob/master/fabfile.py
[15:20:35] so why do we do the monitoring enabled guards?
[15:21:15] and the documentation about the hiera config is mentioned here: https://github.com/wikimedia/analytics-dashiki/blob/master/README.md#deploy
[15:21:27] milimetric: dashiki is vital signs?
[15:21:28] https://analytics.wikimedia.org/dashboards/vital-signs/#projects=eswiki,itwiki,enwiki,jawiki,dewiki,ruwiki,frwiki/metrics=Pageviews
[15:21:28] right?
[15:21:44] ottomata: no, that's hosted separately
[15:21:47] ottomata: as I said I am not sure if monitoring can be enabled, or if it will be in the future; having guards around monitoring makes sense to prevent issues in my opinion
[15:21:51] dashiki is all the custom domain dashboards like:
[15:21:55] https://flow-reportcard.wmflabs.org/
[15:22:00] ah ok
[15:22:00] http://language-reportcard.wmflabs.org/
[15:22:04] etc.
[15:22:22] (the ones listed here: https://github.com/wikimedia/analytics-dashiki/blob/master/config.yaml
[15:22:24] )
[15:22:36] milimetric: i don't think i have access to that host
[15:22:39] what project is it in?
[15:22:43] the dashiki project
[15:22:49] I'll add you if you're not on it, one sec
[15:23:22] ya need added
[15:23:36] hm ok
[15:23:42] elukey ok
[15:23:55] k, I made ottomata and elukey admins there
[15:24:10] k in
[15:24:12] thanks much, sorry I'm out of it
[15:24:49] milimetric: puppet runs fine on dashiki-01
[15:25:13] yeah, I think Andrew B did something to fix it, but what he did broke apache
[15:25:18] (hence the 502)
[15:25:31] on dashiki-staging-01 I saw that error about simplestatic.pp
[15:25:38] in tail /var/log/puppet.log
[15:25:51] i think maybe it's not those hosts, but the webproxy
[15:25:57] the 502 is from nginx
[15:26:00] ?
[15:26:01] maybe
[15:26:16] I don't know much beyond what I said above, and my brain is too foggy to help, sorry :(
[15:26:26] hm it also runs on staging-01 too
[15:26:30] maybe you saw an old error
[15:27:21] ottomata: Jan 09 15:22:37 dashiki-staging-01 apache2[11992]: AH00526: Syntax error on line 7 of /etc/apache2/sites-enabled/50-simplestatic.conf:
[15:27:51] there is an empty ServerName
[15:28:10] maybe something changed in puppet and the parameter got emptied
[15:29:23] yeah
[15:33:46] !log stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop)
[15:33:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:38:04] milimetric: all is back, andrew b just had an oopsy in his change
[15:38:05] he fixed it
[15:38:32] sweet, thx!
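(Editor's note: a sketch of the apache checks behind the dashiki fix above; the AH00526 error and vhost path are quoted from the log, the commands are stock Debian apache tooling.)

```bash
# Validate the rendered config; this surfaces the syntax error on line 7.
sudo apache2ctl configtest

# Look for the empty ServerName that puppet rendered into the vhost.
grep -n 'ServerName' /etc/apache2/sites-enabled/50-simplestatic.conf
```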
[15:46:52] (PS1) Fdans: Translate g according to the y-axis width [analytics/wikistats2] - https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138)
[16:00:36] ping fdans ottomata elukey joal
[16:01:30] nuria_: skipping it today, omw to the airport (see email :) )
[16:01:32] coming in a sec, dbstore1002 maintenance
[16:02:08] fdans: please send e-scrum so we know where we stand on deployment of the new apis
[16:07:53] Analytics, Operations, ops-eqiad, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886850 (RobH)
[16:09:36] nuria_: Sorry, just sent it
[16:12:59] Analytics, Operations, ops-eqiad, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886875 (Cmjohnson)
[16:13:47] Analytics, Operations, ops-eqiad, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3867856 (Cmjohnson) All the on-site work has been completed, production dns added and install server. @robh can you look into the partman recipe and complete the install...
[16:13:55] Analytics, Operations, ops-eqiad, Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3886877 (Cmjohnson) a:Cmjohnson>RobH
[16:35:19] Hi mobrovac, any chance you are somewhere nearby?
[16:35:58] joal: hello, yes, but in meetings for the next few hours
[16:39:02] mobrovac: I just wanted to discuss the last patch you merged from fdans
[16:39:56] mobrovac: I think the last modification you asked for and the last patch is not what we expected
[16:43:50] joal: can you comment on the PR?
[16:43:59] I can :)
[16:51:23] mobrovac: just commented, please let me know if it makes sense and how you prefer the thing to be
[16:53:35] !log Rerun pageview-druid-hourly-wf-2018-1-9-13
[16:53:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:38] ah sorry joal didn't check if I broke any jobs yet :(
[17:02:59] no worries elukey :)
[17:05:28] Analytics-Kanban, Operations, ops-eqiad: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3887025 (elukey) Maintenance done, the mgmt interface is now up and running (Chris also did a reseat of the DIMM banks). @Marostegui, @jcrespo - We (as Analytics team) would like t...
[17:09:10] nuria_: this is where our java.version is used --> https://maven.apache.org/plugins/maven-compiler-plugin/examples/set-compiler-source-and-target.html
[17:10:34] nuria_: Since java is (supposedly) backward compatible, we should have no problem running our codebase with java8
[17:11:03] joal: k
[17:11:52] joal: we have 1.6 now, right?
[17:11:56] https://www.irccloud.com/pastebin/5xL4lDpo/
[17:19:03] nuria_: we have 1.7 I think
[17:20:44] nuria_: 1.7 in refinery/pom.xml , reused in the sub-modules
[17:26:33] elukey: do you know
[17:26:39] where analytics-hadoop-labs comes from in your labs hadoop setup?
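(Editor's note: a quick sketch for the java.version exchange above, checking what refinery compiles for versus the JDK available on the hosts. That the refinery pom carries a java.version property is taken from the discussion; the exact pom layout is otherwise an assumption.)

```bash
# What source/target level does refinery build for? (1.7 per the discussion)
grep -n '<java.version>' pom.xml

# Bytecode built for 1.7 runs on a Java 8 runtime, so the upgrade should be
# safe; confirm what the hosts actually ship.
java -version
javac -version
```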
[17:26:47] i don't see where profile::hadoop::common::cluster_name is set to it
[17:27:42] something is setting $cluster_name to it
[17:28:46] I've set everything in prefix-puppet, not sure if that one was in there too
[17:29:44] joal: replied :)
[17:30:20] elukey: looked, i don't see it in horizon hiera
[17:30:24] at least, i didn't find it
[17:31:22] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887160 (Shilad) I have availability this week and next, but I think @Ottomata is right. It will be tough to do this work if it's attached to a production machi...
[17:33:57] OH elukey i'm looking at the wrong prefix DUH
[17:34:25] * elukey removes himself from the blame list of broken things
[17:34:29] :D
[17:34:35] Thanks mobrovac - I get your point - I think the detailed definition works fine - I got tricked by the fact that we store JSON in cassandra, making it not generated by AQS - I'll add a comment to the PR to keep the archives happy
[17:34:58] kk great, thnx joal
[17:39:24] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887218 (RobH) Moving the GPU between boxes is not advised. Items are warrantied to work in the server they were ordered in, so it's typically messy to move the...
[17:39:33] Analytics: Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3887219 (JAllemandou)
[17:39:37] ottomata: sorry but i just blocked your ability to move the GPU (well, Dell did ;)
[17:39:48] it can only fit in an R730 and we don't have any of those
[17:39:59] (we run the 730xd everywhere, not the normal version)
[17:40:18] rats
[17:40:23] we had to get stat1005 specifically in a chassis we don't use anywhere else (the non-xd variant) to fit the gpu =[
[17:40:24] i mean, it's partly our fault, partly ellery's :p
[17:40:36] ellery requested the thing and then quit
[17:40:38] haha
[17:41:11] i just didn't want you guys to waste time trying to figure it out once i saw it couldn't happen ;D
[17:41:18] ok cool
[17:41:19] thanks
[17:41:25] i mean there is a chance we could move it without support from dell
[17:41:28] but im not sure it'll fit
[17:41:39] ottomata: qq - shouldn't $hadoop_cluster_name = $::profile::hadoop::common::cluster_name require ::profile::hadoop::common first?
[17:41:40] my understanding is it's a daughter board to the mainboard
[17:41:42] not just a PCIe card
[17:41:46] yup
[17:41:50] oh
[17:41:52] elukey: hmm
[17:41:55] i guess? refinery requires it
[17:42:07] won't hurt though, so ok i'll add it
[17:42:18] or a comment, just for future refs
[17:42:32] i'll add it
[17:42:43] you know, that Luca guy might want to remove some code for some reason and end up breaking everything
[17:42:52] he is really the worst :D
[17:42:56] robh: i'm fairly certain we will never use it if it stays in stat1005
[17:43:10] :)
[17:43:14] likely we should move stat1005 services to another box and repurpose the r730 ;]
[17:43:19] oof
[17:43:22] really don't want to
[17:43:27] stat1005 also has tons of drives, etc.
[17:43:27] it's pretty pricey hardware to sit and not use the gpu
[17:43:34] oh stat1005 is very used
[17:43:36] just not the gpu
[17:43:41] did the chassis change make the price that much more?
[17:44:46] not compared to the r730xd iirc
[17:44:56] just the gpu cannot move into anything other than another r730xd
[17:45:02] aye
[17:45:03] so it's just not going to be used at all i suppose ;]
[17:45:36] well, i guess we could get another r730xd chassis with cheaper specs
[17:45:45] don't need all that storage to try the gpu
[17:45:46] but
[17:45:52] i think we'd need someone to justify that
[17:45:54] non xd
[17:45:56] probably research team
[17:45:57] R730
[17:46:05] ah ok
[17:46:07] whichever :)
[17:46:08] R730xd won't fit it, because dell uses annoying standards ;]
[17:46:30] dario and ellery were the ones who wanted the gpu in the first place
[17:46:43] shilad, who wants to use it, is a contractor working with the research team
[17:48:43] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887279 (Ottomata) Rats, well then. This is partially our fault for sticking this thing in a 'production' box in the first place, but we did it to save some mo...
[17:49:29] ottomata: your summary on that task seems right to me
[17:49:31] =]
[17:50:19] Analytics-Cluster, Analytics-Kanban, Analytics-Wikistats, RESTBase-API, Services (doing): Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520#3887301 (mobrovac) [PR #940](https://github.com/wikimedia/restbase/pull/940) adds `top-by-country` to the public API. I...
[17:51:05] Analytics, RESTBase-API, Services (watching): Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3887305 (mobrovac)
[17:51:06] thanks robh :)
[17:55:18] elukey: can you put eyes on this for a min? https://puppet-compiler.wmflabs.org/compiler02/9667/analytics1003.eqiad.wmnet/ i am atm baffled
[17:55:35] i've tried pulling the var out of profile hadoop common and out of cdh hadoop
[17:55:42] both show empty here
[17:57:54] OHHHHH DOH I KNOW
[17:58:03] because the camus::job define is rendering the template
[17:58:04] ok ok ok
[17:58:06] phew
[17:59:09] sorry just seen the ping :(
[18:00:17] np i got it!
[18:00:18] i think
[18:00:19] ...
[18:00:19] :)
[18:02:12] a-team: staff?
[18:02:31] ping ottomata
[18:02:39] stafffff?
[18:02:45] oh sorry thought we said we weren't doing it!
[18:15:08] joal: ok! camus running
[18:15:17] webrequest_text topic
[18:15:17] ottomata: refinery deployed
[18:15:30] ottomata: camus checker as well?
[18:15:32] ya
[18:15:35] :D
[18:15:42] elukey had actually applied the whole coordinator role, which includes camus
[18:15:52] so i just fixed the puppetization a little bit, now it's running configured by puppet
[18:39:35] ottomata: I love the requests you're sending :)
[18:40:24] ottomata: I also noticed that the camus data changed ownership like 20 minutes ago - was that expected?
[18:46:50] joal i chgrped it so we could read it
[18:47:03] Ah, nice :)
[18:49:46] ottomata: I'm having an error when starting an oozie job about the hdfs nameservice: from the conf, it is named analytics-hadoop-labs, but I get an error: java.net.UnknownHostException: analytics-hadoop-labs
[18:49:59] ottomata: would it by any chance ring a bell?
[18:55:05] ottomata: gone for dinner, back after
[18:57:16] Analytics, Operations, hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3887542 (Ottomata)
[18:57:26] hmmmm
[18:57:28] joal: yes
[18:57:32] but strange you get an error
[18:57:43] where is your oozie conf?
[18:58:35] or properties
[19:32:08] Analytics-Kanban, Patch-For-Review: Make superset more scalable - https://phabricator.wikimedia.org/T182688#3887699 (Ottomata) If we do celery workers, it will be as a different task.
[19:45:57] Analytics, EventBus, TechCom-RFC, Wikidata, Services (watching): RFC: Requirements for change propagation - https://phabricator.wikimedia.org/T102476#3887747 (daniel) This seems obsolete. Is there any interest in keeping this open and continue the RFC process here? Change propagation is an...
[19:48:31] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3887751 (BBlack) >>! In T182993#3871545, @Ottomata wrote: >> The sigalgs lists being negotiated for mutual certificate-based auth seem to i...
[19:51:45] Heya ottomata
[19:52:00] hey
[19:52:38] ottomata: I use that conf to launch: https://gist.github.com/jobar/e0eb708a463f1d1278e4db511e3cceef
[19:54:24] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887767 (dr0ptp4kt) Bummer. I think we need to stall out this task until a future date, in alignment with @ottomata 's suggestion to @shilad about it needing an...
[19:55:43] Analytics, EventBus, TechCom-RFC, Wikidata, Services (watching): RFC: Requirements for change propagation - https://phabricator.wikimedia.org/T102476#3887771 (Scott_WorldUnivAndSch) In other languages? Wikimedia's Director Katherine Maher in Wikimania 2017 mentioned potentially 7k languages i...
[19:55:57] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887772 (Shilad) That makes sense. There are plenty of other avenues I can explore without a GPU.
[20:02:32] ottomata: Have a minute for a batcave conversation related to the GPU ticket?
[20:21:50] shilad: ya gimme a bit though
[20:21:53] heads down in sumpin
[20:22:06] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887877 (dr0ptp4kt) @RobH what's the proper way for us to contact AMD for support about driver options? I wasn't sure if we needed to use a particular customer...
[20:35:51] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3887930 (RobH) @dr0ptp4kt: We don't have any contacts with AMD. You may want to ask on the ops list though, as someone may know someone (our team seems to know...
[20:39:12] ottomata: another oozie error after having dealt with hostnames: org.apache.oozie.action.ActionExecutorException: File /user/oozie/share/lib does not exist
[20:41:02] ottomata: And this is weird because the files exist on hdfs :(
[20:41:23] yargh missed your ping earlier joal
[20:41:29] with you in just a few mins i promise! :)
[20:44:56] ok joal
[20:44:57] sorry
[20:45:04] np ottomata :)
[20:45:30] I managed to deal with the hdfs namenode and resource manager, but I hit that oozie lib thing :(
[20:45:44] what was wrong with the namenode/resource manager?
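(Editor's note, before ottomata's answer below: a sketch of how the sharelib error above could be cross-checked. The oozie URL reuses the coordinator hostname from earlier in the log and assumes the default server port 11000.)

```bash
# The files joal confirmed exist on HDFS:
hdfs dfs -ls /user/oozie/share/lib

# What the oozie server itself resolves as its sharelib; if this fails while
# the listing above works, the server is looking at the wrong filesystem.
oozie admin -shareliblist -oozie http://hadoop-coordinator-1.analytics.eqiad.wmflabs:11000/oozie
```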
[20:46:15] ottomata: I think it's a DNS thing
[20:46:30] when giving hdfs://hadoop-master-1 as the namenode, it seems to work
[20:46:35] oh weird
[20:46:36] hmm
[20:46:39] same for the resource manager
[20:46:47] maybe they aren't properly configured for ha then hmmmmMm
[20:46:48] weird
[20:46:53] anyway ok
[20:50:31] joal: am trying to reproduce
[20:50:58] ottomata: I can give you a procedure, or we can batcave for a minute
[20:51:06] bc
[21:08:33] shilad: OOOoOK
[21:08:36] you still around?
[21:12:04] Analytics, Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888024 (greg) per [[Dev/Maint]]
[21:13:11] ottomata: Stepped away, but I'm back! Now's okay?
[21:33:05] ottomata: still an error because of the hdfs name: java.lang.IllegalArgumentException: Wrong FS: hdfs://analytics-hadoop-labs/tmp/hive-staging_hive_2018-01-09_21-27-56_341_8599378756918520512-1/-ext-10000, expected: hdfs://hadoop-master-1
[21:33:31] Looks like we'll have to find a way to make hdfs://analytics-hadoop-labs work :S
[21:34:35] Actually, looks like the restart did the job for HDFS as well ottomata :)
[21:35:01] joal: you restarted hdfs and now the HA FS name works/
[21:35:02] ?
[21:35:35] ottomata: I did nothing, just changed the FS name in the oozie config (back to hdfs://analytics-hadoop-labs)
[21:36:02] ottomata: still need to double check the job succeeds, but at least now the correct name works
[21:38:01] ottomata, is /var/lib/superset supposed to be on the deployment host, or some target host?
[21:38:17] target
[21:38:46] Krenair: the instance that is there is just temporary for testing
[21:38:50] puppet dev
[21:38:54] it won't survice long
[21:39:00] survive
[21:39:00] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3888118 (dr0ptp4kt) Thanks!
[21:39:10] ottomata, -eventlogging04?
[21:41:11] superset is there?
[21:41:13] ???
[21:41:44] looks like it yep
[21:42:20] is it not supposed to be?
[21:47:07] ottomata?
[21:47:24] Analytics, Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888157 (Krenair) I ran the exact same command that puppet does (as the user specified in the puppet file), and it appea...
[21:49:39] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3888162 (Ottomata) OO I have done some [[ https://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html#DisabledAl...
[21:50:28] Krenair: no
[21:50:31] no idea why it would be...
[21:51:18] hm, ok
[21:51:46] looks like someone added profile::superset to the instance's roles list in horizon puppet data
[21:52:15] weird, did I? is it possible I did that accidentally? we don't run druid in deployment-prep, dunno why i would...
[21:52:22] maybe i had the wrong tab open?
[21:53:07] it's possible
[21:53:27] Krenair: removing.
[21:54:50] unlike when it's done through git or the wikitech page, we have no record of changes being done to puppet config in horizon. feel free to remove it
[21:55:14] wonder if we should have a way to do that
[21:56:47] Analytics, Beta-Cluster-Infrastructure, Puppet: Puppet broken on deployment-eventlogging04 due to missing directory '/var/lib/superset'? - https://phabricator.wikimedia.org/T184238#3888234 (Krenair) Open>Resolved
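(Editor's note, closing out the hdfs://analytics-hadoop-labs thread: the HA nameservice is a logical name, not a DNS entry, so the UnknownHostException and Wrong FS errors above usually mean a client config is missing the HA keys. A sketch of the checks, using standard HDFS tooling.)

```bash
# The logical nameservice the client config advertises
# (should print analytics-hadoop-labs):
hdfs getconf -confKey dfs.nameservices

# The physical namenodes behind it; the HA client config is in place if
# these resolve and one of them is active:
hdfs getconf -namenodes

# A lookup through the logical name, the same way oozie and hive do it:
hdfs dfs -ls hdfs://analytics-hadoop-labs/tmp
```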