[00:54:17] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#2783668 (10srishakatux) Hi, Is there a consensus on the best way forward for this task? If so,...
[02:38:56] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3212814 (10Ottomata) I think we should decline this task, as the information is already availa...
[07:42:21] hello people
[07:42:26] created https://etherpad.wikimedia.org/p/analytics-row-d-maintenance
[08:08:05] so a couple of things to remember
[08:08:36] 1) we'll lose for some minutes two aqs nodes, druid1003 and 4 worker nodes (no journal node in there)
[08:09:55] 2) we'll lose for potentially a working day 8 worker nodes (1 journal node among them), the hadoop master standby, one zookeeper node and (potentially) two kafka nodes (if we don't manage to move one of them ahead of time)
[08:10:09] not the best day since I've been working for WMF :D
[08:11:22] My idea is to start now draining worker nodes, not disabling puppet but just masking the systemd unit for the hadoop daemons (to prevent them from starting)
[08:13:07] (03CR) 10DCausse: [C: 031] Support Wiki Abbreviation for Czech (cs vs cz) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/350247 (owner: 10Tjones)
[08:31:33] all right starting with the node drain
[08:41:44] sorry Hadoop cluster, I am shutting down 13 nodes
[08:42:02] :)
[08:42:13] o/
[08:44:04] (and now that we have debian everywhere I can just simply mask systemd units :D)
[08:49:18] elukey: I'd love it if you could explain more how this works (I am so ignorant about how S-d works )
[08:53:44] ah yes! I am not a super expert either but this is what systemd is doing to nodemanager atm
[08:53:47] "Created symlink from /etc/systemd/system/hadoop-yarn-nodemanager.service to /dev/null."
[08:54:28] so if you try to start hadoop-yarn-nodemanager, then systemd will know that the unit is masked, and it will not do anything
[08:54:34] (stop/restart/etc..)
[08:55:29] (03PS1) 10Joal: Add dty.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/350376
[08:55:57] clever elukey !
[08:56:30] so no disabling puppet etc..
[08:59:37] ok so I stopped the nodemanagers, I'll start in a bit with the datanodes
[09:31:41] shutting down and masking hdfs datanode daemons
[09:31:51] oozie is strangely quiet
[09:40:24] ah there you go, got one email
[09:49:15] restarting the jobs
[09:51:49] !log restart aqs-hourly-wf-2017-4-26-8 (failed because an1036's hdfs daemon went down for maintenance)
[09:51:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:53:28] !log restart mediacounts-load-wf-2017-4-26-7 (failed due to maintenance on the hadoop cluster)
[09:53:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:59:07] joal: I am wondering (paranoid mode on) if aqs100[69] with network down at the same time could cause any cassandra blips
[09:59:29] probably some writes taking local_quorum could fail
[09:59:49] or even local_one directed to the instances on aqs100[69]?
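For reference, a minimal sketch of the systemd masking trick elukey describes above. The nodemanager unit name is the one quoted in the log; the hadoop-hdfs-datanode unit name is an assumption about how the datanode service is named on the workers.

    # Drain a worker without disabling puppet: masking points the unit at
    # /dev/null, so neither puppet nor a manual "systemctl start" can bring
    # the daemon back until it is unmasked.
    systemctl stop hadoop-yarn-nodemanager.service
    systemctl mask hadoop-yarn-nodemanager.service    # "Created symlink ... to /dev/null."

    systemctl stop hadoop-hdfs-datanode.service       # unit name assumed
    systemctl mask hadoop-hdfs-datanode.service

    # After maintenance, reverse it:
    systemctl unmask hadoop-yarn-nodemanager.service hadoop-hdfs-datanode.service
    systemctl start hadoop-yarn-nodemanager.service hadoop-hdfs-datanode.service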
[09:59:58] we'll have potentially 4 instances down
[10:00:01] for some minutes
[10:10:25] ahahaha there you go oozie
[10:10:32] der complainer
[10:10:49] restarting them
[10:12:25] !log restarted webrequest-load-(text|upload|misc|maps) failed jobs (Hadoop workers maintenance)
[10:12:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:06] mmm again?
[10:22:24] no bueno
[10:25:12] Could not obtain block: BP-1552854784-10.64.21.110-1405114489661:blk_1195819595_122150785 file=/wmf/refinery/2017-03-21T06.01.11Z--scap_sync_2017-03-21_0001-dirty/oozie/util/hive/partition/add/workflow.xml
[10:26:04] so oozie can't read the workflows
[10:26:33] now I am wondering if we have hit the doomsday scenario of too many hdfs workers down
[10:29:27] I know oozie, 13 nodes is a bit too much for you, this time you are right to complain
[10:30:23] the error is in all of them: E0710: Could not read the workflow definition, Could not obtain block: BP-1552854784-10.64.21.110-1405114489661:blk_1195819595_122150785
[10:30:46] I thought that 10.64.21.110 was a worker node but apparently it isn't
[10:32:26] 10.64.21.0/24 is the analytics-b subnet and 110 is not allocated..
[10:34:06] so on 1001 I can see a lot of
[10:34:07] 2017-04-26 10:33:25,504 INFO BlockStateChange: BLOCK* ask 10.64.21.108:50010 to replicate blk_1189370597_115701744 to datanode(s) 10.64.21.112:50010 10.64.21.115:50010
[10:37:13] so these errors might just have been due to the hdfs cluster being changed
[10:37:36] trying to re-run maps to see
[10:38:57] nope, same error
[10:44:12] very weird, from stat1004 I can run
[10:44:13] sudo -u hdfs hdfs dfs -cat /wmf/refinery/2017-02-09T12.42.58Z--scap_sync_2017-02-09_0001/oozie/webrequest/load/workflow.xml
[10:48:09] tried with another file, No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
[10:48:12] sigh
[10:48:16] joal: --^
[10:52:29] ahhhhhh now I got it!
[10:53:26] I think I knocked down hosts in three racks
[10:54:34] maybe by mistake though, re-checking
[10:57:10] so theoretically no, since I've used analytics10[35-45,67,68].eqiad.wmnet
[10:57:23] that should be all in D2 and D4
[10:57:23] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[10:57:24] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[10:57:41] nono stashbot thanks for the help but that's not it
[11:01:02] all right I think there are two options:
[11:01:23] 1) more nodes than necessary have been shut down by me causing this, but it is easy to check
[11:01:44] 2) for some reason the net-topology that we think we have on hadoop is not the configured one
[11:03:46] checked -printTopology on an1028 and analytics1067.eqiad.wmnet seems to be in the "default rack"
[11:04:11] let's see if bringing it up works
[11:19:19] tried also sudo -u hdfs hdfs dfsadmin -report and I can see missing blocks
[11:28:26] If net.topology.script.file.name or net.topology.node.switch.mapping.impl is not set, the rack id ‘/default-rack’ is returned for any passed IP address. While this behavior appears desirable, it can cause issues with HDFS block replication as default behavior is to write one replicated block off rack and is unable to do so as there is only a single rack named ‘/default-rack’.
[11:29:01] * elukey cries in a corner
[11:30:32] so if this is the case, bringing up 1068 should fix it?
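A possible way to dig into the "Could not obtain block" errors above is to ask the namenode where the replicas of the unreadable file live, and to get a cluster-wide view of missing blocks; the file path is the one from the oozie error, and running these from a Hadoop client host with the hdfs user is an assumption about how the cluster is usually queried.

    # Where do the replicas of the failing workflow.xml live?
    sudo -u hdfs hdfs fsck \
      /wmf/refinery/2017-03-21T06.01.11Z--scap_sync_2017-03-21_0001-dirty/oozie/util/hive/partition/add/workflow.xml \
      -files -blocks -locations

    # Cluster-wide view of missing / corrupt / under-replicated blocks
    sudo -u hdfs hdfs dfsadmin -report | grep -iE 'missing|corrupt|under replicated'
    sudo -u hdfs hdfs fsck / | tail -n 30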
[11:30:35] * elukey tries
[11:47:24] all right let's try to bring up D4 then
[11:47:24] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[11:49:09] Missing blocks 9
[11:49:12] err 0 now
[11:51:49] now the problem seems gone
[11:54:29] So at the moment the only daemons that are stopped are the ones in the rack that will go under extensive maintenance
[11:54:55] I think that our topology is not right and this screws up replication for some reason
[11:59:20] a problem now might be Blocks with corrupt replicas: 4
[11:59:35] that will probably need some attention
[12:01:50] all right sending an email to the team as a recap
[12:08:09] going to eat something very quickly, will be back in a few
[12:08:11] * elukey lunch
[12:12:33] Hi elukey
[12:12:43] I'm sorry I wasn't here when it broke
[12:13:02] I think the scenario we're experiencing is the one I described to ottomata (and you? I can't recall)
[12:15:00] Meaning: there are HDFS blocks whose replicas all live on dead machines
[12:16:02] What I suggest for now is: get half of the dead nodes back up, and pray that HDFS will recover
[12:16:44] We could even bring all the dead nodes back up, let HDFS recover, and then remove them gently two by two, letting HDFS recover replication
[12:20:07] joal: HDFS is already recovered, there seem to be only 4 blocks listed as "corrupted" but I think it will be resolved by hdfs itself or a fsck
[12:20:18] elukey: it should
[12:20:28] elukey: did my explanation make sense?
[12:21:38] joal: yep I agree, I solved the issue bringing back up some selected nodes (like one rack at a time), but it shouldn't have happened
[12:21:52] why shouldn't it have happened
[12:21:53] ?
[12:22:34] because of rack awareness, I have only shut down two racks
[12:22:44] hm
[12:22:55] but when I checked sudo -u hdfs hdfs dfsadmin -printTopology it seems that we are also using the default rack
[12:22:58] for new noes
[12:23:00] *nodes
[12:23:10] default racks?
[12:23:23] meaning not the ones defined in puppet?
[12:24:10] IIRC Andrew told me that for new nodes the net-topology.py script was generated after the hdfs daemon was up (or something similar), so by default if you don't specify anything you end up in the default rack
[12:24:21] elukey: makes sense
[12:24:38] this mess might be resolved by a simple HDFS masternode restart
[12:24:46] elukey: Then we should have restarted the namenode after putting the new nodes up and checked the topology files
[12:24:52] yes
[12:24:57] but I didn't want to do the cowboy move before you or Andrew were online :D
[12:25:25] elukey: I think I don't know anybody taking as good care of our infra as you do :)
[12:25:29] IIRC Andrew checked and said it was ok, but we have probably missed something
[12:25:47] joal: thanks :)
[12:25:48] elukey: I think the "default rack" might be it
[12:26:07] I found this horrible thing in the internetz
[12:26:08] elukey: I'd suggest forcing an hdfs repair (you probably already started that)
[12:26:10] If net.topology.script.file.name or net.topology.node.switch.mapping.impl is not set, the rack id ‘/default-rack’ is returned for any passed IP address. While this behavior appears desirable, it can cause issues with HDFS block replication as default behavior is to write one replicated block off rack and is unable to do so as there is only a single rack named ‘/default-rack’.
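For context on the quote above: the script named by net.topology.script.file.name is handed one or more IPs/hostnames as arguments and must print one rack path per argument on stdout; anything it doesn't recognize ends up in /default-rack, which is exactly the failure mode being discussed. The real script here is a puppet-generated net-topology.py; the sketch below is a hypothetical minimal bash equivalent, and the host-to-rack pairs in it are made up for illustration (a real script would also have to match the IP addresses the namenode actually passes in).

    #!/bin/bash
    # Hypothetical minimal rack-mapping script (illustrative assignments only).
    for host in "$@"; do
      case "$host" in
        analytics1035.eqiad.wmnet|analytics1036.eqiad.wmnet) echo "/eqiad/D/2" ;;
        analytics1042.eqiad.wmnet|analytics1043.eqiad.wmnet) echo "/eqiad/D/4" ;;
        *)                                                   echo "/default-rack" ;;
      esac
    done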
[12:26:27] ah no sorry scratch that
[12:26:31] this might not be it
[12:26:56] elukey: I think data is replicated, but INSIDE default-rack instead of across multiple racks
[12:27:12] ahhhhh this could make sense!
[12:27:21] so between the nodes inside the default rack
[12:27:29] what a horrible thing
[12:27:32] :D
[12:27:47] joal: about the hdfs repair, do you mean the fsck?
[12:27:51] correct
[12:27:59] I didn't want to start anything before double checking with you first
[12:28:04] but that was my next move
[12:28:37] elukey: Here is my suggestion: stop the HDFS balancer (an1003???) - to prevent moving stuff around while operations happen
[12:29:19] Then let's restart the namenode (using failover)
[12:29:25] we also have two more hours before the maintenance starts, FYI :)
[12:29:33] In order to force better rack awareness
[12:30:06] Then let's do fsck, hoping for every block to be available at least once
[12:30:22] and via the fsck get them replicated in other racks
[12:31:04] Makes sense?
[12:34:18] joal: maybe I can bring up all the hdfs daemons first
[12:34:28] justtobesure TM
[12:35:28] at the moment all the blocks are available but the more replicas probably the better (cassandra docet)
[12:39:03] just suspended webrequest-load
[12:39:57] elukey: Had not thought about that
[12:40:09] So, let's bring back all daemons
[12:40:32] elukey: maybe batcave could make it easier for synchro?
[12:41:05] sure!
[12:42:41] I need to install mysql security updates on bohrium/piwik, ok to do that now?
[12:43:48] moritzm: can we do it tomorrow? :)
[12:44:00] sure, that works as well
[12:50:57] thanks a lot!
[13:05:48] joal: you comin?
[13:05:50] joal? live systems?
[13:05:52] :DDDDD
[13:06:25] Hey guys, I'm with elukey on HDFS operations issues - will skip until fixed
[13:06:31] milimetric, halfak --^
[13:06:39] no prob
[13:14:27] ottomata: o/
[13:14:30] you there by any chance?
[13:14:57] hiii ya
[13:14:59] i saw your email
[13:15:02] bout blocks
[13:15:04] morningzzzz
[13:15:13] still looking through other emails, makin coffeeeeeee
[13:15:19] sureeee
[13:15:23] there are a couple of things
[13:15:24] how goes?
[13:16:02] not really great, I turned off the hdfs datanode on D2 and D4 nodes, which ended up with blocks not being readable
[13:16:02] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[13:16:04] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[13:16:21] -printTopology shows most of the new nodes in the default rack :(
[13:16:32] so I am running fsck / on hdfs with Joseph in the cave
[13:16:40] and all the nodes have been brought up
[13:16:54] the plan is to restart the master nodes to see if we can get the correct topology
[13:17:06] because there is something weird ongoing when we turn off datanodes
[13:17:17] and to add a bit of spiciness to the day
[13:17:29] we'd need to move kafka1020 to row-b :)
[13:19:23] ottomata: --^
[13:20:32] ah! wait, we checked topology though
[13:20:33] gah
[13:20:43] ok, hopefully the master restart will help
[13:21:42] elukey: oh good! there is row b ipv6
[13:21:43] great
[13:21:45] it wasn't in dns
[13:21:49] so chris and i assumed there wasn't
[13:22:14] ahhh okok! super
[13:22:40] will chris still have time for us?
[13:22:46] have you talked to him already?
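A rough CLI sketch of the recovery plan joal lays out above (stop the balancer, restart the namenodes via failover, then fsck). The haadmin service IDs, the namenode unit name, and the idea that "stopping the balancer" means disabling puppet on an1003 are all assumptions; check hdfs-site.xml and the actual services before copying any of this.

    # 1) Keep the balancer quiet while blocks move around (assumed: puppet on
    #    an1003 is what keeps starting it).
    sudo puppet agent --disable "hdfs maintenance - balancer paused"

    # 2) Restart the namenodes one at a time with a failover, so the freshly
    #    restarted one re-reads the rack topology (service IDs are hypothetical).
    sudo -u hdfs hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
    sudo -u hdfs hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet
    sudo service hadoop-hdfs-namenode restart   # on the now-standby master; unit name assumed

    # 3) Check that every block still has at least one live replica, then let
    #    HDFS re-replicate across the (now correct) racks.
    sudo -u hdfs hdfs fsck / -blocks | tail -n 40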
[13:23:40] just pinged him on security but no answer :(
[13:24:34] he wanted to do it an hour ago if we were gonna
[13:24:46] i was going to prep and get up for it, but then we decided not to because of the IPv6 thing
[13:25:08] argh
[13:25:15] I didn't know, sorry
[13:27:02] ok now topology is good!
[13:27:46] phew
[13:27:53] ok, so elukey i must have been looking at the wrong thing
[13:27:55] i did not use the CLI
[13:28:11] so, topology is good now
[13:28:22] elukey: but downtime isn't over, right?
[13:28:40] cause, just having it right now isn't going to save us...we need hdfs to rebalance the blocks
[13:29:36] this is a good point
[13:30:41] ottomata: do you have a minute for a batcave?
[13:31:43] yes
[13:51:56] halfak, milimetric: Quickly checked th
[13:52:15] the etherpad, sorry for having missed the thing
[13:52:16] joal: we're done, both running to other meetings
[13:52:19] np
[13:52:24] joal: we can catch up after my meeting
[13:52:42] milimetric, halfak, I'm kinda aware of those things going on, let's see how we can recombine :)
[14:01:10] joal, I'm in the meeting, but give me 2 minutes please :]
[14:04:55] fdans, yt?
[14:04:58] fdans: Heya !
[14:05:20] if you want to come to the WS2.0 metrics meeting, we're there :]
[14:05:32] sorry got distracted, omw
[14:32:12] 10Analytics, 10Analytics-Cluster: Monitor hdfs-balancer - https://phabricator.wikimedia.org/T163907#3213994 (10Ottomata)
[14:33:37] 10Analytics, 10Analytics-Cluster: Monitor HDFS blocks problems - https://phabricator.wikimedia.org/T163908#3214017 (10Ottomata)
[14:36:03] 10Analytics, 10Analytics-Cluster: Monitor that no worker nodes are in the default rack in net topology - https://phabricator.wikimedia.org/T163909#3214058 (10Ottomata)
[14:51:36] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3214112 (10Nuria) Declining task, this information is already available on DataLake, at this t...
[14:51:40] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3214114 (10Nuria) 05Open>03declined
[14:53:41] (03CR) 10Nuria: Add dty.wikipedia to pageview whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/350376 (owner: 10Joal)
[14:54:32] 10Analytics, 10Analytics-EventLogging, 06Editing-Analysis, 10Wikimedia-Hackathon-2017, 07Easy: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3214116 (10Ottomata) It is also available in eventbus data, which will could soon be query-abl...
[15:02:11] 06Analytics-Kanban, 10Analytics-Wikistats: Initial FE code for Wikistats 2.0. Dashboard skeleton - https://phabricator.wikimedia.org/T163814#3214142 (10Nuria)
[15:02:39] 06Analytics-Kanban, 10Analytics-Wikistats: Implement pageviews and unique devices detail pages in Wikistats UI - https://phabricator.wikimedia.org/T163817#3214143 (10Nuria)
[15:04:57] 10Analytics-Dashiki, 06Analytics-Kanban: Change default timeline for browser reports to be recent (not 2015) - https://phabricator.wikimedia.org/T160796#3111134 (10Milimetric) a:03Milimetric
[15:21:18] (03PS2) 10Joal: Add dty.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/350376
[15:27:16] 10Analytics: Add alarm for hdfs balancer not being able to run - https://phabricator.wikimedia.org/T163913#3214309 (10Nuria)
[15:44:44] joal: as FYI, I am stopping daemons on analytics[1035-1037,1043-1045,1067-1068].eqiad.wmnet and 1002
[15:44:55] (the - means from ... to)
[15:45:02] (03CR) 10Nuria: [V: 032 C: 032] Add dty.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/350376 (owner: 10Joal)
[15:48:21] neural networks for neural networks: https://openreview.net/forum?id=r1Ue8Hcxg&noteId=r1Ue8Hcxg -- Singularity is COMINGGGGGGG !
[15:55:53] will the singularity need ops?
[15:56:19] I can see a real-case scenario of the singularity trying to kill the world and then freezing because of a failed disk
[15:56:22] ahahahha
[16:02:01] elukey: Singularity will never survive angry-ops
[16:02:04] :D
[16:03:01] https://xkcd.com/705/ is old but good
[16:03:32] *gold
[16:53:54] elukey: xkcd knows it ALL
[16:57:26] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:57:47] hm
[16:57:55] there seems to be a problem with downtimes
[16:58:00] PROBLEM - Hadoop NodeManager on analytics1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:58:02] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:58:10] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:58:12] uhhhhh
[17:27:52] elukey: webrequest jobs are still suspended - Is that expected?
[17:28:34] joal: yes I stopped them a while ago to avoid any issue, since maintenance is almost over I'd wait a sec
[17:30:23] elukey: no problem, was just double checking, since it's been suspended for a long time :)
[17:30:53] I know but it is a bit of a special use case :)
[17:31:04] :)
[17:59:19] 06Analytics-Kanban: Check how pivot updates schema (or maybe make schema explicit on pivot) - https://phabricator.wikimedia.org/T163697#3215029 (10JAllemandou) Looks like upgrading to newer pivot should help on that - Waiting.
[17:59:46] 10Analytics: upgrade druid to 0.9.2 - https://phabricator.wikimedia.org/T157977#3215031 (10JAllemandou)
[17:59:48] 06Analytics-Kanban: Check how pivot updates schema (or maybe make schema explicit on pivot) - https://phabricator.wikimedia.org/T163697#3215030 (10JAllemandou)
[18:00:15] 10Analytics: upgrade druid and pivot - https://phabricator.wikimedia.org/T157977#3022179 (10JAllemandou)
[18:00:50] 10Analytics: upgrade druid and pivot - https://phabricator.wikimedia.org/T157977#3022179 (10JAllemandou) Seems ready now !!!! Let's go @Ottomata and @elukey ?
[18:07:15] joal: do you want me to resume oozie jobs?
[18:07:22] it seems that we'll have to wait a bit more
[18:07:40] elukey: as you think fits best
[18:07:48] elukey: in meeting right now, can't really talk
[18:17:05] the good news is that we have network issues now
[18:17:08] on the hosts that have moved :D
[18:20:41] elukey: great!
[18:23:36] msg milimetric
[18:26:01] elukey: :(
[18:26:26] joal: can you ssh to an1002 or an1035 or kafka1018 and see if it works?
[18:26:31] (if you have time)
[18:28:47] elukey: an1002 not good
[18:29:03] nor is an1035
[18:30:28] no bueno
[18:33:03] ottomata: debrief + feedback now?
[18:37:03] yarggh this internet
[18:37:22] elukey: i haven't been following
[18:37:22] ottomata: batcave?
[18:37:24] sure
[18:37:42] what's up with networking/nodes?
[18:37:53] sorry, moving to mw-sec for that discussion
[18:38:09] ottomata: can't ssh to them, ipv6 not working.. seems to be an old issue with igmp snooping affecting connectivity, talking with Arzhel atm
[18:40:51] elukey: just for D2 or for other racks as well?
[18:40:52] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[18:41:36] ottomata: I think it is only D2 :(
[18:43:28] ok
[18:43:56] elukey: could we restart workers in d4?
[18:44:11] joal: should be ok yes
[18:44:13] And possibly also resume the oozie flows?
[18:44:25] ottomata: anything against it?
[18:44:27] I'll take care of the oozie, tell me when ready ;)
[18:45:12] joal: currently checking and enabling D4 hosts (4 in total)
[18:45:13] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[18:46:20] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:46:56] hello 1038
[18:47:40] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:49:10] RECOVERY - Hadoop NodeManager on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:50:19] elukey: D4 coming up then? COooOoL
[18:50:20] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[18:50:33] ottomata: I am unmasking the yarn units!
[18:50:34] oh d4, right
[18:50:36] coool
[18:50:43] D4 up and running :)
[18:50:50] shall we re-enable oozie?
[18:50:52] great, so ya
[18:50:53] let's do that
[18:51:09] fingers crossed the workflow xml are balanced :)
[18:51:09] Thanks elukey
[18:51:10] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[18:51:20] / replicated
[18:51:28] elukey: shall I go for it?
[18:51:37] !log resumed oozie the complainer on Hue
[18:51:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:51:40] :D
[18:51:40] ok
[18:52:16] elukey: don't forget the trick: https://hue.wikimedia.org/oozie/list_oozie_workflows/
[18:52:26] There still are some suspended in there ;)
[18:53:06] I forgot the trick then.. why do we have suspended in there??? :O
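The "trick" discussed above (resuming a coordinator can leave individual workflows suspended) can also be handled from the Oozie CLI rather than Hue; a sketch below, where the Oozie endpoint and the workflow id are placeholders, not values from the log.

    OOZIE_URL=http://analytics1003.eqiad.wmnet:11000/oozie   # hypothetical endpoint

    # List workflows that stayed SUSPENDED after their coordinator was resumed
    oozie jobs -oozie "$OOZIE_URL" -jobtype wf -filter status=SUSPENDED -len 100

    # Resume each suspended workflow id from the listing
    oozie job -oozie "$OOZIE_URL" -resume 0001234-170426000000000-oozie-oozi-W   # placeholder id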
[18:53:50] (resumed)
[18:54:17] elukey: my understanding is that coordinator-resuming doesn't resume all workflows
[18:54:23] elukey: but I have no proof
[18:54:59] 10Analytics-Dashiki, 06Analytics-Kanban, 13Patch-For-Review: annotations should show on tab layout - https://phabricator.wikimedia.org/T162482#3215243 (10mforns) Improved the docs on Dashiki config: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki#On-wiki_configuration
[18:55:39] team, if interested, look at improvements to Dashiki configuration docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki#On-wiki_configuration
[18:55:54] joal: I'll trust your wisdom and I'll try to remember that :)
[18:56:18] elukey: you shouldn't do that ;)
[18:56:25] let's create a task to investigate
[18:57:50] elukey: https://phabricator.wikimedia.org/T163933
[18:57:53] :)
[18:58:25] ottomata: going afk for a bit now, not sure what Arzhel found but I don't see a lot of updates, hope that it will not be a horrible switch bug :)
[18:58:34] a-team: leaving for dinner, will pass by after to double check hadoop and the networking issue
[18:58:40] me too
[18:58:44] ok
[18:58:54] i'll be here for another couple of hours
[18:59:54] bye team! see you tomorrow :]
[19:01:15] milimetric: yt?
[19:02:19] hey nuria
[19:03:57] milimetric: I am about ready to push the change for the aqs refactor in dashiki (sorry, refactoring the tests took so long), since with the node upgrade the karma tests output has changed and we needed to add special config to display console logging
[19:04:31] oh weird, I didn't know that
[19:04:51] cool, I'll review when you push
[19:17:03] (03CR) 10EBernhardson: [C: 032] Support Wiki Abbreviation for Czech (cs vs cz) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/350247 (owner: 10Tjones)
[19:22:16] (03Merged) 10jenkins-bot: Support Wiki Abbreviation for Czech (cs vs cz) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/350247 (owner: 10Tjones)
[19:42:22] ottomata: is it fine if I re-enable an1002's daemons and kafka?
[19:42:28] (one at a time)
[19:45:45] ok bringing up 1002
[19:49:05] mmm some issues with hadoop-hdfs-zkfc-init
[19:50:25] in the meantime, restarted 1018
[19:50:28] (kafka)
[19:52:38] elukey: sorry
[19:52:39] cool
[19:52:42] ya looks good
[19:52:54] sorry i shoulda done some of that...was in my scala code and didn't look up very hard :)
[19:53:03] i'll look at zkfc init...that's weird
[19:54:05] ottomata: no issue, I thought I'd start, to speed up the recovery :)
[19:54:10] yeah
[19:54:26] elukey: i just ran puppet, and zkfc init didn't happen, which makes sense...i think it shouldn't
[19:54:28] looking at puppet
[19:54:34] weird
[19:54:59] looking at it in puppet I might have seen /bin/echo N | /usr/bin/hdfs zkfc -formatZK failing
[19:55:04] that is expected no?
[19:55:53] yeah, it should fail, but it shouldn't run
[19:56:01] # Don't attempt to run this command if the znode already exists.
[19:56:01] unless => "/usr/lib/zookeeper/bin/zkCli.sh \
[19:56:01] -server ${zookeeper_hosts_string} \
[19:56:01] stat /hadoop-ha/${::cdh::hadoop::cluster_name} 2>&1 \
[19:56:01] | /bin/grep -q ctime",
[19:56:13] i suppose that command was failing
[19:56:18] talking to zk
[19:56:21] but i'm not sure why it would
[19:56:44] weird
[19:56:50] but now everything works fine?
[19:58:37] ya
[19:58:45] well, puppet isn't failing or trying to run that command
[19:58:49] because the unless is succeeding
[19:59:37] (in the meantime, re-activated yarn and hdfs on the D2 workers)
[19:59:37] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[19:59:40] elukey: does that mean D2 is going down again tomorrow?
[19:59:49] https://phabricator.wikimedia.org/T148506#3215394
[20:00:49] ah probably, yes
[20:00:57] but it should be ok
[20:10:38] hm we have to do the same thing again then right?
[20:10:41] downtime will be shorter
[20:11:00] yeah, just asked Arzhel to be sure
[20:11:06] anyhow, an1042 looks weird
[20:11:10] the datanode emits java.io.IOException: Not ready to serve the block pool, BP-1552854784-10.64.21.110-1405114489661.
[20:11:22] tried to restart but it didn't resolve it
[20:11:41] ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: analytics1042.eqiad.wmnet:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.64.53.22:54718 dst: /10.64.53.22:50010
[20:11:45] hmm
[20:11:57] maybe only temporary
[20:12:41] hmm
[20:12:50] elukey: maybe stop nodemanager on that node for a minute
[20:13:13] i'm guessing that is yarn trying to write
[20:13:19] but the hdfs datanode there is not ready?
[20:13:22] not sure though
[20:17:11] ah 1042 seems ok now
[20:24:32] gahhhhh
[20:24:33] internet
[20:25:27] elukey: an42 dn logs look a little better, ya?
[20:28:10] ottomata: ahhh I may have an explanation for zk init
[20:28:32] conf1003 is still down and /usr/lib/zookeeper/bin/zkCli.sh -server conf1003.eqiad.wmnet:2181 stat /hadoop-ha/analytics-hadoop | /bin/grep -q ctime returns an exception
[20:28:48] it should have all servers listed though
[20:29:24] /usr/lib/zookeeper/bin/zkCli.sh -server conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet:2181 stat /hadoop-ha/analytics-hadoop | /bin/grep -q ctime
[20:29:25] works
[20:29:30] yep yep, I am only adding the one weird thing
[20:29:33] elukey: maybe it just picks one randomly and fails
[20:29:40] (I just started conf1003 though)
[20:29:46] oh
[20:29:47] ok :)
[20:30:21] anyhow, it is something to fix :)
[20:30:35] the echo N protects us but.. :D
[20:31:36] running puppet on an1001 and an1002
[20:32:57] ottomata: everything looks good
[20:33:12] kafka1018 is still recovering, I didn't start 1020 yet
[20:35:15] ok
[20:36:10] super, going offline!
[20:37:10] cool, later! i have to run too, elukey, will check back on it tonight
[20:37:10] Arzhel said that D2 hosts will not be touched anymore, tomorrow I'll try to figure out which hosts will be impacted :)
[20:37:11] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[20:37:16] ooook!
[20:37:17] ttl!
[20:38:02] ottomata: still there?
[20:38:11] shall we enable puppet to run the balancer?
[20:38:53] (will wait until tomorrow, but puppet on an1003 is still disabled)
[20:38:59] * elukey off!
[20:39:16] a-team - the cluster is good now and kafka is "only" one node down, we should be ok :)
[20:41:51] you rock!
[20:47:36] * joal claps for an-ops team !
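As a closing note, a sketch of the end-of-day sanity checks used or implied throughout this log, to confirm HDFS and Kafka are healthy again after the maintenance; the Kafka zookeeper string and chroot are assumptions, and the kafka-topics invocation may differ from the local wrapper.

    # HDFS: no missing/corrupt blocks, and no nodes left in /default-rack
    sudo -u hdfs hdfs dfsadmin -report | grep -iE 'missing|corrupt|under replicated'
    sudo -u hdfs hdfs dfsadmin -printTopology | grep -A2 default-rack

    # Overall filesystem health
    sudo -u hdfs hdfs fsck / | tail -n 20

    # Kafka: with kafka1018 rejoining, under-replicated partitions should drain to 0
    kafka-topics.sh --zookeeper conf1001.eqiad.wmnet:2181/kafka/eqiad \
      --describe --under-replicated-partitions   # hypothetical ZK chroot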