[08:24:26] joal: Hi! We'd need to reboot stat100[234] for kernel upgrades
[08:25:06] what is usually the procedure? email to analytics@ one day in advance?
[08:26:08] in the past I mailed analytics@ in addition to a heads-up on IRC, since I wasn't sure everyone follows IRC
[08:28:34] good! Will send an email soon then, and reboot tomorrow morning
[08:30:55] ok, nice
[08:33:17] Hi elukey
[08:33:24] procedure sounds good to me :)
[08:33:26] Hi moritzm
[08:39:29] all right, email sent
[08:40:11] joal: I am going to stop camus and oozie to prepare the kernel upgrades, is it ok?
[08:40:23] sounds good
[08:40:26] super
[08:40:33] elukey: stop only load jobs, right?
[08:43:54] joal: I usually suspend all the bundles just to be sure. I know that they are dependent on each other, but it is easier for me to just suspend all. Would that be ok?
[08:44:15] elukey: I'd rather just stop load: less error prone (fewer moving parts)
[08:44:35] elukey: Stopping everything would work, but means more changes
[08:44:58] ehm, already done
[08:45:09] * elukey hides in a corner
[08:45:10] It's ok
[08:45:29] sorry, I pressed suspend before checking IRC :(
[08:45:41] will check yarn and start the reboot in a bit
[08:45:56] don't bother, it's no big deal, just easier not to have to check 5 jobs instead of 1 :)
[08:46:07] yeah got it, you are right :)
[09:03:14] starting the reboot!
[09:05:18] elukey: I managed to stream some tables locally on aqs
[09:05:33] elukey: Could we have a look at ports when you're finished with the reboot?
[09:05:42] elukey: I don't have the rights to run tcpdump :)
[09:08:13] sure!
[09:39:49] currently rebooted: 1028->1042
[09:40:33] I am proceeding in batches of 3, checking yarn/hdfs each time, and the journal node if present
[09:40:45] I'll also do 1001/1002 at the end
[09:40:52] ETA: 30 mins?
[09:43:56] Analytics, Revision-Slider, TCB-Team, WMDE-Analytics-Engineering, and 3 others: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861#2414016 (Tobi_WMDE_SW)
[10:16:23] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2414097 (elukey) Rebooted by mistake as part of last round of hadoop kernel upgrades, now getting: ``` All of the disks from your previous configuration...
[10:29:59] all right, all the worker nodes rebooted
[10:30:06] joal: proceeding with 1001/2
[10:32:06] k elukey
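For reference, the batch procedure described above (suspend the Oozie bundles so nothing new gets scheduled, then roll through the workers three at a time while watching HDFS/YARN health) might look roughly like the sketch below, assuming the stock Oozie and Hadoop CLIs; the Oozie URL, bundle ID, and host range are placeholders, not the real production values.

```
#!/bin/bash
# Hypothetical sketch of the rolling-reboot routine discussed above;
# the Oozie URL, bundle ID, and host list are placeholders.
OOZIE_URL='http://localhost:11000/oozie'

# See which bundles are running, then suspend them (elukey suspended
# all of them) so no new workflows get scheduled during the reboots.
oozie jobs -oozie "$OOZIE_URL" -jobtype bundle -filter status=RUNNING
oozie job  -oozie "$OOZIE_URL" -suspend 0000042-160629-oozie-oozi-B   # placeholder bundle ID

# Roll through the workers in batches of 3, checking cluster health
# before moving on to the next batch.
printf 'analytics10%02d.eqiad.wmnet\n' $(seq 28 42) | xargs -n3 |
while read -r batch; do
    for host in $batch; do
        ssh "$host" sudo reboot || true    # ssh exits non-zero as the box goes down
    done
    sleep 300                                        # give the batch time to come back
    hdfs dfsadmin -report | grep 'Live datanodes'    # all datanodes back?
    yarn node -list | grep -c RUNNING                # all nodemanagers back?
done
```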
[10:33:17] elukey: Had you waited for jobs to be finished before starting? Cause I see some prod jobs running now
[10:34:59] joal: the only ones that I noticed were oozie's, but I suspended them, so I proceeded after waiting a while, with three nodes at a time
[10:35:19] now I can see druid ones
[10:35:43] elukey: there is 1 druid (I launched it having forgotten you were restarting, no big deal, it's testing)
[10:35:51] But there are still 2 production ones
[10:36:26] Normally they shouldn't fail, but the reason we stop and then restart oozie jobs is to prevent having jobs running while restarting
[10:36:46] If we don't wait for jobs to be finished, there is no point in even stopping oozie
[10:37:06] mmm, but I thought that suspending them would have prevented their completion
[10:37:23] elukey: oozie has no impact on already launched jobs
[10:37:35] ah good, didn't know this bit
[10:37:54] I was watching failed jobs though and didn't see anything oozie-related
[10:38:05] and I proceeded very slowly
[10:38:32] We'll receive alerts if anything fails, so no big deal
[10:39:00] It's just that you've set up the thing, but since you didn't wait, it is as if you hadn't set it up :)
[10:40:06] well, it prevented other jobs from starting. I checked and there were only two oozie jobs running. My bad that I confused how SUSPENDED works for already launched jobs, but I wouldn't say that it was useless to stop everything
[10:40:17] anyhow, lesson learned for the next reboot
[10:40:32] elukey: yeah you're right, I'm a bit overstating this :)
[10:41:02] elukey: sorry for being harsh
[10:42:02] nope, it wasn't harsh. I really appreciate discussion, especially from people who know how stuff works, I was just defending my position :P
[10:42:40] k good :)
[10:45:59] all right, 1001 rebooting, 1002 is now Yarn/HDFS master
[10:46:37] elukey: the HA on namenode and resource manager is really great :)
[10:46:50] and they are super fast!
[10:47:03] really impressed
[10:47:05] :)
[10:47:20] elukey: I am currently running in spark, the thing didn't even break :)
[10:47:36] just lost some workers, re-instantiated them with the new manager
[10:47:40] awesome :)
[10:58:13] 1001 is back to primary, 1002 rebooting now
[11:01:38] all good, 1002's Yarn is up and running as secondary, HDFS is bootstrapping
[11:03:03] great :)
[11:05:18] joal: I am going to re-enable camus and oozie as soon as 1002 is up and running fine, then I'll step afk for ~1hr. After that I'll be ready for the tcpdump and aqs related work, would that be ok for you?
[11:05:29] elukey: sure :)
[11:06:08] thanks!
[11:07:53] hallo
[11:08:07] was the stat1002 reboot done already? or is it going to happen?
[11:08:19] aharoni: tomorrow morning EU time :)
[11:08:25] thanks
[11:08:43] I can run beeline queries today without issues, can't I?
[11:08:55] Not extremely long ones, a few minutes each
[11:10:35] yep, sure!
[11:10:39] thanks!
[11:12:09] :)
[11:12:18] joal: camus re-enabled, proceeding with oozie
[11:12:22] k
[11:12:23] Analytics, Community-Tech, Pageviews-API, Pageviews-on-Labs, and 2 others: links on Pageviews in labs are broken with RTL UI - https://phabricator.wikimedia.org/T138928#2414238 (Amire80)
[11:12:55] done!
[11:17:16] all right, all good, everything seems to be working fine. I am going afk but I'll have my laptop with me JUST IN CASE something comes up (reachable via hangouts!)
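One detail worth pulling out of the exchange above: suspending an Oozie bundle only stops new workflows from being scheduled, while already-launched YARN applications keep running. A minimal sketch of the checks this implies, using the standard Hadoop CLIs; the HA service IDs (nn1/nn2, rm1/rm2) are placeholders for whatever hdfs-site.xml and yarn-site.xml actually define:

```
# Suspending Oozie does not kill already-launched applications, so
# verify nothing is still running before rebooting a worker:
yarn application -list -appStates RUNNING

# Before rebooting an HA master (the 1001/1002 dance above), check
# which NameNode / ResourceManager is currently active. The service
# IDs below are placeholders for the ones defined in hdfs-site.xml
# and yarn-site.xml.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```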
[11:47:15] Curious. It feels like queries on stat1002 are faster. But we're still before the reboot. Is it because jobs are stopped?
[11:47:23] Or maybe I'm just imagining that they are faster? :)
[11:48:20] aharoni: jobs on the cluster have been stopped, and it takes time for the full dependency chain to restart
[11:48:45] aharoni: when everything is restarted, fewer resources will be available, therefore longer queries
[11:53:04] so... this begs the village-fool question: can we make the queries faster all the time? :)
[12:02:19] hi team!
[12:04:34] hi mforns :)
[12:45:29] o/
[12:45:43] heya elukey
[12:45:53] elukey: from what I've seen, nothing broke :)
[12:46:10] yeah, I watched icinga and the channel once in a while :P
[12:46:19] I need to reboot 1027 too
[12:46:29] and 1026, which I didn't know existed..
[12:46:46] elukey: well, nothing really: some jobs to relaunch, but no major outage
[12:47:15] ah snap!
[12:47:19] # analytics1026 is spare, for now is just an analytics client.
[12:47:20] node 'analytics1026.eqiad.wmnet' { role analytics_cluster::client include standard
[12:47:23] }
[12:48:08] but checking with last on the host I can't find anyone other than me
[12:48:12] so I am going to reboot it :)
[13:03:37] all right, 1026 has been rebooted
[13:04:08] and 1015 is another spare that we can reboot.. then 1027
[13:04:37] joal: afaiu 1027 could be rebooted after checking that camus is not running, right?
[13:05:10] analytics1027 is a regular node?
[13:05:15] elukey: --^
[13:05:50] nope, there is only camus from what I can see
[13:06:15] ah no, also hue!
[13:06:15] https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L114
[13:06:34] and it seems to be the refinery target
[13:06:40] elukey: sorry, I forgot it was 1027
[13:07:02] too many numbers :)
[13:07:09] I think the only critical thing is camus
[13:07:14] probably better to stop hue as well
[13:07:19] the only ones left, except the stats, are 1003 and 1027..
[13:07:23] oozie isn't running on that one?
[13:07:35] I think it is on 1003
[13:07:37] checking
[13:07:39] k
[13:07:59] yeah, https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L69
[13:08:08] together with the various databases
[13:08:13] hive metastore, oozie, hue, etc..
[13:08:25] okey
[13:08:30] elukey: do you need to do them?
[13:08:42] yeah
[13:08:52] elukey: then better to do them both together
[13:09:07] Would actually have been good to do them while the cluster was down ...
[13:09:28] * elukey nods
[13:09:51] So .... as usual, huh? :)
[13:10:00] stop everything, reboot, restart :)
[13:10:12] sure, going to do it
[13:10:17] Just wait to be sure the camus run is done, but apart from that it should hopefully be ok
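The pre-reboot checks agreed on above for analytics1027 might look like this sketch. It assumes Camus is launched from cron and Hue runs as an init service; the user and service names are guesses rather than the actual puppet-managed values:

```
# Is a Camus run in flight right now? Wait for it if so.
pgrep -f camus && echo "camus still running, wait before rebooting"

# Find the cron entry that launches it (user 'hdfs' is a guess).
sudo crontab -l -u hdfs | grep -i camus

# Stop Hue before the reboot (assuming an init-script service).
sudo service hue stop
```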
[13:22:56] all the oozie alarms are due to the reboots, right?
[13:23:46] I was about to ask
[13:23:56] there is an issue I think
[13:23:58] seems related to mobile apps only
[13:24:02] a bit cryptic
[13:24:03] :D
[13:24:27] also, hue.wikimedia.org is down atm, will be back shortly
[13:24:54] elukey: you've just restarted oozie, right?
[13:26:03] nope
[13:26:17] hm
[13:26:21] Something is wrong
[13:26:33] :/
[13:26:37] jobs from old dates are relaunching
[13:28:35] mmmmm
[13:28:51] batcave to try and figure it out?
[13:28:56] yup
[13:28:57] joining
[13:30:01] so the site.pp entry about 1027 https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L114 says "used for launching" but I can't see anything related to it
[13:30:10] except refinery, but not sure
[13:30:18] anyhow, will bring back hue and join
[13:30:25] it doesn't seem to like the reboot
[13:33:49] mmmm, we have both apache and nginx on 1027
[13:35:45] ah, both are doing proxying
[13:35:52] and nginx is working, apache is not
[13:35:54] weird
[13:36:05] will write down a note for andrew
[13:46:54] hi, I just received a mail about a very old job failure (Fatal Error - Oozie Job load_mediawiki-wmf_raw.CirrusSearchRequestSet,2016,4,20,13-wf)
[13:47:14] but the partition seems to exist (year=2016/month=4/day=20/hour=13), sounds safe to ignore?
[13:48:26] dcausse: yes, please ignore, we're experiencing weird stuff on the cluster
[13:48:37] ok, thanks!
[13:49:34] dcausse: sorry for the spam!
[13:49:45] hey, no problem :)
[13:50:59] Analytics-Tech-community-metrics: Deployment of Demography panel - https://phabricator.wikimedia.org/T138757#2414774 (Qgil)
[13:51:20] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2414776 (Qgil)
[13:51:48] Analytics-Tech-community-metrics, Developer-Relations: Measuring Time To First Code Change (TTFCC) - https://phabricator.wikimedia.org/T137201#2414777 (Qgil)
[14:35:50] * elukey hates oozie once more
[14:37:35] dcausse: we found out that an old oozie process was still installed on analytics1015 (now almost decommissioned), and rebooting the host for the kernel upgrades triggered its resurrection
[14:38:15] elukey: ah ok, thanks for the explanation!
[14:39:06] sorry for the trouble :(
[14:40:07] elukey: everything seems back to normal :)
[14:40:20] thanks for the super intuition joal
[14:41:11] no problem, something had to be happening somewhere :)
[14:43:33] I removed the oozie package from 1015 for the moment
[14:43:40] and stopped the hive metastore and server
[14:43:42] great elukey :)
[14:43:46] my plan is to wipe that host
[14:43:48] asap
[14:43:49] :D
[14:43:52] :D
[14:43:57] Kill the fucker!
[14:43:59] need to wait for Andrew just to check
[14:44:00] hahahah
[14:46:55] elukey: have a minute for a tcpdump / port discussion?
[14:48:27] yes, sure!
[14:48:55] batcave?
[14:50:40] yes!
[14:50:54] gimme 2 mins
[14:50:57] (brb)
[14:51:03] sure
[14:57:46] joal: can't hear you in the batcave
[14:58:08] just wanted to say that I have a 1:1 with nuria in a bit, but will be free in ~20/30 mins if you want
[14:58:52] elukey: will depend on Lino :)
[14:59:04] elukey: my internet broke, don't know why
[14:59:13] Let's see what happens after the 1:1 :)
[14:59:21] sure
[14:59:29] sorry about that, just saw the reminder
[14:59:33] np
[15:00:35] nuria OOT 28, 29, 30th of June
[15:00:43] I guess that we can meet now, joal :)
[15:00:48] Ok, joining :)
[15:06:25] Gone again :(
[15:44:28] Updated the phab task!
[15:44:39] I guess that I can remove Thrift from Ferm too
[15:44:49] and restart all the aqs100[456] instances
[15:44:53] joal --^
[15:51:53] I am restarting the aqs instances atm
[15:51:57] to remove thrift
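The cleanup on analytics1015 and the follow-up port check on the aqs hosts might look like the sketch below. Package and service names are best guesses based on CDH packaging; the ports are Cassandra's defaults (9160 for Thrift, 9042 for the native protocol):

```
# On analytics1015: stop and remove the stale Oozie that resurrected
# after the reboot, and stop the leftover Hive daemons. Service and
# package names are guesses based on CDH packaging.
sudo service oozie stop
sudo apt-get remove --purge oozie
sudo service hive-metastore stop
sudo service hive-server2 stop

# On the aqs hosts, after dropping the Thrift rule from ferm and
# restarting the Cassandra instances, confirm nothing listens on the
# Thrift port (9160) while the native port (9042) is still up.
sudo ss -tlnp | grep -E ':(9160|9042)'
```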
[16:39:31] going afk!
[16:39:37] byeee o/ see you tomorrow :)
[19:31:20] Analytics, Team-Practices, User-JAufrecht: Get regular traffic reports on TPG pages - https://phabricator.wikimedia.org/T99815#2415749 (JAufrecht) Open>declined Premise: - TPG provides value in several ways, including producing documentation to be read by TPGers, other WMF staff, or the mov...
[19:39:56] Quarry, Patch-For-Review: Add database selector - https://phabricator.wikimedia.org/T76466#2415766 (Krenair) a:Krenair>None
[20:20:20] bye a-team, see ya tomorrow :]