[08:24:26] <elukey>	 joal: Hi! We'd need to reboot stat100[234] for kernel upgrades
[08:25:06] <elukey>	 what is usually the procedure? email to analytics@ one day in advance?
[08:26:08] <moritzm>	 in the past I mailed analytics@ in addition to a headsup on IRC. since I wasn't sure everyone follows IRC
[08:28:34] <elukey>	 good! Will send an email soon then, and reboot tomorrow morning
[08:30:55] <moritzm>	 ok, nice
[08:33:17] <joal>	 Hi elukey
[08:33:24] <joal>	 procedure sounds good to me :)
[08:33:26] <joal>	 Hi moritzm
[08:39:29] <elukey>	 all right email sent
[08:40:11] <elukey>	 joal: I am going to stop camus and oozie to prepare the kernel upgrades, it it ok?
[08:40:17] <joal>	 soun
[08:40:23] <joal>	 sounds good
[08:40:26] <elukey>	 super
[08:40:33] <joal>	 elukey: stop only laod jobs, right?
[08:43:54] <elukey>	 joal: I usually suspend all the bundles just to be sure, I know that they are dependent on each other but it is easier for me to just suspend all. Would it be good?
[08:44:15] <joal>	 elukey: I'd rather just stop load: less error prone (less moving parts)
[08:44:35] <joal>	 elukey: Stoping everything ould work, but means more changes
[08:44:58] <elukey>	 ehm already done
[08:45:09] * elukey hides in a corner
[08:45:10] <joal>	 It's ok
[08:45:29] <elukey>	 sorry I pressed suspend before checking IRC :(
[08:45:41] <elukey>	 will check yarn and start the reboot in a bit
[08:45:56] <joal>	 don't bother, it's no big deal, just easier not to have to check 5 jobs instead of 1 :)
[08:46:07] <elukey>	 yeah got it, you are right :)
[09:03:14] <elukey>	 starting the reboot!
[09:05:18] <joal>	 elukey: I managed to stream some tables locally on aqs
[09:05:33] <joal>	 elukey: Could we have a look at ports when you'll be finished with the reboot?
[09:05:42] <joal>	 elukey: I don't have the right to tcpdump :)
[09:08:13] <elukey>	 sure!
[09:39:49] <elukey>	 currently rebooted: 1028->1042
[09:40:33] <elukey>	 I am proceeding in batches of 3 checking each time yarn/hdfs and journal if present
[09:40:45] <elukey>	 I'll also do 1001/1002 in the end
[09:40:52] <elukey>	 ETA: 30 mins?
[09:43:56] <wikibugs>	 Analytics, Revision-Slider, TCB-Team, WMDE-Analytics-Engineering, and 3 others: Data need:  Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861#2414016 (Tobi_WMDE_SW)
[10:16:23] <wikibugs>	 Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2414097 (elukey) Rebooted by mistake as part of last round of hadoop kernel upgrades, now getting:  ``` All of the disks from your previous configuration...
[10:29:59] <elukey>	 all right all the worker nodes rebooted
[10:30:06] <elukey>	 joal: proceeding with 1001/2
[10:32:06] <joal>	 k elukey
[10:33:17] <joal>	 elukey: Had you waited for jobs to be finished before startuing? Cause I see some prod jobs running now
[10:34:59] <elukey>	 joal: the only ones that I noticed were oozie's but I suspended them, so I proceeded after waiting a while with three nodes at the time
[10:35:19] <elukey>	 now I can see druid ones
[10:35:43] <joal>	 elukey: there is 1 druid (I launched it having forgotten you were restarting, no big deal, it's testing)
[10:35:51] <joal>	 But there still are 2 production ones
[10:36:26] <joal>	 Normally they shouldn't fail, but the reason for which we stop then restart oozie jobs is to prevent having running jobs while reatsrting
[10:36:46] <joal>	 If we don't wait for jobs to be finished, no point in even stop oozie
[10:37:06] <elukey>	 mmm but I thought that suspending them would have prevented their completion
[10:37:23] <joal>	 elukey: oozie has no impact on already launched jobs
[10:37:35] <elukey>	 ah good didn't know this bit
[10:37:54] <elukey>	 I was watching failed jobs though and didn't see anything oozie related
[10:38:05] <elukey>	 and I proceeded very slowly
[10:38:32] <joal>	 We'll receive alerts if anything fails, so no big deail
[10:39:00] <joal>	 It's just that you've setup the thing but since you didn't wait, it is as if you hadn't set it up :)
[10:40:06] <elukey>	 well it prevented other jobs to start, I checked and there were only two oozie jobs running. My bad that I confused how SUSPENDED works for already launched jobs, but I wouldn't say that it was useless to stop everything
[10:40:17] <elukey>	 anyhow, lesson learned for the next reboot
[10:40:32] <joal>	 elukey: yeah you're right, I'm a bit overstating this :)
[10:41:02] <joal>	 elukey: sorry about me being harsh
[10:42:02] <elukey>	 nope it wasn't harsh, I really appreciate discussion especially from people that knows how stuff works, I was just defending my position :P
[10:42:40] <joal>	 k good :)
[10:45:59] <elukey>	 all right 1001 rebooting, 1002 is now Yarn/HDFS master
[10:46:37] <joal>	 elukey: the HA on namenode and resource manager is really great: )
[10:46:50] <elukey>	 and they are super fast!
[10:47:03] <elukey>	 really impressed
[10:47:05] <joal>	 :)
[10:47:20] <joal>	 elukey: I am currently running in spark, the thing didn't even broke :)
[10:47:36] <joal>	 just lost some workers, reinstanciated them with new manager
[10:47:40] <joal>	 awesome :)
[10:58:13] <elukey>	 1001 is back to primary, 1002 rebooting now
[11:01:38] <elukey>	 all good, 1002's Yarn is up and running as secondary, HDFS is bootstrapping
[11:03:03] <joal>	 great :)
[11:05:18] <elukey>	 joal: I am going to re-enable camus and oozie as soon as 1002 is up and running fine, then I'll step afk for ~1hr. After that I'll be ready for the tcpdump and aqs related work, would it be ok for you?
[11:05:29] <joal>	 elukey: sure :)
[11:06:08] <elukey>	 thanks!
[11:07:53] <aharoni>	 hallo
[11:08:07] <aharoni>	 was the stat1002 reboot done already? or is it going to happen?
[11:08:19] <elukey>	 aharoni: tomorrow morning EU time :)
[11:08:25] <aharoni>	 thanks
[11:08:43] <aharoni>	 I can run beeline queries today without issues, can I?
[11:08:55] <aharoni>	 Not extremely long ones, a few minutes each
[11:10:35] <elukey>	 yep sure!
[11:10:39] <aharoni>	 thanks!
[11:12:09] <elukey>	 :)
[11:12:18] <elukey>	 joal: camus re-enabled, proceeding with oozie
[11:12:22] <joal>	 k
[11:12:23] <wikibugs>	 Analytics, Community-Tech, Pageviews-API, Pageviews-on-Labs, and 2 others: links on Pageviews in labs are broken with RTL UI - https://phabricator.wikimedia.org/T138928#2414238 (Amire80)
[11:12:55] <elukey>	 done!
[11:17:16] <elukey>	 all right all good, everything seems working fine. I am going afk but I'll have my laptop with me JUST IN CASE something comes up (rachable via hangouts!)
[11:47:15] <aharoni>	 Curious. It feels that queries on stat1002 are faster. But we're still before the reboot. Is it because jobs are stopped?
[11:47:23] <aharoni>	 Or maybe I'm just imagining that they are faster? :)
[11:48:20] <joal>	 aharoni: jobs on the cluster have been stopped, and it takes time for the full dependency to restart
[11:48:45] <joal>	 aharoni: when everything will be restarted , less resources will be available, therefore longer queries
[11:53:04] <aharoni>	 so... this begs the village-fool question: can we make the queries faster all the time? :)
[12:02:19] <mforns>	 hi team!
[12:04:34] <joal>	 hi mforns :)
[12:45:29] <elukey>	 o/
[12:45:43] <joal>	 heya elukey
[12:45:53] <joal>	 elukey: from what I've seen nothing broke :)
[12:46:10] <elukey>	 yeah I watched icinga and the channel once in a while :P
[12:46:19] <elukey>	 I need to reboot 1027 too
[12:46:29] <elukey>	 and 1026, that I didn't know it existed..
[12:46:46] <joal>	 elukey: well nothing, some jobs to relaunch, but no major outge
[12:47:15] <elukey>	 ah snap!
[12:47:19] <elukey>	 # analytics1026 is spare, for now is just an analytics client.
[12:47:20] <elukey>	 node 'analytics1026.eqiad.wmnet' { role analytics_cluster::client include standard
[12:47:23] <elukey>	 }
[12:48:08] <elukey>	 but checking with last on the host I can't find other people than me
[12:48:12] <elukey>	 so I am going to reboot it :)
[13:03:37] <elukey>	 all right 1026 has been rebooted
[13:04:08] <elukey>	 and 1015 is another spare that we can reboot.. then 1027
[13:04:37] <elukey>	 joal: afaiu 1027 could be rebooted after checking that camus is not running right?
[13:05:10] <joal>	 analytics 1027 is a regular node?
[13:05:15] <joal>	 elukey: --^
[13:05:50] <elukey>	 nope, there is only camus from what I can see
[13:06:15] <elukey>	 ah no also hue!
[13:06:15] <elukey>	 https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L114
[13:06:34] <elukey>	 and it seems the refinery target
[13:06:40] <joal>	 elukey: sorry I forgot it was 1027
[13:07:02] <elukey>	 too many numbers :)
[13:07:09] <joal>	 I think the only critical thing is camus
[13:07:14] <joal>	 probably better to stop hue as well
[13:07:19] <elukey>	 the only ones left, except the stats, are 1003 and 1027..
[13:07:23] <joal>	 oozie isn't running on that one?
[13:07:35] <elukey>	 I think it is on 1003
[13:07:37] <elukey>	 checking
[13:07:39] <joal>	 k
[13:07:59] <elukey>	 yeah https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L69
[13:08:08] <elukey>	 together with the various databases
[13:08:13] <elukey>	 hive metastore, oozie, hue, etc..
[13:08:25] <joal>	 okey
[13:08:30] <joal>	 elukey: do you need to do them?
[13:08:42] <elukey>	 yeah
[13:08:52] <joal>	 elukey: then better to do them both together
[13:09:07] <joal>	 Would actually have been good to do them while cluster was down ...
[13:09:28] * elukey nods
[13:09:51] <joal>	 So .... as usual, huh? :)
[13:10:00] <joal>	 stop everything, reboot, restart :)
[13:10:12] <elukey>	 sure, going to do it
[13:10:17] <joal>	 Just wait to be sure camus run is done, but except from that hopefully would be ok
[13:22:56] <milimetric>	 all the oozie alarms are due to the reboots, right?
[13:23:46] <elukey>	 I was about to ask
[13:23:56] <joal>	 there is an issue I thimnk
[13:23:58] <elukey>	 seems related to mobile apps only
[13:24:02] <elukey>	 a bit cryptic
[13:24:03] <elukey>	 :D
[13:24:27] <elukey>	 also hue.wikimedia.org is down atm, will be back shortly
[13:24:54] <joal>	 elukey: you've just restarted oozie, right?
[13:26:03] <elukey>	 nope
[13:26:17] <joal>	 hm
[13:26:21] <joal>	 Something is wrong
[13:26:33] <elukey>	 :/
[13:26:37] <joal>	 jobs from old dates are relanching
[13:28:35] <elukey>	 mmmmm
[13:28:51] <milimetric>	 batcave to try and figure it out?
[13:28:56] <joal>	 yup
[13:28:57] <joal>	 joining
[13:30:01] <elukey>	 so the site.pp related data about 1027 https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp#L114 says "used for launching" but I can't see anything related ot it
[13:30:10] <elukey>	 except refinery, but not sure
[13:30:18] <elukey>	 anyhow, will bring back hue and join
[13:30:25] <elukey>	 it seems not liking the reboot
[13:33:49] <elukey>	 mmmm we have both apache and nginx on 1027
[13:35:45] <elukey>	 ah both are doing proxy
[13:35:52] <elukey>	 and nginx is working, apache is not
[13:35:54] <elukey>	 weird
[13:36:05] <elukey>	 will write down a note for andrew
[13:46:54] <dcausse>	 hi, I just received a mail from a very old job failure (Fatal Error - Oozie Job load_mediawiki-wmf_raw.CirrusSearchRequestSet,2016,4,20,13-wf)
[13:47:14] <dcausse>	 but the partition seems to exist (year=2016/month=4/day=20/hour=13) sounds safe to ignore?
[13:48:26] <joal>	 dcausse: yes, please ignore, we're experiencing weird stuff on the cluster
[13:48:37] <dcausse>	 ok, thanks!
[13:49:34] <joal>	 dcausse: sorry for the spams !
[13:49:45] <dcausse>	 hey no problem :)
[13:50:59] <wikibugs>	 Analytics-Tech-community-metrics: Deployment of Demography panel - https://phabricator.wikimedia.org/T138757#2414774 (Qgil)
[13:51:20] <wikibugs>	 Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2414776 (Qgil)
[13:51:48] <wikibugs>	 Analytics-Tech-community-metrics, Developer-Relations: Measuring Time To First Code Change (TTFCC) - https://phabricator.wikimedia.org/T137201#2414777 (Qgil)
[14:35:50] * elukey hates oozie once more
[14:37:35] <elukey>	 dcausse: we found out that an old oozie process was still installed on analytics1015 (now almost decomm), and after rebooting the host for kernel upgrades triggered its resurrection
[14:38:15] <dcausse>	 elukey: ah ok, thanks for the explanation!
[14:39:06] <elukey>	 sorry for the trouble :(
[14:40:07] <joal>	 elukey: everything seems back to normal :)
[14:40:20] <elukey>	 thanks for the super intuition joal
[14:41:11] <joal>	 for nothing, something had to happenning somewhere :)
[14:43:33] <elukey>	 I removed the oozie package from 1015 for the moment
[14:43:40] <elukey>	 and stopped the hive metastore and server
[14:43:42] <joal>	 great elukey :)
[14:43:46] <elukey>	 my plan is to wipe that host
[14:43:48] <elukey>	 asap
[14:43:49] <elukey>	 :D
[14:43:52] <joal>	 :D
[14:43:57] <joal>	 Kill the fucker !
[14:43:59] <elukey>	 need to wait for Andrew just to check
[14:44:00] <elukey>	 hahaahah
[14:46:55] <joal>	 elukey: have a minute for a tcpdump / port discussion?
[14:48:27] <elukey>	 yes sure!
[14:48:55] <joal>	 batcave?
[14:50:40] <elukey>	 yes!
[14:50:54] <elukey>	 gimme 2 mins
[14:50:57] <elukey>	 (brb)
[14:51:03] <joal>	 sure
[14:57:46] <elukey>	 joal: can't hear you in the bcave
[14:58:08] <elukey>	 just wanted to say that I have a 1:1 with nuria in a bit, but will be free in ~20/30 mins if you want
[14:58:52] <joal>	 elukey: will depend on Lino :)
[14:59:04] <joal>	 elukey: my internet broke, don't know why
[14:59:13] <joal>	 Let's see what happens after 1-1 :)
[14:59:21] <elukey>	 sure
[14:59:29] <elukey>	 sorry about that, just seen the reminder
[14:59:33] <joal>	 np
[15:00:35] <elukey>	 nuria OOT 28, 29, 30th of June
[15:00:43] <elukey>	 I guess that we can meet now joal :)
[15:00:48] <joal>	 Ok joining :)
[15:06:25] <joal>	 Gone again :(
[15:44:28] <elukey>	 Updated the phab task!
[15:44:39] <elukey>	 I guess that I can remove Thrift from Ferm too
[15:44:49] <elukey>	 and restart all the aqs100[456] instances
[15:44:53] <elukey>	 joal --^
[15:51:53] <elukey>	 I am restarting the aqs instances atm
[15:51:57] <elukey>	 to remove thrift
[16:39:31] <elukey>	 going afk!
[16:39:37] <elukey>	 byeee o/ see you tomorrow :)
[19:31:20] <wikibugs>	 Analytics, Team-Practices, User-JAufrecht: Get regular traffic reports on TPG pages - https://phabricator.wikimedia.org/T99815#2415749 (JAufrecht) Open>declined Premise:   - TPG provides value in several ways, including producing documentation to be read by TPGers, other WMF staff, or the mov...
[19:39:56] <wikibugs>	 Quarry, Patch-For-Review: Add database selector - https://phabricator.wikimedia.org/T76466#2415766 (Krenair) a:Krenair>None
[20:20:20] <mforns>	 bye a-team, see ya tomorrow :]