[08:23:20] morningggg
[08:23:34] that recurring IRC spammer is really annoying
[08:30:09] need to plan the reboot of stat* and analytics1003
[08:30:28] I am thinking to send an announce for Wed at around 10 CET
[08:31:00] the main issue though is that analytics1003 holds the database for Druid
[08:31:25] so overlords will not be happy for sure during the reboot
[08:51:06] ok email sent, now I'll try to reach out to all the people running a tmux/screen session
[08:53:35] elukey: stat* hosts all have the new kernel installed already
[08:54:24] nice!
[08:54:33] I am sending the emails to announce the maintenance
[09:05:23] all right all emails sent
[09:11:01] !log reboot aqs1004 for kernel updates
[09:11:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:31:17] all right aqs1004 is running with the new kernel
[09:32:42] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=aqs1004&var-datasource=eqiad%20prometheus%2Fops looks good for the moment, I'll let it boil for a bit before rebooting the rest
[09:32:46] now kafka2001
[09:52:14] kafka2001 up and running, all good
[09:58:37] !log rolling reboots of aqs hosts (1005->1009) for kernel updates
[09:58:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:53:40] currently rebooting aqs1007
[12:09:14] all right current status is
[12:09:22] AQS: 1004->1007 rebooted
[12:09:29] KAFKA: 200[12] rebooted
[12:09:36] will do the rest after lunch!
[12:09:37] !!!
[12:09:40] * elukey afk!
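A minimal sketch of the post-reboot check behind messages like the 09:31 one above ("aqs1004 is running with the new kernel"). This snippet is not from the log; it assumes a Debian-style host where installed kernels live in /boot as vmlinuz-<version> files.

```shell
# Compare the running kernel against the newest kernel installed on disk.
# Assumption: Debian-style /boot/vmlinuz-<version> layout, as on aqs/kafka hosts.
running=$(uname -r)
newest=$(ls /boot/vmlinuz-* 2>/dev/null | sed 's|.*/vmlinuz-||' | sort -V | tail -n 1)
echo "running=${running} newest=${newest:-unknown}"
if [ -n "$newest" ] && [ "$running" != "$newest" ]; then
    echo "newer kernel installed but not running - reboot still pending"
fi
```

Run per host (for example over ssh in a loop) before ticking it off as rebooted.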
[12:37:35] 10Analytics-Tech-community-metrics: Inconsistent numbers for new Gerrit developers ("Summary" vs "New Authors" widgets) - https://phabricator.wikimedia.org/T177569#3900412 (10Aklapper)
[12:37:37] 10Analytics-Tech-community-metrics: Number of changeset submitters in "gerrit_main_numbers" widget differs from number of submitters in "gerrit_top_developers" widget - https://phabricator.wikimedia.org/T184741#3900414 (10Aklapper)
[12:38:59] 10Analytics-Tech-community-metrics: Include gerrit DB's "author_bot" field also in the gerrit_demo DB - https://phabricator.wikimedia.org/T184907#3900416 (10Aklapper) p:05Triage>03Low
[12:55:44] Hi elukey - Thanks a lot for the reboots this morning :)
[13:04:12] :)
[13:04:19] going to complete aqs now
[13:04:37] elukey: How may I help with java8?
[13:05:39] joal: if we resolve the issue with spark2 then we are done
[13:05:50] the main issue though is when to schedule this upgrade
[13:05:51] ok elukey - Will investigate more
[13:06:01] before or after SF?
[13:06:15] hm - I'd say after, just for safety in case of possible issues
[13:09:09] yep I am thinking the same
[13:20:31] joal: if you like we can go through the wikistats deployment process :)
[13:20:42] Yes fdans!
[13:20:45] let's do that :)
[13:20:53] cave?
[13:20:59] OMW !
[13:28:52] 10Analytics-Tech-community-metrics: One account (in "gerrit_top_developers" widget) counted as two accounts (in "gerrit_main_numbers" widget) - https://phabricator.wikimedia.org/T184741#3900475 (10Aklapper)
[13:32:34] (03PS1) 10Fdans: Release 2.1.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292
[13:32:40] so people now I'd need to reboot eventlog1001
[13:33:09] the idea would be to send an email to analytics@ explaining the maintenance, stop eventlogging completely (will affect graphs), reboot, re-enable
[13:33:14] (03CR) 10Fdans: [V: 032 C: 032] Release 2.1.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292 (owner: 10Fdans)
[13:33:37] mforns: o/ - anything against --^
[13:33:51] elukey, reading
[13:33:57] (03CR) 10Fdans: [V: 032 C: 032] "Merging to update release" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292 (owner: 10Fdans)
[13:34:16] elukey, nothing against :]
[13:34:26] all right, proceeding :)
[13:37:06] (03PS1) 10Fdans: Release 2.1.4 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295
[13:38:11] (03CR) 10Fdans: [V: 032 C: 032] "Pushing to release branch to deploy to production" [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295 (owner: 10Fdans)
[13:39:05] !log stop eventlogging and reboot eventlog1001 for kernel updates
[13:39:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:40:51] AQS reboots completed
[13:40:58] same thing for main-codfw
[13:42:00] eventlog1001 is rebooting
[13:42:41] fdans: when you submit I can run puppet on thorium to force the git pull if you want
[13:44:02] (03Merged) 10jenkins-bot: Release 2.1.4 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295 (owner: 10Fdans)
[13:45:18] there you go, puppet ran
[13:45:59] thank youuuu elukey
[13:45:59] eventlog1001 back in service
[13:46:15] Thank you elukey :)
[13:47:22] * joal now has witnessed his first deployment of wikistats2 !!!
[13:47:57] huzzaaaaa \o/
[13:54:50] 10Analytics-Kanban, 10Pageviews-API: Use country ISO codes instead of country names in top by country pageviews - https://phabricator.wikimedia.org/T184911#3900513 (10fdans)
[13:55:26] 10Analytics-Kanban, 10Pageviews-API: Use country ISO codes instead of country names in top by country pageviews - https://phabricator.wikimedia.org/T184911#3900527 (10fdans)
[13:55:28] 10Analytics-Kanban: Add ISO code to AQS data per country - https://phabricator.wikimedia.org/T184748#3900529 (10fdans)
[13:56:06] (03PS1) 10Fdans: Use ISO country codes instead of country names [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911)
[14:01:58] (03CR) 10Joal: "Small update in commit message" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:05:27] (03PS2) 10Fdans: Load cassandra with ISO country codes instead of country names [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911)
[14:05:45] (03CR) 10Fdans: Load cassandra with ISO country codes instead of country names (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:06:40] (03CR) 10Joal: [V: 032 C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:07:06] thank you joal !
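The refinery patch reviewed above replaces country names with ISO codes before loading Cassandra. A toy sketch of that kind of transformation follows; the mapping table and `to_iso` helper are hypothetical illustrations, not the actual refinery code (which does this in the Hadoop pipeline):

```python
# Hypothetical illustration of swapping country names for ISO 3166-1
# alpha-2 codes in (country, pageviews) rows; not the refinery patch itself.
NAME_TO_ISO = {
    "France": "FR",
    "Germany": "DE",
    "United States": "US",
}

def to_iso(rows):
    """Return rows with the country field rewritten to its ISO code.

    Unknown names are passed through unchanged rather than dropped.
    """
    return [(NAME_TO_ISO.get(country, country), views) for country, views in rows]

print(to_iso([("France", 120), ("Germany", 80)]))  # [('FR', 120), ('DE', 80)]
```

ISO codes make the AQS per-country data stable across renames and easy to localize on the client side, which is the motivation stated in T184911.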
[14:07:15] Thank YOU fdans :)
[14:11:08] (03CR) 10Fdans: [C: 032] Add documentation links to each metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402466 (https://phabricator.wikimedia.org/T183188) (owner: 10Milimetric)
[14:39:56] (03PS9) 10Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[14:41:43] * fdans is wondering why the damn hell the wikiselector is not working in his local env
[14:43:44] elukey: labs cluster is in bad shape - 1 lost node and 1 unhealthy one :(
[14:43:44] (03CR) 10jerkins-bot: [V: 04-1] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans)
[14:51:44] joal: I was about to tell you that I was experimenting a bit with hive
[14:52:01] lost node where? hdfs/yarn/etc ?
[14:52:14] elukey: no problem - Yes, nodes down from YARN UI
[14:52:51] which ones?
[14:54:06] elukey: lost worker-2, unhealthy worker-1, working worker-3, but jobs are not launching, so probably stuck because of too much down
[14:57:03] going to fix the cluster in 1 min
[14:59:16] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900728 (10elukey) About Hive, I tried to re-apply the changes to the metastore and this is the difference in ps: ``` elukey@hadoop-coordinator-1:~$ ps...
[15:01:30] java.io.IOException: No space left on device
[15:01:33] there you go
[15:02:15] Arf elukey :(
[15:02:31] A-team - Gone to grab Lino, will be back for standup, maybe a bit late
[15:02:45] ah yes hdfs data :D
[15:03:02] it is filling up all the disk space
[15:43:59] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900880 (10elukey) So with a better ps what happens is clear: ``` hive 318 0.0 0.0 13396 3180 ? S 15:35 0:00 bash /usr/lib/hive/bi...
[16:01:28] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900898 (10Nuria) If the server that fronts mediawiki downloads is backed up by varnish (is it?) this data exists in hadoop most likely. Can @Legoktm answer this quest...
[16:03:03] a-team, standup?
[16:04:49] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900899 (10elukey) I added a `timeout 3` bash command and it worked fine, but then a similar issue re-happened when I tried to restart the metastore serv...
[16:05:35] elukey?
[16:06:02] ahahha sooorrryyyy
[16:06:09] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900905 (10Addshore) Looking at the response headers dumps.wikimedia.org is not behind varnish {F12576840}
[16:06:11] I was super into a task and didn't see the hour
[16:06:13] coming mforns !
[16:07:03] mforns: it's MLK Day here, just in case other US people aren't there
[16:23:15] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900927 (10Legoktm) >>! In T119772#3900898, @Nuria wrote: > If the server that fronts mediawiki downloads is backed up by varnish (is it?) this data exists in hadoop mo...
[16:24:05] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900929 (10Addshore) >>! In T119772#3900927, @Legoktm wrote: >>>! In T119772#3900905, @Addshore wrote: >> Looking at the response headers dumps.wikimedia.org is not beh...
[16:27:09] (03CR) 10Nuria: [C: 032] "Let's choose a schema that will benefit from it being loaded into druid and proceed with setting up crons and such." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[16:32:55] (03Merged) 10jenkins-bot: Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[16:45:56] Hi a-team - tasking cancelled I guess?
[16:46:14] joal, yes, we're very few, and only 1 task in incoming
[16:46:14] I'm super sorry for having missed standup :(
[16:46:22] k
[16:46:24] np :]
[16:46:29] Thanks for the headsup mforns :)
[16:46:39] :]
[16:46:57] hey joal btw, I think we should switch ops week
[16:47:11] no?
[16:47:35] mforns: very possible - better for you?
[16:48:03] joal, yes sure!
[16:48:21] mforns: same for me, so if better for you, let's do it :)
[16:48:25] joal, next week I'll be on vacation on Monday and Tuesday, preparing for the trip
[16:48:39] mforns: ok great - I leave you this week then :)
[16:48:42] I'll take next
[16:48:44] ok, perfect
[16:48:50] today was lost though
[16:48:56] no worries mforns :)
[16:49:09] I'm not like, **counting** days of ops ;)
[16:49:10] ok
[16:49:18] hehe
[16:51:59] Heya elukey - Sorry to have dropped earlier on - Can we try and fix the labs cluster?
[16:57:14] joal: hey, the fix is to delete some hdfs data
[16:57:24] elukey: Ha!
[16:57:26] Can do :)
[16:58:35] :D
[17:03:50] elukey: SUPAWEIRD:
[17:03:55] Filesystem Size Used Available Use%
[17:03:55] hdfs://analytics-hadoop-labs 111.5 G 30.4 G 1.1 G 27%
[17:03:56] Filesystem Size Used Available Use%
[17:03:56] hdfs://analytics-hadoop-labs 111.5 G 30.4 G 1.1 G 27%
[17:03:58] sorry
[17:05:38] elukey: And interestingly, the biggest use is from the refinery package :)
[17:07:11] some nodes are down so maybe the "available" is for that reason?
[17:07:23] probably elukey
[17:08:46] root partitions on the datanodes are 99/100 used
[17:09:53] so if we manage to remove some data they might get better
[17:10:57] elukey: probably logs - will look
[17:13:09] elukey: I've deleted data on HDFS, but down nodes don't delete it (obviously ...)
[17:13:27] Shall I delete /var/lib/hadoop/data manually?
[17:14:36] joal: all datanodes are up as far as I can see, they should be working
[17:14:42] hm
[17:16:17] if I had more judgement I'd have probably used more powerful instances :(
[17:17:33] elukey: something weird here: looks like we have 19G available on each worker, but HDFS sees 115G available total - weird
[17:18:44] is data replicated 3 times? If so it might be too much, two is ok for our use case
[17:18:50] it might save some space
[17:19:08] elukey: We can even set it to 1
[17:19:12] in hdfs-site
[17:19:52] elukey: however I still find it weird that hdfs was formatted seeing 111.5G available while we only have 19*3 = 57G
[17:20:18] it looks like double the real space
[17:26:49] joal: logging off for today, if you have time we can experiment tomorrow
[17:27:05] elukey: sure, thanks
[17:27:20] have a nice evening!
[17:27:24] bye :)
[19:08:29] (03CR) 10Ottomata: "Nice!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403891 (owner: 10Joal)
[20:01:38] Gone for tonight team - See you tomorrow
[23:42:24] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3901647 (10demon) It is in Hadoop, we just need to surface the data somehow.
[23:55:11] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Have "Last Attracted Developers" information for Gerrit (already exists for Git) automatically updated - https://phabricator.wikimedia.org/T151161#3901657 (10Aklapper)
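For reference, the replication-factor change discussed at 17:18-17:19 would look roughly like the fragment below in hdfs-site.xml. This is a sketch, not a patch from the log: the value 2 follows elukey's "two is ok for our use case", and the stock Hadoop config layout is assumed.

```xml
<!-- hdfs-site.xml (sketch): lower the block replication factor so a small
     labs cluster does not triple its disk usage; the Hadoop default is 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Note that this setting only affects newly written files; blocks already on HDFS keep their replication factor unless changed explicitly with `hdfs dfs -setrep`.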