[08:23:20] morningggg
[08:23:34] that recurring IRC spammer is really annoying
[08:30:09] need to plan the reboot of stat* and analytics1003
[08:30:28] I am thinking to send an announce for Wed at around 10 CET
[08:31:00] the main issue though is that analytics1003 holds the database for Druid
[08:31:25] so overlords will not be happy for sure during the reboot
[08:51:06] ok email sent, now I'll try to reach out to all the people running a tmux/screen session
[08:53:35] elukey: stat* hosts all have the new kernel installed already
[08:54:24] nice!
[08:54:33] I am sending the emails to announce the maintenance
[09:05:23] all right all emails sent
[09:11:01] !log reboot aqs1004 for kernel updates
[09:11:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:31:17] all right aqs1004 is running with the new kernel
[09:32:42] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=aqs1004&var-datasource=eqiad%20prometheus%2Fops looks good for the moment, I'll let it boil for a bit before rebooting the rest
[09:32:46] now kafka2001
[09:52:14] kafka2001 up and running, all good
[09:58:37] !log rolling reboots of aqs hosts (1005->1009) for kernel updates
[09:58:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:53:40] currently rebooting aqs1007
[12:09:14] all right current status is
[12:09:22] AQS: 1004->1007 rebooted
[12:09:29] KAFKA: 200[12] rebooted
[12:09:36] will do the rest after lunch!
[12:09:37] !!!
[12:09:40] * elukey afk!
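A minimal sketch of the post-reboot check behind messages like the 09:31 one above ("aqs1004 is running with the new kernel"). This snippet is not from the log; it assumes a Debian-style host where installed kernels live in /boot as vmlinuz-<version> files.

```shell
# Compare the running kernel against the newest kernel installed on disk.
# Assumption: Debian-style /boot/vmlinuz-<version> layout, as on aqs/kafka hosts.
running=$(uname -r)
newest=$(ls /boot/vmlinuz-* 2>/dev/null | sed 's|.*/vmlinuz-||' | sort -V | tail -n 1)
echo "running=${running} newest=${newest:-unknown}"
if [ -n "$newest" ] && [ "$running" != "$newest" ]; then
    echo "newer kernel installed but not running - reboot still pending"
fi
```

Run per host (for example over ssh in a loop) before ticking it off as rebooted.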
[12:37:35] 10Analytics-Tech-community-metrics: Inconsistent numbers for new Gerrit developers ("Summary" vs "New Authors" widgets) - https://phabricator.wikimedia.org/T177569#3900412 (10Aklapper)
[12:37:37] 10Analytics-Tech-community-metrics: Number of changeset submitters in "gerrit_main_numbers" widget differs from number of submitters in "gerrit_top_developers" widget - https://phabricator.wikimedia.org/T184741#3900414 (10Aklapper)
[12:38:59] 10Analytics-Tech-community-metrics: Include gerrit DB's "author_bot" field also in the gerrit_demo DB - https://phabricator.wikimedia.org/T184907#3900416 (10Aklapper) p:05Triage>03Low
[12:55:44] Hi elukey - Thanks a lot for the reboots this morning :)
[13:04:12] :)
[13:04:19] going to complete aqs now
[13:04:37] elukey: How may I help with java8?
[13:05:39] joal: if we resolve the issue with spark2 then we are done
[13:05:50] the main issue though is when to schedule this upgrade
[13:05:51] ok elukey - Will investigate more
[13:06:01] before or after SF?
[13:06:15] hm - I'd say after, just for safety in case of possible issues
[13:09:09] yep I am thinking the same
[13:20:31] joal: if you like we can go through the wikistats deployment process :)
[13:20:42] Yes fdans!
[13:20:45] let's do that :)
[13:20:53] cave?
[13:20:59] OMW !
[13:28:52] 10Analytics-Tech-community-metrics: One account (in "gerrit_top_developers" widget) counted as two accounts (in "gerrit_main_numbers" widget) - https://phabricator.wikimedia.org/T184741#3900475 (10Aklapper)
[13:32:34] (03PS1) 10Fdans: Release 2.1.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292
[13:32:40] so people now I'd need to reboot eventlog1001
[13:33:09] the idea would be to send an email to analytics@ explaining the maintenance, stop eventlogging completely (will affect graphs), reboot, re-enable
[13:33:14] (03CR) 10Fdans: [V: 032 C: 032] Release 2.1.4 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292 (owner: 10Fdans)
[13:33:37] mforns: o/ - anything against --^
[13:33:51] elukey, reading
[13:33:57] (03CR) 10Fdans: [V: 032 C: 032] "Merging to update release" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/404292 (owner: 10Fdans)
[13:34:16] elukey, nothing against :]
[13:34:26] all right, proceeding :)
[13:37:06] (03PS1) 10Fdans: Release 2.1.4 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295
[13:38:11] (03CR) 10Fdans: [V: 032 C: 032] "Pushing to release branch to deploy to production" [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295 (owner: 10Fdans)
[13:39:05] !log stop eventlogging and reboot eventlog1001 for kernel updates
[13:39:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:40:51] AQS reboots completed
[13:40:58] same thing for main-codfw
[13:42:00] eventlog1001 is rebooting
[13:42:41] fdans: when you submit I can run puppet on thorium to force the git pull if you want
[13:44:02] (03Merged) 10jenkins-bot: Release 2.1.4 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/404295 (owner: 10Fdans)
[13:45:18] there you go, puppet ran
[13:45:59] thank youuuu elukey
[13:45:59] eventlog1001 back in service
[13:46:15] Thank you elukey :)
[13:47:22] * joal now has witnessed his first deployment of wikistats2 !!!
[13:47:57] huzzaaaaa \o/
[13:54:50] 10Analytics-Kanban, 10Pageviews-API: Use country ISO codes instead of country names in top by country pageviews - https://phabricator.wikimedia.org/T184911#3900513 (10fdans)
[13:55:26] 10Analytics-Kanban, 10Pageviews-API: Use country ISO codes instead of country names in top by country pageviews - https://phabricator.wikimedia.org/T184911#3900527 (10fdans)
[13:55:28] 10Analytics-Kanban: Add ISO code to AQS data per country - https://phabricator.wikimedia.org/T184748#3900529 (10fdans)
[13:56:06] (03PS1) 10Fdans: Use ISO country codes instead of country names [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911)
[14:01:58] (03CR) 10Joal: "Small update in commit message" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:05:27] (03PS2) 10Fdans: Load cassandra with ISO country codes instead of country names [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911)
[14:05:45] (03CR) 10Fdans: Load cassandra with ISO country codes instead of country names (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:06:40] (03CR) 10Joal: [V: 032 C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/404297 (https://phabricator.wikimedia.org/T184911) (owner: 10Fdans)
[14:07:06] thank you joal !
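The refinery patch reviewed above replaces country names with ISO codes before loading Cassandra. A toy sketch of that kind of transformation follows; the mapping table and `to_iso` helper are hypothetical illustrations, not the actual refinery code (which does this in the Hadoop pipeline):

```python
# Hypothetical illustration of swapping country names for ISO 3166-1
# alpha-2 codes in (country, pageviews) rows; not the refinery patch itself.
NAME_TO_ISO = {
    "France": "FR",
    "Germany": "DE",
    "United States": "US",
}

def to_iso(rows):
    """Return rows with the country field rewritten to its ISO code.

    Unknown names are passed through unchanged rather than dropped.
    """
    return [(NAME_TO_ISO.get(country, country), views) for country, views in rows]

print(to_iso([("France", 120), ("Germany", 80)]))  # [('FR', 120), ('DE', 80)]
```

ISO codes make the AQS per-country data stable across renames and easy to localize on the client side, which is the motivation stated in T184911.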
[14:07:15] Thank YOU fdans :)
[14:11:08] (03CR) 10Fdans: [C: 032] Add documentation links to each metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402466 (https://phabricator.wikimedia.org/T183188) (owner: 10Milimetric)
[14:39:56] (03PS9) 10Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[14:41:43] * fdans is wondering why the damn hell the wikiselector is not working in his local env
[14:43:44] elukey: labs cluster is in bad shape - 1 lost node and 1 unhealthy one :(
[14:43:44] (03CR) 10jerkins-bot: [V: 04-1] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans)
[14:51:44] joal: I was about to tell you that I was experimenting a bit with hive
[14:52:01] lost node where? hdfs/yarn/etc ?
[14:52:14] elukey: no problem - Yes, nodes down from YARN UI
[14:52:51] which ones?
[14:54:06] elukey: lost worker-2, unhealthy worker-1, working worker-3, but jobs are not launching, so probably stuck because of too much down
[14:57:03] going to fix the cluster in 1 min
[14:59:16] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900728 (10elukey) About Hive, I tried to re-apply the changes to the metastore and this is the difference in ps: ``` elukey@hadoop-coordinator-1:~$ ps...
[15:01:30] java.io.IOException: No space left on device
[15:01:33] there you go
[15:02:15] Arf elukey :(
[15:02:31] A-team - Gone to grab Lino, will be back for standup, maybe a bit late
[15:02:45] ah yes hdfs data :D
[15:03:02] it is filling up all the disk space
[15:43:59] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900880 (10elukey) So with a better ps what happens is clear: ``` hive 318 0.0 0.0 13396 3180 ? S 15:35 0:00 bash /usr/lib/hive/bi...
[16:01:28] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900898 (10Nuria) If the server that fronts mediawiki downloads is backed up by varnish (is it?) this data exists in hadoop most likely. Can @Legoktm answer this quest...
[16:03:03] a-team, standup?
[16:04:49] 10Analytics-Kanban, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794#3900899 (10elukey) I added a `timeout 3` bash command and it worked fine, but then a similar issue re-happened when I tried to restart the metastore serv...
[16:05:35] elukey?
[16:06:02] ahahha sooorrryyyy
[16:06:09] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900905 (10Addshore) Looking at the response headers dumps.wikimedia.org is not behind varnish {F12576840}
[16:06:11] I was super into a task and didn't see the hour
[16:06:13] coming mforns !
[16:07:03] mforns: it's MLK Day here, just in case other US people aren't there
[16:23:15] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900927 (10Legoktm) >>! In T119772#3900898, @Nuria wrote: > If the server that fronts mediawiki downloads is backed up by varnish (is it?) this data exists in hadoop mo...
[16:24:05] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3900929 (10Addshore) >>! In T119772#3900927, @Legoktm wrote: >>>! In T119772#3900905, @Addshore wrote: >> Looking at the response headers dumps.wikimedia.org is not beh...
[16:27:09] (03CR) 10Nuria: [C: 032] "Let's choose a schema that will benefit from it being loaded into druid and proceed with setting up crons and such." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[16:32:55] (03Merged) 10jenkins-bot: Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[16:45:56] Hi a-team - tasking cancelled I guess?
[16:46:14] joal, yes, we're very few, and only 1 task in incoming
[16:46:14] I'm super sorry for having missed standup :(
[16:46:22] k
[16:46:24] np :]
[16:46:29] Thanks for the headsup mforns :)
[16:46:39] :]
[16:46:57] hey joal btw, I think we should switch ops week
[16:47:11] no?
[16:47:35] mforns: very possible - better for you?
[16:48:03] joal, yes sure!
[16:48:21] mforns: same for me, so if better for you, let's do it :)
[16:48:25] joal, next week I'll be on vacation on Monday and Tuesday, preparing for the trip
[16:48:39] mforns: ok great - I leave you this week then :)
[16:48:42] I'll take next
[16:48:44] ok, perfect
[16:48:50] today was lost though
[16:48:56] no worries mforns :)
[16:49:09] I'm not like, **counting** days of ops ;)
[16:49:10] ok
[16:49:18] hehe
[16:51:59] Heya elukey - Sorry to have dropped earlier on - Can we try and fix the labs cluster?
[16:57:14] joal: hey, the fix is to delete some hdfs data
[16:57:24] elukey: Ha!
[16:57:26] Can do :)
[16:58:35] :D
[17:03:50] elukey: SUPAWEIRD:
[17:03:55] Filesystem Size Used Available Use%
[17:03:55] hdfs://analytics-hadoop-labs 111.5 G 30.4 G 1.1 G 27%
[17:03:56] Filesystem Size Used Available Use%
[17:03:56] hdfs://analytics-hadoop-labs 111.5 G 30.4 G 1.1 G 27%
[17:03:58] sorry
[17:05:38] elukey: And interestingly, the biggest use is from the refinery package :)
[17:07:11] some nodes are down so maybe the "available" is for that reason?
[17:07:23] probably elukey
[17:08:46] root partitions on the datanodes are 99/100 used
[17:09:53] so if we manage to remove some data they might get better
[17:10:57] elukey: probably logs - will look
[17:13:09] elukey: I've deleted data on HDFS, but down nodes don't delete it (obviously ...)
[17:13:27] Shall I delete /var/lib/hadoop/data manually?
[17:14:36] joal: all datanodes are up as far as I can see, they should be working
[17:14:42] hm
[17:16:17] if I had more judgement I'd have probably used more powerful instances :(
[17:17:33] elukey: something weird here: looks like we have 19G available on each worker, but HDFS sees 115G available total - weird
[17:18:44] is data replicated 3 times? If so it might be too much, two is ok for our use case
[17:18:50] it might save some space
[17:19:08] elukey: We can even set it to 1
[17:19:12] in hdfs-site
[17:19:52] elukey: however I still find it weird that hdfs was formatted seeing 111.5G available while we only have 19*3 = 57G
[17:20:18] it looks like double the real space
[17:26:49] joal: logging off for today, if you have time we can experiment tomorrow
[17:27:05] elukey: sure, thanks
[17:27:20] have a nice evening!
[17:27:24] bye :)
[19:08:29] (03CR) 10Ottomata: "Nice!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403891 (owner: 10Joal)
[20:01:38] Gone for tonight team - See you tomorrow
[23:42:24] 10Analytics, 10MediaWiki-Releasing: Create dashboard showing MediaWiki tarball download statistics - https://phabricator.wikimedia.org/T119772#3901647 (10demon) It is in Hadoop, we just need to surface the data somehow.
[23:55:11] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Have "Last Attracted Developers" information for Gerrit (already exists for Git) automatically updated - https://phabricator.wikimedia.org/T151161#3901657 (10Aklapper)
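For reference, the replication-factor change discussed at 17:18-17:19 would look roughly like the fragment below in hdfs-site.xml. This is a sketch, not a patch from the log: the value 2 follows elukey's "two is ok for our use case", and the stock Hadoop config layout is assumed.

```xml
<!-- hdfs-site.xml (sketch): lower the block replication factor so a small
     labs cluster does not triple its disk usage; the Hadoop default is 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Note that this setting only affects newly written files; blocks already on HDFS keep their replication factor unless changed explicitly with `hdfs dfs -setrep`.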