[05:57:50] (CR) Elukey: [V: 2 C: 2] Upgrade to upstream 1.8.1 version [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/469198 (https://phabricator.wikimedia.org/T197276) (owner: Elukey)
[06:07:55] morning!
[06:08:07] so for the druid upgrade:
[06:08:31] 1) I downloaded 0.11 debs to each host from apt so we have a quick rollback in case something goes wrong
[06:08:40] 2) uploaded to apt 0.12.3-1
[06:08:47] 3) merged the turnilo changes
[06:09:13] I'd say that at around 10 CEST we start stopping indexation jobs just to be sure
[06:09:18] and then we proceed :)
[06:33:39] (PS1) Elukey: Add za.wikimedia to the pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/469557
[07:08:13] Analytics-Tech-community-metrics, Code-Health, Release-Engineering-Team (Kanban): Develop canonical/single record of origin, machine readable list of all repos deployed to WMF sites. - https://phabricator.wikimedia.org/T190891 (Quiddity)
[07:25:40] I'm with you elukey :)
[07:31:16] joal: morning! So I might have found a way to integrate https://issues.apache.org/jira/browse/HIVE-12582 in our puppet repo, didn't think about it before
[07:31:27] it should allow us to add the prometheus jmx exporter to hive server
[07:31:33] but I need to test it
[07:31:47] \o/ !!
[07:31:58] Thanks for keeping searching elukey :)
[07:32:22] elukey: question
[07:32:49] https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/469499 has been merged - I'm assuming that in order to be deployed, we need to update the CDH submodule in main puppet?
[07:38:01] joal: I think that andrew deployed it, lemme check
[07:38:21] elukey: I'm sure he deployed one of the two (the previous one), but I wonder about that one
[07:39:34] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469515/
[07:40:00] hm
[07:40:17] you are not convinced :D
[07:40:28] This is bizarre then - The supposedly created folder still doesn't exist on stat1004 for instance :(
[07:42:42] Ah ! You've been faster for za.wikimedia
[07:42:45] elukey: --^
[07:42:59] elukey: many thanks for that - It's my ops week, I should have done it yesterday
[07:43:00] did it earlier on :)
[07:43:06] nah
[07:43:35] elukey: Have you manually patched the list?
[07:44:10] nope still haven't, was waiting for the review
[07:44:26] mmmm it is strange that the tmp file is not deployed
[07:44:28] right - will review and update the list (except if you have a strong will to do it :)
[07:44:47] I have not :D
[07:45:40] (CR) Joal: [V: 2 C: 2] "Merging !" [analytics/refinery] - https://gerrit.wikimedia.org/r/469557 (owner: Elukey)
[07:46:01] elukey@cumin2001:~$ sudo cumin 'R:file = "/tmp/hive-parquet-logs"' 'ls -l' --dry-run
[07:46:04] 57 hosts will be targeted:
[07:46:06] an-coord1001.eqiad.wmnet,analytics[1028-1077].eqiad.wmnet,analytics-tool1001.eqiad.wmnet,notebook[1003-1004].eqiad.wmnet,stat[1004-1005,1007].eqiad.wmnet
[07:46:09] DRY-RUN mode enabled, aborting
[07:46:28] hm - That means the folder exists?
[07:46:34] it is in the puppet catalog
[07:47:09] ahem ... /me doesn't know what the puppet catalog does :(
[07:47:09] but I can see it only on stat1005.eqiad.wmnet
[07:47:30] right - I manually created that one as a test before the patch
[07:47:55] ah ok
[07:48:05] so the puppet catalog is basically a list of resources for each host
[07:48:18] if you define a File in a manifest for example it gets added
[07:48:32] ok
[07:49:56] and those resources are automagically created on the host once puppet runs on the system?
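(For reference, this is roughly what the File resource under discussion looks like when tried stand-alone with puppet apply - an illustrative sketch, not the actual cdh module code; the mode is an assumption:)

    # 'ensure => directory' is the line that tells puppet to create a directory
    # rather than a plain file; this is the detail joal later realizes he had missed.
    sudo puppet apply -e "file { '/tmp/hive-parquet-logs': ensure => 'directory', mode => '0777' }"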
[07:50:29] yes they get created according to the config
[07:50:38] /tmp/hive-parquet-logs is a dir, right?
[07:50:41] yes
[07:51:20] I wonder if my patch was correct in that regard elukey
[07:51:43] I don't remember having done anything to make puppet know it's a dir
[07:51:50] let's try with https://gerrit.wikimedia.org/r/469563
[07:52:37] !log Manually add za.wikimedia to pageview-whitelist (patch merged: https://gerrit.wikimedia.org/r/469557)
[07:52:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:53:59] I am updating the submodules in puppet
[07:56:17] joal: Notice: /Stage[main]/Cdh::Hive/File[/tmp/hive-parquet-logs]/ensure: created
[07:56:20] stat1004 :)
[07:56:39] Many thanks elukey :)
[07:57:07] elukey@stat1004:~$ ls -l /tmp/hive-parquet-logs
[07:57:07] total 0
[07:57:14] ok mystery solved :)
[07:57:22] elukey: I knew it was bizarre that my puppet was correct on the first try :)
[07:57:53] elukey: now testing the reason for that folder to exist
[07:58:04] ??
[07:58:38] \o/ it works :)
[07:58:38] joal: this is my proposal for hive https://gerrit.wikimedia.org/r/469562
[07:58:55] elukey: if you ls now in the folder, you'll see :)
[07:59:44] I'm going to let chelsyx and bearloga know - I think they'll be happy
[07:59:53] ah ok you were saying "testing the new thing", I thought that you were looking into why we had to add it :D
[07:59:56] nevermind
[07:59:59] :)
[08:04:38] joal: let's stop indexations in say 10/15 mins?
[08:04:51] elukey: checking current state
[08:06:17] elukey: we are currently refining webrequest hour 7, and indexing stuff for hour 6
[08:06:51] I suggest we wait for the current indexations to be done, and suspend the jobs (to prevent hour 7 from starting once refine is done)
[08:07:27] elukey: only 2 jobs to suspend: webrequest and pageview - the other ones are daily
[08:09:32] +1
[08:10:58] !log Suspend webrequest-druid-hourly and pageview-druid-hourly oozie jobs
[08:10:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:11:02] elukey: --^
[08:11:21] super thanks!
[08:15:13] I added the patch to the hive server sh file
[08:15:26] so now I can test again in labs the prometheus stuff
[08:16:10] awesome elukey :)
[08:41:47] it works!!!
[08:41:50] \o/
[08:42:02] ! \o/
[08:42:12] hiveServer2 metrics, here you come :)
[08:42:23] sadly mbeans only for jvm
[08:51:25] joal: ready for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469585/
[08:51:39] do you think that we could squeeze in a hive restart? :P
[08:52:14] elukey: I'm a squeezing master - no problem :)
[08:54:13] all right merging changes, an-coord1001 with puppet disabled
[08:54:23] I am seeing hive jobs in yarn uff
[08:54:37] elukey: I don't
[08:54:47] elukey: I think now is actually good
[08:54:55] goooood they are done!
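(The prometheus jmx exporter mentioned above is normally wired into a JVM daemon via a -javaagent flag; a minimal sketch of what the hive-server2 startup script change could look like - the jar path, port and config file location here are assumptions, not the actual patch:)

    # Hypothetical paths and port; the agent exposes the JVM mbeans as prometheus
    # metrics on the given address once the daemon is restarted with these opts.
    export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=0.0.0.0:9183:/etc/prometheus/hive_jmx_exporter.yaml"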
doing it
[08:56:33] !log restart hive-server on an-coord1001 to pick up new prometheus settings
[08:56:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:57:11] aaand hive -> show databases works fine
[08:57:13] \o/
[08:57:38] elukey: we should test with beeline as well (to use hiveserver2)
[08:57:55] works :)
[08:58:00] * elukey dances
[08:58:29] the last daemon not covered seems to be oozie
[08:58:45] elukey: indeed, the last daemon
[09:01:34] joal: https://grafana.wikimedia.org/dashboard/db/analytics-hive
[09:01:36] yessss
[09:02:00] * joal bows to elukey - Master of metrics
[09:02:09] \o/
[09:02:47] so now we can see if Neil's metric causes troubles
[09:02:54] yes indeed
[09:07:24] elukey: I also found config params that could be interesting in that regard
[09:08:10] anything interesting?
[09:08:33] possibly - need to check
[09:14:03] joal: so we could deploy turnilo now and see if there is anything weird with druid 0.11
[09:14:21] ok elukey
[09:14:27] all right
[09:14:27] following you
[09:15:03] elukey: after more and more reading, it seems that the memory limitation is actually due to a general hadoop setting - Need to triple check though
[09:16:04] Turnilo (version 1.8.1) is open source under the Apache 2.0 license.
[09:16:07] done :)
[09:16:19] !log upgrade turnilo to 1.8.1
[09:16:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:16:52] elukey: I don't know if it is related to the turnilo update or druid being better, but man it's snappy !!!
[09:17:19] also elukey:
[09:17:19] https://turnilo.wikimedia.org/#unique_devices_per_domain_monthly/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ACgqwATJXlUg7MGuiy4CVgIwBNSy0daQRMTEM7SSh5TXwfAF8AXSSmKE0kNEdnDW1uCyY7CDZsKE9TcyL9EABzd2wYBFoIDW5fAFko8Po8UENjAjN8lwhDcgwcbjgocmJsQuxqkHimJBZm/HqEBBSQNim3YkdQaDaGjHwpRChiVIgFhGCYbAgARxhDk0PWdCqWM4gVCAnq93lBPtEij9JMC3h
[09:17:24] 9MFIpNd6EwYaCTE87AovpD5CBkkxNHcSHYACKVEqeLKJAlE4h2ADKXW4qI+2JYUJWxGqs0ieE2CCYlAg1UoSBF3X58SAA===
[09:17:27] oops sorry
[09:17:51] elukey: https://gist.github.com/jobar/0725225f092bfe17038cf41bbaa75ef1
[09:17:54] \o/ !!!!
[09:18:17] the x axis is fixed!
[09:18:36] elukey: I think this is a strong yes for this new version :D
[09:18:50] nice, kudos to them :)
[09:19:03] it would be nice to give them some feedback
[09:19:23] now IIRC marcel was joining for the druid upgrade at 12?
[09:19:52] elukey: IIRC it was the opposite - Marcel wanted to join for turnilo :)
[09:20:10] But yes timing was 12
[09:20:26] nono he wanted to see the druid upgrade
[09:20:33] Ah ok - my bad :)
[09:34:13] so my rsync for home dirs stat1005 -> stat1007 is still running from yesterday :P
[09:41:31] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey) Hi @mpopov! We are in the process of migrating everybody from stat1005 to stat1007 (not announced yet for users) but I am wondering if I could move the `statistics:...
[09:43:14] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[09:43:42] brb in 15 mins :)
[09:56:32] elukey: I THINK I FOUND IT !
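(The beeline smoke test mentioned above exercises HiveServer2 itself, unlike the hive CLI which talks to the metastore directly; a sketch, assuming HiveServer2 runs on an-coord1001 on its default port 10000:)

    # Connection string assumed; a trivial query is enough to prove the server answers.
    beeline -u 'jdbc:hive2://an-coord1001.eqiad.wmnet:10000' -e 'SHOW DATABASES;'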
[09:56:51] elukey: no bump in hiveserver2 memory usage when running the query
[09:57:09] elukey: therefore it must be a parameter - Well it is well hidden :)
[09:58:08] elukey: https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
[09:58:15] elukey: look for mapred.child.java.opts
[10:01:13] * elukey reads https://stackoverflow.com/questions/24070557/what-is-the-relation-between-mapreduce-map-memory-mb-and-mapred-map-child-jav
[10:01:34] very interesting!
[10:02:18] given the heap consumption values and this discovery I'd propose to lower the hive server/metastore Xmx
[10:02:41] elukey: maybe not, if we bump mapred.child.java.opts?
[10:03:20] sure, one thing at a time, but I think that those heaps are huge
[10:03:27] elukey: my view is that this parameter is not used by mapreduce per se but by the local task of hiveserver2
[10:03:36] I hear that elukey
[10:04:02] the thing that I am worried about is that it is more an overhead than a benefit
[10:04:10] but, let's first tweak mapred.child.java.opts
[10:04:15] then we see how the heap changes
[10:04:34] have you issued a big query now?
[10:04:41] I can see some interesting GC metrics
[10:04:44] for the server
[10:05:01] elukey: nope, only small queries on my side
[10:06:20] an interesting thing could be to apply CMS for the old gen
[10:07:26] yeah even cloudera recommends it
[10:10:09] anyway, given it is a bit late for our schedule I'd say to start the druid upgrade
[10:10:17] then we'll fill marcel in when he joins
[10:10:41] hey team
[10:10:45] o/
[10:10:47] gooooodd
[10:10:50] I was about to start :)
[10:10:50] elukey: you know how to call for Marcel :)
[10:11:03] sorry for being 10 minutes late!
[10:11:15] so following http://druid.io/docs/latest/operations/rolling-updates.html I'd start with the historicals
[10:11:19] one at a time
[10:11:32] ok, are you guys in da cave, or just here?
[10:11:50] just here
[10:11:52] k
[10:12:45] ok so first thing I am installing druid-commons, that is the deb containing all the shared libs
[10:13:01] that will be picked up by each daemon after the restart
[10:13:25] and now I've just upgraded the historical on druid1001
[10:13:33] https://grafana.wikimedia.org/dashboard/db/druid
[10:14:02] ok, I'm following that
[10:14:22] so in the logs the daemon is loading the segments
[10:14:31] like
[10:14:32] 2018-10-25T10:14:26,018 INFO io.druid.server.coordination.SegmentLoadDropHandler: Loading segment[3688/3942][mediawiki_history_beta_2018-04-01T00:00:00.000Z_2018-05-01T00:00:00.000Z_2018-10-08T16:33:51.393Z_5]
[10:15:44] seems ready elukey (from druid-coord UI)
[10:15:46] memory-mapping?
[10:15:58] super
[10:16:45] I don't recall if it uses memory mapping explicitly or only leverages the linux page cache, I'd say the latter
[10:18:07] historical on 1002 upgraded
[10:19:32] loaded all segments
[10:19:59] ok
[10:20:46] coordinator's metrics look good, joal ok to restart the last historical?
[10:20:56] +1~
[10:20:57] !
[10:20:59] sorry :)
[10:22:33] ok 1003 upgraded as well, segments loaded
[10:23:12] now it is the turn of the overlords
[10:23:21] cache size dropped as expected :)
[10:24:34] mysql conns look good
[10:24:40] where do you see that joal, I can not see that in any grafana dashboard...
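(mapred.child.java.opts, the parameter joal points at above, sets the JVM options - notably the heap - that hadoop passes to spawned task JVMs, and per his theory also to HiveServer2's local tasks; it can be overridden per session to test the hypothesis. A sketch, with the -Xmx value purely illustrative and the JDBC URL assumed as before:)

    # The failing join query would follow the SET in the same beeline session.
    beeline -u 'jdbc:hive2://an-coord1001.eqiad.wmnet:10000' -e 'SET mapred.child.java.opts=-Xmx1024m;'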
[10:24:56] mforns: in grafana, historical section
[10:25:13] turnilo serves data as well
[10:25:17] So far so good :)
[10:25:46] can see it now, was looking at druid_public (default)
[10:26:12] overlords done, proceeding with the middlemanagers
[10:27:39] done
[10:28:04] now the brokers, all good so far?
[10:28:06] elukey: testing those will need indexation jobs :)
[10:28:21] yep yep
[10:28:37] in half an hour the EL2Druid indexation will trigger
[10:29:06] broker on 1001 is up
[10:29:22] elukey: I forgot about EL2Druid !
[10:29:34] elukey: lucky we are, we could have broken those
[10:29:45] I am sorry mforns :(
[10:30:05] I suspended the oozie indexation jobs but didn't think of EL2Druid
[10:30:11] broker on 1002 coming up
[10:30:13] no prob, it was done for sure at 12:10h and won't start until 13:00h
[10:30:55] (PS8) Fdans: Add change_tag to mediawiki_history sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/465416
[10:30:55] aaaand all brokers up
[10:31:00] last ones, the coordinators
[10:31:03] and then we are done
[10:31:17] :]
[10:31:39] (CR) Fdans: Add change_tag to mediawiki_history sqoop (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/465416 (owner: Fdans)
[10:32:06] (CR) Fdans: [V: 2] Upgrade packages and commit package-lock to remove vulnerabilities [analytics/aqs] - https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474) (owner: Fdans)
[10:32:59] so an interesting trick that moritz taught me is to use lsof -Xd DEL
[10:33:02] when upgrading
[10:33:29] this will tell you the files that were deleted but are still referenced by a file descriptor, so still in use by a daemon
[10:33:32] elukey --verbose
[10:33:34] Ah
[10:33:44] nice trick !
[10:33:53] for example, when I upgraded druid-common a ton of those popped up
[10:33:57] and now they are gone
[10:34:11] aha
[10:34:43] everything upgraded!
[10:35:23] !log upgraded Druid on druid100[1-3] to 0.12.3-1
[10:35:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:45] elukey: launching indexation back?
[10:35:49] +2
[10:36:34] !log Resuming oozie webrequest and pageview druid hourly indexation jobs
[10:36:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:36:46] Hallo.
[10:37:31] overlord master is 1001 (if anybody wants to check the ui)
[10:37:40] https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Druid
[10:37:42] mforns: --^
[10:37:53] If I use mw.track() to log something to EventLogging, and there is a parameter that doesn't appear in the schema, what will happen?
[10:38:15] elukey, yes, already on it
[10:38:28] Will it be silently ignored? Or will the logging fail? Or something else?
[10:38:55] oh, no, I'm on the coordinator
[10:39:13] also overlord
[10:39:19] elukey: webrequest indexing task started
[10:40:45] aharoni, if the field is optional, nothing will happen, the field will receive the value NULL in the DB
[10:41:14] if the field is required, then validation will fail in the eventlogging_processor, and the event will not make it to the database
[10:41:30] mforns: I think it's the opposite - the value seems not to be present in the schema
[10:41:33] oh!
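(The lsof trick quoted above, spelled out: -d DEL selects entries whose file descriptor is of type DEL, i.e. deleted-but-still-mapped files, so right after a package upgrade it shows which daemons are still running the old code until restarted; the grep filter is just an example:)

    # Any druid process still holding old jars shows up here until it is bounced.
    sudo lsof -Xd DEL | grep druid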
[10:41:36] understood
[10:41:47] ha good question
[10:42:13] aharoni, if I recall correctly, it also fails validation
[10:43:05] we had such a case a couple days ago, where some fields were renamed before sending the event, and thus were recognized by the eventlogging_processor as not pertaining to the schema, raising errors
[10:46:28] mforns, joal - thanks
[10:46:42] np :]
[10:49:56] joal: indexations succeeded right!!??
[10:49:59] seems all good
[10:50:47] :]
[10:51:00] ah wow query/cache/caffeine/delta/evictionBytes
[10:51:15] this is something that we could add to the druid exporter
[10:51:34] Nice
[10:51:42] elukey: indeed, everything looks good :)
[10:53:40] elukey: back to my hive issue
[10:54:01] elukey: I think I have explanations, and they are really weird!
[10:54:52] so druid upgraded! \o/
[10:56:40] joal: I am all ears, lemme know :)
[10:57:35] elukey: actually I need more investigation again - But the issue seems related to running hadoop in local mode, more than to memory issues
[11:03:13] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (elukey) p: Triage>Normal
[11:03:42] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (elukey) Turnilo 1.8.1 has been deployed, as far as I can see the issue seems fixed!
[11:07:33] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (elukey) druid100[1-3] upgraded today, we'll proceed with druid public on monday if no issue will be registered!
[11:07:47] joal: ok to upgrade druid public on monday?
[11:07:51] if nothing comes up
[11:07:56] yessir !
[11:09:48] EL2Druid indexations also successful
[11:10:08] niceeee
[11:23:08] Man - This hadoop debugging is hell
[11:34:07] :(
[11:35:58] elukey: I have found an interesting param: hive.mapred.local.mem
[11:36:03] however, it seems unrelated
[11:39:14] we are talking about T206279, right?
[11:39:15] T206279: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279
[11:39:22] correct elukey
[11:40:01] in the task though I don't see an OOM error.. I was trying to read the exact error that we are trying to solve
[11:40:13] elukey: indeed
[11:40:29] elukey: the problem comes from a command that fails
[11:40:51] elukey: HiveServer2 runs a hadoop command when starting a job
[11:41:04] And the one it tries to launch in that precise case fails
[11:41:53] but we don't have more info about the why
[11:42:30] (I was about to add a comment for the hive-server2 heap size metrics)
[11:42:48] anyway, need to step afk, going to test my shoulder at the gym for a bit :)
[11:42:54] let's see if it has recovered :)
[11:42:56] nope elukey - I'm fighting to try to get that why more precisely
[11:43:09] I'll try to help when I am back!
[13:04:22] joal: I'm back online, what's this about changes to hive that Chelsy and I will be happy about? :D
[13:26:33] Analytics, Analytics-Data-Quality, Contributors-Analysis, Product-Analytics, Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (Ottomata) Great thanks! Fine with Edit2, as long as there are good docs explaining as you say....
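(Events failing validation in eventlogging_processor are emitted as EventError events rather than silently dropped; a sketch of how one might watch for them, reusing the kafkacat pattern that appears later in this log - the topic name is an assumption based on the usual eventlogging_<Schema> convention:)

    # -o end tails only newly produced error events instead of replaying history.
    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError -o end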
[13:45:56] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Decide whether to use schema references in the schema registry - https://phabricator.wikimedia.org/T206824 (Ottomata) BTW, we should probably be using `$id` for schemas using their versio...
[13:46:50] Analytics-EventLogging, Analytics-Kanban, EventBus, Services (watching): Modern Event Platform: Schema Registry Implementation - https://phabricator.wikimedia.org/T207869 (Ottomata) OOPs past me had already done this and forgot. DOH merging in.
[13:48:18] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[13:48:28] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[13:48:34] Analytics-EventLogging, Analytics-Kanban, EventBus, Services (watching): Modern Event Platform: Schema Registry Implementation - https://phabricator.wikimedia.org/T207869 (Ottomata)
[13:48:40] bearloga: o/ - try to use hive, no more parquet spam :)
[13:50:29] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[14:05:01] renamed the dashboard to ikimedia/ottomata) to #wikimedia-analytics │
[14:05:09] argh
[14:05:18] how the hell did I copy/paste that :D
[14:05:19] anyhow
[14:05:35] renamed the dashboard to https://grafana.wikimedia.org/dashboard/db/hive
[14:08:36] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata) @Pchelolo https://snowplowanalytics.com/blog/2014/05/15/introducing-self-desc...
[14:10:24] Heya bearloga - Normally no more painful parquet-logs in the middle of hive results :)
[14:11:32] ncue
[14:11:33] nice
[14:11:46] ooooooooh
[14:12:01] ottomata: I had forgotten the 'directory' line for folder creation in puppet :(
[14:12:08] joal: 🙌
[14:12:14] elukey: we should probably put the hadoop_cluster label on the hive jvms too
[14:12:18] joal me too!
[14:13:45] Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (elukey) Neil, today we added (finally! \o/) JVM metrics to the Hive server, you can see them in https://grafana.wikimedia.org/dashbo...
[14:14:18] elukey, ottomata - The error about --^ is actually trickier than just memory
[14:15:20] (CR) Milimetric: Memoizing results of state functions (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: Nuria)
[14:15:26] Analytics, Analytics-Kanban, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (elukey) Druid upgraded on druid100[1-3], so we can finally start looking into this :)
[14:16:53] I have spent the day troubleshooting, and it seems that the hive MapredLocalTask (called from the hive ExecDriver) tries to execute the wrong jar in hadoop when run in local-task mode (see
[14:17:06] lovely
[14:17:38] In the hive server log: Executing: /usr/lib/hadoop/bin/hadoop jar /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan .......
[14:18:07] And when trying to execute this command manually, well, the org.apache.hadoop.hive.ql.exec.mr.ExecDriver class is not available in /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar
[14:18:21] The correct jar to be used would be /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.10.0.jar
[14:18:54] There are conf utilities to get the jar to run, and in the end it all comes from the jar set up for the original job
[14:19:57] I am kinda lost
[14:20:36] elukey: no wonder :)
[14:20:47] wuut weird
[14:21:26] elukey: the weirdest part is that hive actually runs VERY differently for a regular yarn job vs a local one
[14:21:44] regular yarn jobs are handled through the hadoop job conf, settings and all
[14:22:07] local jobs are launched through a CLI !
[14:22:16] what are local jobs?
[14:22:27] elukey: local-tasks I should say
[14:22:56] elukey: They are a specific type of task (map or red) happening locally, to avoid the yarn overhead
[14:23:01] They are useful for small data
[14:24:36] !log added AAAA DNS records to all the druid nodes
[14:24:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:26:36] joal: I am wondering - is it fine if I upgrade druid public now? The new version looks very stable and performant
[14:27:01] +1 elukey - I need to leave to catch the kids, but I feel safe with you doing it :)
[14:27:13] famous last words :D
[14:27:18] all right
[14:27:23] ottomata: you ok with me upgrading druid public?
[14:27:53] elukey: if you may - Gently on the historicals (for segments and cache) - please
[14:28:00] Apart from that, all good
[14:28:09] elukey: cool yeah!
[14:28:10] I am always gentle with historicals :D
[14:28:17] :)
[14:28:20] i'm here if you need me
[14:28:25] ack!
[14:28:50] !log upgrade druid on druid100[4-6] to Druid 0.12.3
[14:28:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:55] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson) a: Cmjohnson>RobH @robh added label on server, added to switch asw-a-eqiad ge-6/0/18 up up weblog1001 and in private1-a
[14:58:48] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[14:59:07] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[14:59:25] druid public upgraded!
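(A quick way to confirm the wrong-jar diagnosis above - checking which of the two jars actually ships the ExecDriver class; a sketch using the exact paths quoted in the conversation:)

    # Only hive-exec should match; hive-common does not contain the MR exec classes.
    for jar in /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar \
               /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.10.0.jar; do
      unzip -l "$jar" | grep -q 'org/apache/hadoop/hive/ql/exec/mr/ExecDriver.class' \
        && echo "ExecDriver found in $jar"
    done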
[14:59:50] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[15:05:23] elukey: NICE with turnilooo
[15:06:42] \o/
[15:06:52] Analytics, Operations, ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (Cmjohnson) Open>Resolved a: Cmjohnson I had one remaining 4TB spare disk on-site. Replaced the disk, cleared the cache and all disks are back
[15:09:58] (PS2) Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968)
[15:10:23] nuria milimetric I think this one pleases everyone :)
[15:10:43] fdans: we are SO high maintenance
[15:12:29] WAIT DONT REVIEW I GOTTA REPLACE THE TABS WITH SPACES
[15:12:33] goddammit
[15:12:35] elukey: and monthly granularity mostly works cc joal wow
[15:13:46] (PS3) Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968)
[15:15:19] elukey: https://bit.ly/2q7dBcq
[15:19:26] Analytics, Analytics-Kanban: Reboot Analytics hosts for kernel security upgrades - https://phabricator.wikimedia.org/T203165 (Cmjohnson)
[15:19:30] Analytics, Operations, ops-eqiad: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (Cmjohnson) Open>Resolved @elukey okay
[15:21:36] Analytics, Operations, ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (Cmjohnson) @elukey dbstore1002 is out of warranty and has 1.2T disks. I don't have disks this size but can replace with a 2TB disk..
[15:27:50] milimetric: after having taken a look at the transpiling of html templates (which happens for all browsers, regardless of whether they support templates or not)... mmm ahem... i think string concatenation is much simpler
[15:29:49] nuria: sorry I was in a meeting - all good right?
[15:30:13] elukey: ya, super good actually, there is data we have not been able to plot in turnilo for 1.5 years that we can see now
[15:31:22] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (elukey) Druid public upgraded too!
[15:33:01] \o/
[15:36:20] !log shutdown aqs1006 to replace one broken disk
[15:36:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:41:46] Can one of you experienced sqoopers please check my command? I'm too scared to run it without having ever sqoop'd before.
https://www.irccloud.com/pastebin/klkfNkZI/wb_terms_sqoop
[16:02:41] ping milimetric
[16:10:43] bearloga: plenty of sqoop docs here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Dumping_data_via_sqoop_from_eventlogging_to_hdfs
[16:10:55] nuria: yup, that's what I'm using as reference
[16:12:51] nuria: joal is helping me right now :) I'll def ping milimetric if I need more help
[16:13:05] bearloga: milimetric is out today
[16:20:46] Analytics, Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (elukey)
[16:22:12] Analytics, Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (Cmjohnson) I sent HP a diagnostic log showing disk 5 as failed {F26794607} {F26794615}
[16:25:18] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (Nuria) Open>Resolved
[16:25:31] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (Nuria) Open>Resolved
[16:26:06] Analytics, Analytics-Kanban, Page-Issue-Warnings, Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (Nuria) Open>Resolved
[16:26:24] Analytics, Analytics-Kanban, Patch-For-Review, Readers-Web-Backlog (Tracking): Ingest data into druid for readingDepth schema - https://phabricator.wikimedia.org/T205562 (Nuria) Open>Resolved
[16:28:44] Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (Ottomata) Reopening because we have an idea: We will set `SET hive.auto.convert.join=false;` in hive-server2 only, (not for hive CL...
[16:29:09] Analytics, Analytics-Cluster, Analytics-Kanban, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (Ottomata)
[16:33:00] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (mpopov) >>! In T205846#4694101, @elukey wrote: > Hi @mpopov! We are in the process of migrating everybody from stat1005 to stat1007 (not announced yet for users) but I am w...
[16:36:20] bearloga: o/
[16:36:22] about --^
[16:36:35] if you have 5 mins and it is quick we can do it now
[16:44:29] elukey: sure
[16:44:37] \o/
[16:44:49] so as far as I can see, the cron starts at 5 AM UTC
[16:45:05] so I can deploy the class now to 1007, then disable the cron on 1005
[16:45:19] does it sound good? Will take me ~5 min
[16:45:34] sounds good
[16:46:06] gives me plenty of time to install the packages and try the scripts/queries to check whether it's safe to enable the cron on stat1007
[16:46:23] are you going to copy all the existing published-datasets files?
[16:46:50] yep I am syncing the srv dir but still wip (home dirs now)
[16:46:54] anything specific that you need?
[16:47:20] elukey: the entirety of /srv/published-datasets/discovery
[16:47:49] ah ok still not there, then let's resync next week for the move
[16:48:23] since when the cron runs it would use reportupdater and append to the existing datasets
[16:48:46] ah so it needs to be moved with report updater?
[16:48:51] still only on stat1005
[16:49:43] the discovery stuff is separate from AE's reportupdater stuff
[16:50:29] elukey: see https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/main.sh
[16:50:31] ah ok I was mixing things up
[16:51:20] okok so you only need things rsynced
[16:51:44] yep! :)
[16:52:16] bearloga: all right I'll re-ping you on Monday then
[16:52:28] elukey: works for me! :D
[16:52:29] do you have a preferred time ?
[16:53:32] us east coast 10a-12p would be perfect for this
[16:54:17] ok so ~16 CEST
[16:54:22] seems fine to me, thanks a lot!
[16:55:45] elukey: thanks for being understanding that this migration is... involved
[16:59:47] Analytics, Operations, hardware-requests, User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (Cmjohnson)
[16:59:50] Analytics, Operations, hardware-requests, User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (Cmjohnson)
[17:02:28] mforns: qq - what do we need to do to move report updater to stat1007? I mean, people to ping, etc..
[17:02:43] elukey, thinking
[17:04:26] so I can see
[17:04:27] reportupdater_reportupdater-queries-browser
[17:04:31] reportupdater_limn-language-data-interlanguage
[17:04:34] in the hdfs crontab
[17:04:53] aha only?
[17:04:59] there should be a lot more no?
[17:05:03] yeah I think the others are on stat1006
[17:05:11] oh! ok, makes sense
[17:05:19] 1005 only runs hive queries
[17:05:51] namely (on stat1006)
[17:05:52] # Puppet Name: reportupdater_limn-edit-data-edit-beta-features
[17:05:52] # Puppet Name: reportupdater_limn-language-data-language
[17:05:52] # Puppet Name: reportupdater_limn-flow-data-flow
[17:05:52] # Puppet Name: reportupdater_discovery-stats-interactive
[17:05:55] # Puppet Name: reportupdater_limn-ee-data-ee-beta-features
[17:05:57] # Puppet Name: reportupdater_limn-ee-data-ee
[17:06:00] # Puppet Name: reportupdater_limn-flow-data-flow-beta-features
[17:06:02] # Puppet Name: reportupdater_reportupdater-queries-page-creation
[17:06:05] # Puppet Name: reportupdater_reportupdater-queries-pingback
[17:06:07] # Puppet Name: reportupdater_limn-language-data-published_cx2_translations
[17:06:13] ok, so people to notify are just the interlanguage folks
[17:06:24] looking up who the owner is
[17:06:30] super thanks!
[17:10:28] elukey, it should be Amir and Kartik
[17:10:50] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) hey hey heyyy, the nodes are in! https://phabricator.wikimedia.org/T204177#4695147 How can we move this forwa...
[17:12:09] elukey, but I wouldn't say they look at that data from within stat1005
[17:12:18] rather they use their dashiki dashboard: language-reportcard.wmflabs.org
[17:12:39] does that one need to be updated when we move ?
[17:12:45] (I am super ignorant about this)
[17:13:32] elukey, the reports output by RU are rsync'ed to thorium, IIRC, and are requested from there by dashiki
[17:13:56] ah ack, so I need to verify that afterwards then
[17:13:58] so, as long as the rsync continues to copy RU reports over to former stat1001, I think all's good!
[17:14:12] super
[17:14:15] I'll take a note
[17:14:52] thanks mforns!
[17:15:03] team logging off for today, talk with you tomorrow!
[17:15:52] elukey, no problem, the reports live in /srv/reportupdater/output
[17:15:59] byeeeeee
[17:17:17] byeee
[17:33:24] dcausse: Hi! Are you still around or should I ask tomorrow?
[17:33:52] joal: in standup atm, in 15/20min?
[17:33:57] sure B!
[17:45:11] joal: I'm around
[17:45:18] yo dcausse :)
[17:45:22] yo! :)
[17:45:58] dcausse: I have noticed plenty of workflows in suspended state in oozie (mostly discovery-clicks)
[17:46:19] Is that expected?
[17:46:27] what is suspended?
[17:46:35] is it an action on our side?
[17:47:24] dcausse: The parent coord is https://hue.wikimedia.org/oozie/list_oozie_coordinator/0023681-180705103628398-oozie-oozi-C/
[17:47:34] query-clicks-hourly
[17:48:12] It is in running state, but from what I see in https://hue.wikimedia.org/oozie/list_oozie_workflows/, all the workflows it uses are in suspended state?
[17:48:14] I don't think that's expected...
[17:50:32] I have seen a similar pattern here https://hue.wikimedia.org/oozie/list_oozie_coordinator/0048633-180705103628398-oozie-oozi-C/
[17:52:43] if it fails too many times it enters this suspended state apparently
[17:53:09] can it be because it waits for too long on the webrequest and/or cirrus data?
[17:53:26] dcausse: failing too many times -- I think this is something I have not yet experienced with oozie :)
[17:54:25] ok dcausse - I think I see the problem
[17:58:52] ?
[18:06:25] dcausse: It was related to the an-coord1001 move we made a few days ago
[18:06:54] dcausse: We updated the /user/hive/hive-site.xml file and inadvertently changed its permissions, preventing the jobs from running
[18:07:28] dcausse: problem manually fixed, I'm currently rerunning all the suspended jobs, and ottomata is working on making sure we don't do that again :)
[18:13:24] !log Manually copy /etc/hive/conf/hive-site.xml to hdfs:///user/hive and set permissions to 644 to allow all users to run oozie jobs
[18:13:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:36] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (Cmjohnson)
[18:13:42] joal: thanks!!
[18:14:03] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (Cmjohnson)
[18:14:24] !log Manually resume the bunch of suspended jobs (mostly from ebernhardson and chelsyx - our apologies for not noticing earlier)
[18:14:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:55:52] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (faidon) So, this is quite the can of worms :) There are several pieces to this, and honestly, I feel like VLANs is kind o...
[19:02:27] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (Ottomata) FYI, networking considerations being worked out in {T207321}
[19:09:19] Quarry: REPORTS-68 Implement dynamic cache duration - https://phabricator.wikimedia.org/T60826 (Framawiki) Open>Invalid Not applicable to Quarry.
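(The fix from the !log entry above, spelled out as commands - a sketch; it assumes the fix was applied as the hdfs superuser:)

    # Re-publish the server's hive-site.xml to HDFS and make it world-readable,
    # so every user's oozie launchers can read it again.
    sudo -u hdfs hdfs dfs -put -f /etc/hive/conf/hive-site.xml /user/hive/hive-site.xml
    sudo -u hdfs hdfs dfs -chmod 644 /user/hive/hive-site.xml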
[19:09:21] Analytics, Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (Framawiki)
[19:12:17] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Framawiki) Can be incorporated into {T206482}
[19:13:13] Quarry: Add date when query was last run - https://phabricator.wikimedia.org/T77941 (Framawiki) Related: {T206482}
[19:16:06] Quarry, Documentation: admin docs: quarry - https://phabricator.wikimedia.org/T206710 (Framawiki) I was a bit trolling in my last comment of course. But ie. {T205150} is an essential task, that involves WMF people (join prod instance, labs one ?) and creating a monitoring tool is not in the scope of #qua...
[19:17:48] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Decide whether to use schema references in the schema registry - https://phabricator.wikimedia.org/T206824 (Ottomata) Hm a tricky bit about $refs and generating fully dereferenced schemas...
[19:22:18] Analytics, Analytics-Data-Quality, Contributors-Analysis, Product-Analytics, Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (nettrom_WMF) I've created [[ https://meta.wikimedia.org/wiki/Schema:Edit2 | Schema:Edit2 ]], an...
[19:38:44] Analytics, Analytics-Cluster, Analytics-Kanban, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (JAllemandou) >>! In T206279#4695030, @Ottomata wrote: > Reopening because we have an idea: > > We will set `S...
[19:49:54] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Wurgl) I think there is no need to run such a query by cron or similar. There will surely be forgotten queries which you execute over and over again, but no one is looking at the results. So you waste CPU-time. With some cron-like mechani...
[19:52:44] ottomata, elukey: question: I see occasional messages (very rare, like one per 2 hours) in codfw.mediawiki.revision-create - is that normal? Why are they there?
[19:53:26] SMalyshev: what are the messages?
[19:53:33] do they look like revision-creates?
[19:53:45] yes
[19:53:52] is it possible an app server is creating a revision in codfw?
[19:54:03] ottomata: https://pastebin.com/HPXY6tC3
[19:54:28] but why?
[19:54:39] good q!
[19:54:52] these seem to be completely random edits, no different from others, why did they suddenly go to codfw?
[19:55:37] the eventbus proxy service is routed by eventbus.discovery.wmnet and lvs, maybe that resolves differently somewhere, or maybe there is some reason these edits are actually happening in codfw
[19:55:41] jobqueue maybe?
[19:55:47] we should probably ask petr in services
[19:56:23] no, doesn't look like jobqueue
[19:57:44] just one random edit per hour suddenly ends up in codfw
[20:13:20] got another one, from 2018-10-25T20:03:56+00:00
[20:13:34] looks like a once-per-hour thing
[20:15:08] maybe some monitoring type thing?
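(One way to inspect those stray events directly, following the kafkacat pattern used elsewhere in this log - the broker choice assumes the codfw.* topics are mirrored to the jumbo cluster:)

    # -o end tails only new events instead of replaying the topic from the beginning.
    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t codfw.mediawiki.revision-create -o end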
[20:16:36] but these seem to be completely random edits from different users
[20:16:58] on different wikis even
[20:17:40] sometimes bots, sometimes humans
[20:18:43] https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=28&fullscreen&orgId=1
[20:18:52] it happens for other topics too
[20:19:15] recentchange, revision-score
[20:19:24] ah but revision-score is a reaction to revision-create
[20:19:30] so that will always happen at the same time
[20:19:44] but if it is in recentchange too, it does indeed look like mediawiki is sending the event
[20:19:46] revision-create is 0 there
[20:19:55] but it's not really 0
[20:19:58] naw
[20:20:00] it's not 0
[20:20:04] it spikes
[20:20:08] click on just the revision-create topic
[20:20:20] it looks like the behavior you are describing
[20:21:25] yes exactly
[20:21:31] so I wonder what's going on there?
[20:26:38] ottomata: do you know who might have any insight into this?
[20:28:01] SMalyshev: i'd ask in services and/or ops. this has to do mostly with mediawiki request routing
[20:28:15] ok, thanks
[20:28:18] sorry attention is slightly elsewhere atm
[20:43:18] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Framawiki) » {T203791}
[20:56:15] hallo
[20:56:47] yesterday elukey showed me a way to look at incoming EventLogging events: `kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_ContentTranslationAbuseFilter -o beginning`
[20:58:08] the code that sends events to the ContentTranslationAbuseFilter schema was deployed earlier today.
[20:58:27] if I run kafkacat now, I see an event that I'd expect to be logged in the output.
[20:58:48] but I don't yet see it in MySQL
[20:59:20] (CR) Joal: [C: 1] "LGTM !" [analytics/refinery] - https://gerrit.wikimedia.org/r/467700 (owner: Milimetric)
[20:59:42] that is, I'd expect to run `use log;` and `show tables;`, and to see the ContentTranslationAbuseFilter table, but I don't see it
[20:59:54] will it be auto-created? when can I expect to see it?
[21:03:03] aharoni: I'm no expert in EL, but I think the system must be restarted to pick up the latest config (the one that includes your topic in the mysql whitelist)
[21:03:40] joal: ow... does that happen regularly? Or do I have to request it?
[21:03:59] aharoni: I would think it happens regularly - I'll ping Luca tomorrow about that
[21:04:02] aharoni: you have to request it
[21:04:16] we used to blacklist things from going to mysql, now we whitelist things
[21:04:16] Thanks for telling me wrong ottomata
[21:04:18] :)
[21:04:42] ottomata: it's whitelisted in puppet. So what's the process for requesting it?
[21:04:44] OH
[21:04:46] it is?
[21:04:49] sorry
[21:04:59] then ya i think we just gotta bounce the processor, checking...
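(Once the processors are bounced and a new valid event flows through, the table should be auto-created; a quick check one might run against the log database - the replica hostname here is a guess, and note that EL tables carry a _<schema-revision> suffix, hence the LIKE pattern:)

    # Hypothetical connection details; the log database holds one table per EL schema revision.
    mysql -h analytics-slave.eqiad.wmnet log -e "SHOW TABLES LIKE 'ContentTranslationAbuseFilter%';"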
[21:05:02] something else interesting here: no ContentTranslationAbuseFilter table in hive either
[21:05:10] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/
[21:05:11] ottomata: --^
[21:05:24] joal: yes, not in hive either
[21:05:25] ottomata: no log about the EL processors having been bumped
[21:05:29] great
[21:06:00] ya they weren't bounced
[21:06:01] doing now
[21:06:23] ottomata: I thought the whitelist was for Mysql only
[21:06:38] !log bouncing eventlogging-processor client side* to pick up mysql whitelist change for ContentTranslationAbuseFilter (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/)
[21:06:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:10:47] Analytics, EventBus, Operations, Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (mobrovac)
[21:11:25] ottomata: joal - is it done now?
[21:11:32] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo)
[21:11:36] (or am I extremely impatient?)
[21:13:47] aharoni: yes it is, it takes minutes
[21:14:07] aharoni: did you test your schema in the beta cluster to make sure your events are valid before going to prod?
[21:14:26] aharoni: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
[21:15:12] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) Could it have to do with automatic checks somehow?
[21:18:23] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) Nope. The events are all legitimate edits, just a random portion of them. Here's an example: ``` {"comment": "/* wbeditentity-update:0| */ Updat...
[21:20:50] aharoni: Just checked the logs of the hive refine process - the last hourly job had processed hour 17, and your schema has 2 events for hour 19
[21:21:50] aharoni: On the hive side (the one I kinda know better) I think your events will show up in, say, ~2 hours :)
[21:22:49] Gone for tonight team - See you tomorrow
[21:26:09] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) Hm interesting. These are all original events, so either the EventBus proxy service messes up the DNS (less likely) or somehow legitimate reques...
[21:44:51] Analytics, Multimedia: Add mediacounts to pageview API - https://phabricator.wikimedia.org/T88775 (jmatazzoni) If you're looking for a project that might use this API, the [[ https://phabricator.wikimedia.org/project/manage/3543/ | Event Metrics tool ]] would love to be able to get an accurate count for...
[21:45:42] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) Things I've checked so far: - There are no logs anywhere associated with these events. - All the events are legitimate edits, they exist in DB,...
[22:08:55] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) >>!
In T207994#4696066, @Pchelolo wrote: > Things I've checked so far: > - There are no logs anywhere associated with these events. > - All the...
[22:24:53] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) We can keep theorizing about this, but we need more information before we could make any real theory. I think the first and foremost we need t...
[23:32:04] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Etonkovidova) I checked testwiki, cawiki (wmf.1), enwiki (wmf.26) and betalabs. testwiki, cawiki (wmf.1) Deletes 50 items...
[23:50:21] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Pchelolo) @Etonkovidova sorry I didn't update this ticket.. We've had an outage caused by the fix to it so it was reverted t...
[23:53:10] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Etonkovidova) >>! In T207329#4696235, @Pchelolo wrote: > @Etonkovidova sorry I didn't update this ticket.. We've had an outa...