[00:01:01] (03CR) 10Nuria: [C: 032] Defer to the config to specify the area [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464906 (https://phabricator.wikimedia.org/T188792) (owner: 10Milimetric) [00:05:33] (03Merged) 10jenkins-bot: Defer to the config to specify the area [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464906 (https://phabricator.wikimedia.org/T188792) (owner: 10Milimetric) [01:45:39] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888 (10Huji) The only advantage would be to know if queries are getting stuck in "pending" mode for too long. It used to be an issue a while back but hasn't been for a long time. [02:41:13] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888 (10zhuyifei1999) It was probably {T172143}. It would stay pending anyhow. [03:27:09] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) @mforns Great to hear that Druid [[https://phabricator.wikimedia.org/T201873#4633754 |already allows]]... [03:36:00] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) Another question: It seems that the dimensions lack e.g. `Ua Browser Major` and other user agent deriv... [03:48:58] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the... 
[04:07:27] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) Back to the view in Turnilo: This looks very exciting indeed! I have to mention that @ovasileva and I... [07:14:23] !log stopped all crons on analytics1003 as prep step for migration to an-coord1001 [07:14:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:14:26] morning :) [07:14:36] Hi elukey :) Investigating the error on MWH-reduced [07:36:58] joal: o/ - I am checking the refine failure, as Nuria mentioned it seems that it fails to allocate direct memory [07:37:11] rings a bell? Do we have a specific config for it? [07:37:46] elukey: "direct memory" is not really a term I have heard of so far for Spark - interested to understand more [07:38:51] so the stack trace doesn't really mention spark [07:38:54] but I can see [07:38:54] 18/10/08 17:20:50 ERROR RetryingBlockFetcher: Failed to fetch block shuffle_664_1_2, and will not retry (0 retries) [07:38:57] io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006632960, max: 1012924416) [07:39:46] elukey: shuffle issue from what I see - Could be related to dynamic allocation :( [07:42:32] it is interesting because it says that it already allocated 1G of direct memory [07:42:46] so we might have a specific setting for it? 
[07:42:52] or a default that we don't tune [07:44:22] the other thing is that https://yarn.wikimedia.org/cluster/app/application_1538849321221_6306 seems succeeded [07:44:44] ah yes but the failed refinement is separate [07:44:49] okok nevermind [07:45:17] elukey: the global-refine-job has plenty small refinements (per schema) and tracks the failed ones [07:46:40] ahh okok [07:46:46] other n00b question if I may [07:47:00] the docs says that I should find a _REFINE_FAILURE flag [07:47:00] elukey@analytics1003:~$ ls /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_ReadingDepth/hourly/2018/10/08/15 [07:47:04] eventlogging_ReadingDepth.1002.0.340743.901450646.1539010800000 eventlogging_ReadingDepth.1002.0.3652632.905111487.1539010800000 [07:47:07] but there isn't [07:47:26] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Tbayer) >>! In T153821#2962822, @Nuria wrote: > @Krenair if wikitech is not behing varnish pageviews cannot be collected. Correct. Seem... [07:49:15] elukey: hm - I assume it would be related to the type of job failure, but I'm not sure [07:50:15] JOSEPH IS NOT SURE? [07:50:46] joal: you are always sure with the right answer, don't fool me [07:50:48] :D [07:50:59] meh :) [07:51:12] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) `wikitech` is not part of the projects to account for in PageviewDefinition code (https://github.com/wikimedia/analytics-re... [07:58:38] there is something weird, even from the logs on an1003 I can see success [07:59:36] hm [07:59:50] Maybe Nuria's attempt fixed it [08:00:19] anyhow, it is 10 CEST [08:00:39] BRACE YOURSELF! an1003 will explode ! 
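Editor's aside: the numbers in the `OutOfDirectMemoryError` above are self-consistent, which is worth checking when debugging this class of failure — the 16 MiB chunk netty tried to allocate really would have pushed usage past the cap, and the cap of 1012924416 bytes is exactly 966 MiB (netty's direct-memory limit normally tracks the JVM's max direct memory, which is what the "do we tune it or is it a default" question is getting at). A minimal sketch of the arithmetic, using only the values copied from the stack trace:

```shell
# Values copied verbatim from the RetryingBlockFetcher stack trace above.
used=1006632960   # direct memory already allocated
req=16777216      # the 16 MiB chunk netty tried to allocate
max=1012924416    # netty's direct-memory cap

echo $(( used + req ))          # 1023410176 — what the allocation needed, > max
echo $(( max / 1024 / 1024 ))   # 966 — the cap in MiB
```

So the allocation was doomed by roughly 10 MiB, not by a pathological request size.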
[08:01:00] ahhaha [08:01:11] jobs still running, going to amend my puppet patch [08:01:21] (in the meantime while we wait) [08:04:18] elukey: hdfs dfs -ls hdfs://analytics-hadoop/wmf/data/event/ReadingDepth/year=2018/month=10/day=8/hour=15 [08:05:22] ah I checked raw! [08:05:35] my bad, thanks :) [08:05:45] I just realized it now [08:08:26] there is likely some event that is problematic [08:10:27] hm [08:11:27] anyway, it can wait a bit :) [08:11:55] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651184, @Tbayer wrote: >>>! In T153821#2962822, @Nuria wrote: >> @Krenair if wikitech is not behing varnish pag... [08:14:35] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651186, @JAllemandou wrote: > `wikitech` is not part of the projects to account for in PageviewDefinition code... [08:22:23] I think that while we wait we can move superset/hue to an-coord1001 [08:22:48] +1 elukey [08:23:13] elukey: I alos think the currently running jobs shouldn't prevent us from moving [08:25:49] ok so I am currently updating the puppet compiler with the new hosts (an-master/coord) so I'll be able to check the puppet patch [08:26:59] the patch is this one https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461997/ [08:28:10] so my idea is simple: stop oozie, stop hive (metastore/server2), stop hue, stop superset. Dump their databases, and copy them to an-coord1001 [08:28:15] then run puppet [08:28:19] and restart those daemons [08:28:57] we could stop superset and hue first, the oozie, then hive :) [08:29:10] The rest sounds good :) [08:30:31] yep [08:30:39] so superset/hue stopped, dumping databases [08:34:05] all right databases imported [08:34:23] joal: good to stop oozie/hive? 
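Editor's aside: the migration plan agreed above (stop superset/hue first, then oozie, then hive; dump each daemon's database; import on an-coord1001; run puppet; restart) can be sketched as a runbook. Hostnames are from the log; the service names, database names, and paths are assumptions — and the sketch only echoes each command rather than running anything:

```shell
OLD=analytics1003.eqiad.wmnet
NEW=an-coord1001.eqiad.wmnet

# Stop order agreed in the log: superset and hue first, then oozie, then hive.
for svc in superset hue oozie hive-metastore hive-server2; do
    echo "ssh $OLD sudo service $svc stop"
done

# Database names are assumed; the log only confirms each daemon has one.
for db in superset hue oozie hive_metastore; do
    echo "ssh $OLD 'mysqldump $db | gzip > /tmp/${db}.sql.gz'"
    echo "scp $OLD:/tmp/${db}.sql.gz $NEW:/tmp/"
    echo "ssh $NEW 'zcat /tmp/${db}.sql.gz | mysql $db'"
done

# Then apply the puppet change and restart the daemons in reverse order.
echo "ssh $NEW sudo run-puppet-agent"
```

As the rest of the log shows, the dump/import itself went smoothly (oozie's ~581M dump being the slow part); the follow-on breakage was all stale references to the old host.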
[08:34:33] elukey: +1 ! [08:36:16] stopped, dumping databases [08:38:40] oozie's db seems to take more than the others [08:39:19] I'm not suprised - Every hive-partition has multiple-steps-jobs to generate them [08:40:27] 581M oozie_09102018.sql.gz [08:40:28] :D [08:41:40] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) > Why is this not documented on the wiki creation page? I don't underdstand what the 'wiki creation page' is, but I think t... [08:43:32] loading the databases now [08:43:38] (on an-coord1001 [08:46:13] ok joal merging the patches [08:46:15] *patch [08:46:18] k elukey [08:46:23] will start with hue and superset [08:46:30] why ? [08:46:49] nevermind -- [08:47:12] I am still importing the oozie/hive databases :) [08:47:19] that's a good reason :)p [08:51:46] doing hive and oozie [08:56:54] I don't see any connection to mysql on analytics1003 anymore, keeping it monitored [08:57:18] elukey: I think currently running job will possibly do [09:00:32] oozie is up [09:00:49] same thing for hue and superset [09:00:56] aaand also hive [09:01:03] now it is a matter of testing them :) [09:02:08] elukey: hue tells me oozie is happy (so far) - it seems to have recovered the running job [09:03:39] elukey: the hive-query-editor in hue gives me an error: 10:57:18 < joal> elukey: I think currently running job will possibly do [09:03:42] 11:00:32 <@elukey> oozie is up [09:03:50] mwarf - wrong copy paste sorry [09:03:55] Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [09:04:30] elukey: same on spark from stat1004 [09:04:46] elukey: has puppet run on the statmachineS? 
[09:05:06] yes yes, but hive-server seems complaining [09:05:14] :( [09:10:10] joal: now it should work, it needed a bit of encouragement [09:10:19] currently testing spark, seems ok [09:10:35] the status of the init.d scripts for hive/oozie is embarassing [09:10:45] :S [09:11:04] elukey: we should look at big top before commiting to that, should we? [09:11:31] elukey: spark happy, hive-query-editor in hue happy [09:11:32] joal: I strongly suspect that those are the same [09:11:39] good :) [09:12:41] elukey: are the drons restarted from an-coord01? [09:13:00] nope, it is still cron-less :) [09:13:42] k [09:13:54] Shall we move on that? [09:14:56] I am preparing the code change now [09:14:59] k [09:18:21] currently deploying all (including refinery via puppet) [09:19:33] (03PS1) 10Elukey: Replace analytics1003 with an-coord1001 [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/465362 (https://phabricator.wikimedia.org/T205509) [09:19:57] (03CR) 10Elukey: [V: 032 C: 032] Replace analytics1003 with an-coord1001 [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/465362 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [09:21:24] joal: puppet complete, camus restarting [09:21:42] new coordinator deployed ) [09:21:43] :) [09:22:12] \o/ [09:28:18] elukey: manually killing previsouly started oozie job (MWH-reduced) - Seems stuck [09:28:22] ack [09:29:10] PROBLEM - Age of most recent Analytics meta MySQL database backup files on an-master1002 is CRITICAL: CRITICAL: 0/1 -- /srv/backup/mysql/analytics-meta: No files [09:29:12] elukey: refine job started - Looks like cron is cronning [09:29:23] interesting alert [09:29:30] yeah working on it [09:29:54] I just re-enabled puppet on an-master1002, I wanted to save the last analytics1003's backup [09:32:19] checked all the crons on an1003, and disabled the systemd timers [09:32:27] so we should be good [09:33:19] elukey: looks like oozie has an issue with hive metastore [09:33:24] failed job 
[09:33:37] it tried to connect to an1003 for metastore [09:33:47] elukey: https://hue.wikimedia.org/jobbrowser/jobs/job_1538849321221_8108/single_logs [09:34:40] Hoooo ! elukey : hdfs://user/hive/hive-site.xml !!!! [09:35:27] indeed - incorrect value for hive.metastore.uris [09:36:49] one at the time :) [09:39:15] joal: was the oozie job recent or an old one? [09:40:07] also, can you tell me more about the hive uris setting and where you found it? [09:40:19] otherwise it is difficult to understand where to check :) [09:44:07] elukey: was gone to bathroom sorry [09:44:30] elukey: oozie jobs are configured to read their hive setttings from a hive-site.xml file [09:44:35] this file is to on hadoop [09:44:54] we have it stored here: hdfs://user/hive/hive-site.xml [09:45:08] and it contains old (an1003) values [09:45:20] ah ok now it makes sense [09:46:03] in theory it should have been updated [09:46:41] hm [09:47:33] we have an exec in puppet to uplaod it [09:47:40] but it might not have worked as expected [09:47:44] let's change it now manually [09:47:54] k [09:48:13] joal: are you doing it? 
[09:48:22] I can :) [09:51:35] joal: even if I can simply upload an-coord's one [09:51:36] lemme try [09:52:54] elukey: done [09:53:01] elukey: retrying ooie [09:53:31] done [09:53:33] ahhahah okok [09:53:36] :D [09:53:42] I just did sudo -u hdfs hdfs dfs -put -f /etc/hive/conf.analytics-hadoop/hive-site.xml /user/hive/hive-site.xml [09:54:15] so the code is in profile::hive::site_hdfs [09:54:23] it uploads the new file only if it gets refreshed [09:54:28] so not this cas [09:54:31] *case probably [09:55:27] Oozie problem solved - job started [10:01:23] !log Restart failed oozie jobs (webrequest, virtual-pageviews, mwh-reduced) [10:01:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:02:31] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions [10:06:51] RECOVERY - Age of most recent Analytics meta MySQL database backup files on an-master1002 is OK: OK: 1/1 -- /srv/backup/mysql/analytics-meta: 0hrs [10:08:06] ok this is good --^ [10:19:49] mmmm check_webrequest_partitions is a bit weird [10:20:06] elukey: could cause cluster is a bit late? [10:20:44] ahhh no I confused all those M _ etc.. [10:20:51] it is only complaining about the last hour [10:20:57] righ [10:21:08] so yeah all proceeding good [10:26:58] setting analytics1003 as spare host to prevent any accidental attempt to come back to live [10:28:39] elukey: no zombies in an-vlan :) [10:29:45] hey team! [10:30:25] o/ [10:32:55] elukey, can I help with alarms? [10:33:25] mforns: all good, we replaced analytics1003 [10:33:36] ok :] [10:34:07] elukey, does it have another name now? [10:36:58] mforns: another host, an-coord1001.eqiad.wmnet :) [10:37:33] elukey, so data drop and refine jobs run there now, right? 
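Editor's aside: the stale value joal found in `hdfs:///user/hive/hive-site.xml` was the metastore URI that oozie-launched jobs read. For reference, the property in question is `hive.metastore.uris`; with the corrected host (port 9083 is the one that appears in the grep output later in the log) the fragment would look roughly like:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://an-coord1001.eqiad.wmnet:9083</value>
</property>
```

The `hdfs dfs -put -f` shown above simply overwrote the whole file with the puppet-managed copy from an-coord1001, which contains this value.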
[10:38:13] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651201, @JAllemandou wrote: >> Why is this not documented on the wiki creation page? > I don't underdstand what... [10:38:29] mforns: yep! [10:38:34] k :] [10:40:07] mforns: just to be sure, can you check that you can access the host via ssh ? [10:40:11] sure [10:41:24] elukey, yes, and I can sudo -u to hdfs user [10:41:33] nice [10:49:05] 10Analytics, 10DC-Ops, 10decommission, 10User-Elukey: Decommission analytics1003 - https://phabricator.wikimedia.org/T206524 (10elukey) p:05Triage>03Normal [10:53:43] 10Analytics-Kanban, 10User-Elukey: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642 (10elukey) 05Open>03Resolved [10:53:48] \o/ [10:54:29] :D [12:20:30] joal: in a scale between 0 and America, how free are your right now? for 2 min in the cave [12:34:31] (03PS1) 10Fdans: [wip] Add change_tag to mediawiki_history sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465416 [12:37:35] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (10elukey) [12:39:21] Original exception: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://analytics1003.eqiad.wmnet:10000/default;user=yarn;password=: [12:39:24] ah! 
[12:39:42] In theory this one should work only restarting it [12:43:34] all the jobs failed probably had hive-site.xml cached [12:43:51] better, were reading from the version of hdfs that was still mentioning an1003 [12:47:15] !log re-run apis-wf-2018-10-9-8 [12:47:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:48:54] !log re-run all the failed projectview-hourly-coord and aqs-hourly-coord workflows (restarting them via hue) [12:48:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:55:44] lovely it keeps failing [13:01:21] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) @Miriam any updates on this? Did you get a chance to talk with Michele and Tiziano? [13:04:13] the main cause seems to be [13:04:14] Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out (Connection timed out) [13:04:28] so I am pretty sure that they are still using an1003 [13:08:20] Heya fdans - I was in america, now distance 0 :) [13:09:47] joal: [13:09:48] elukey@stat1004:/mnt/hdfs/user/oozie$ grep -rni analytics1003 * [13:09:48] share/lib/lib_20170228165236/spark/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:51] share/lib/lib_20170228165236/spark2.2.1/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:54] share/lib/lib_20170228165236/spark2.3.0/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:57] share/lib/lib_20170228165236/spark2.3.1/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:10:00] * elukey cries in a corner [13:10:05] wow [13:11:25] no idea if this is used or not [13:11:38] but an1003 is cached in other places because some jobs are still failing [13:11:50] elukey: I think it would be used 
by spark jobs [13:12:00] We're gonna know soon [13:12:09] well we have a ton of failures :D [13:12:10] about other jobs failing, this is weird :( [13:12:19] but those are not spark's right? [13:12:57] Most are not (i don't know about API [13:13:08] I've seen you;ve restarted the failed ones? [13:13:24] yeah, failed again [13:13:41] ok - API is spark [13:13:43] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions [13:14:56] !log rerun failed aqs-hourl jobs [13:14:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:15:50] I already did it joal, but they failed.. did you change anything? [13:16:05] How crap - didn't notice there was 2 emails [13:16:09] sorry [13:16:11] hm [13:16:54] Ok understood [13:17:26] elukey: the /user/hive/hive-site.xml file is in use since about more than a year I htink [13:17:52] I am wondering if /usr/local/bin/spark2_oozie_sharelib_install is the issue [13:18:04] Before that, we were using a hive-site.xml file copied on HDFS with refiner [13:18:06] Before that, we were using a hive-site.xml file copied on HDFS with refinery [13:18:15] # If running on an oozie server, we can build and install a spark2 [13:18:18] # sharelib in HDFS so that oozie actions can launch spark2 jobs. [13:18:30] We started to use the new to prevent the exact error we're having now: having to restart everything for a change in hive-site.xml [13:19:02] If you look at the config on failing projectview or AQS, you'll see hive-site is not /user/hive/... [13:19:22] Ok - Restarting the failing jobs with correct config (and sending associated patch) [13:20:54] ahhhh [13:23:06] elukey: interesting ! 
those jobs conf have been updated already [13:23:17] They must not have been restarted since a long time :) [13:23:22] Doing so [13:24:04] (03PS1) 10Joal: Update hive-site.xml path in spark util [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [13:25:52] !log full restart of projectview_hourly [13:25:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:26:05] need go afk for ~1h, but will try to check! [13:26:50] !log Full restart of aqs oozie job [13:26:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:29:15] Ok problem solved for aqs_hourly and projectview_hourly [13:29:20] APIs left [13:30:43] I confirm spark jobs making use of hive-metastore are stuck :S [13:31:06] ottomata: as a good morning I need some help :S [13:32:53] hiiii [13:33:01] hm ok [13:33:02] o/ ! [13:33:06] what's up? [13:33:26] an-coord1001 is live and analtics1003 is dead :) [13:33:35] but there still are some issues left [13:34:22] Namely, oozie sharelibs for spark each have a copy of hive-site.xml referencing metastore being analytics1003 [13:34:26] ottomata: --^ [13:36:17] ohhh [13:36:20] ok [13:36:28] annoying sharelibs [13:36:39] !log fully restart projectview_geo oozier job [13:36:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:37:34] ottomata: Not sure if we need to fully make them again, of if a change of file is enough [13:37:49] looking [13:39:32] hm, spark2.3.1 doesn' thave hive-site [13:39:37] for spark 1 joal? [13:40:05] or, did you remove it? [13:40:09] hm, i see it in spark2.3.0 [13:40:29] It should be in spark2.3.1 per elukey searhc [13:40:41] its not, but it should be, maybe it was removed? [13:40:43] i will put it there [13:41:10] weird ottomata [13:41:41] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10Miriam) @bmansurov yes, sorry for the delay. 
We propose to cap the citation text in order to avoid these errors. Would that be ok? Thanks! [13:42:30] joal: i'm also going to remove the spark2.2.1 and spark2.3.0 sharelib dirs [13:42:37] hive-site.xml is now in 2.3.1 [13:42:41] sooo, try a job? [13:42:47] Will try one yes :) [13:43:07] About other libs, we should make sure they're not used anymore before removing? [13:43:21] Actually, it'll be a good way to know wherre they're still in use :) [13:43:50] haha too late! (they are in trash) [13:44:00] but yeah i doubt they are used, since we don't have those spark .debs installed anymore [13:44:02] so they shouldn't be! [13:44:25] ottomata: The apis jobs for instance still uses 2.3.0 :) [13:44:28] Restarting it now [13:44:47] :o [13:45:08] joal that is a little weird, I should add that to upgrade steps for spark 2 somewhere [13:45:35] (03PS2) 10Joal: Correct oozie jobs after move to an-coord1001 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [13:46:26] !log Restarting oozie-api job [13:46:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:49:59] actually ottomata - All our oozie spark job were still on spark2.3.0 :)( [13:50:03] Restarting them now [13:50:09] yikes sorry [13:50:11] i can put it back joal? [13:50:23] this is probably better though^ [13:50:57] I'll do it [13:51:03] it's better this way [13:51:12] ok [13:51:51] ottomata: Still a failure [13:52:03] Now spark is ok with new sharelib, but the job fails [13:53:07] table not found ottomata - I assume it's a related issue :) [13:53:23] ottomata: Have you run that command in oozie about the change of something in sharelib? [13:54:34] joal: no i doubt it'd be needed or do anything, since we didn't make a new sharelib [13:54:49] hm [13:54:51] maybe oozie itself needs a restart? maybe it caches that value? 
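Editor's aside: the remediation being discussed — refresh the stale `hive-site.xml` copies inside the oozie sharelib, tell oozie to reload the sharelib, and bounce the server — can be sketched as below. The sharelib path and spark directory names come from the earlier grep output; the local `hive-site.xml` path is an assumption, and the sketch only echoes the commands. `oozie admin -sharelibupdate` is the standard CLI for refreshing oozie's sharelib cache:

```shell
SHARELIB=/user/oozie/share/lib/lib_20170228165236   # path from the grep earlier in the log

# Overwrite each stale hive-site.xml copy in the sharelib.
for d in spark spark2.2.1 spark2.3.0 spark2.3.1; do
    echo "sudo -u hdfs hdfs dfs -put -f /etc/hive/conf/hive-site.xml $SHARELIB/$d/hive-site.xml"
done

# Refresh oozie's view of the sharelib, then restart the server.
echo "sudo -u oozie oozie admin -sharelibupdate"
echo "sudo service oozie restart"
```

Per the conversation that follows, the oozie restart is the step ottomata suspects actually did the trick.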
[13:54:51] hm [13:54:57] we can do that [13:54:58] seems unlikely [13:55:05] since it is in the job config [13:55:14] joal: is that the same error from before? table not found? [13:55:36] the value is in jobconfig (spark-share-lib) - I think oozie caches the sharelib content though [13:55:54] ottomata: I have not double checked previous error :( [13:55:57] hm [13:56:03] ok let's bounce oozie server, can/should I just do that [13:56:10] please [13:56:10] ? [13:56:19] !log bouncing oozie server on an-coord1001 [13:56:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:56:27] ottomata: I also think that command about sharelib update could be good [13:57:38] will run it [13:57:39] can't hurt [13:57:51] done [13:58:14] Rerunning a job [13:59:55] Success ottomata :) [14:00:19] yes! [14:00:36] ottomata: need to catch the kids, will be back for standup - Still to do: restart oozie-spark jobs with new spark-share-lib - Will do wen I'm back [14:00:37] hhai wonder which did the trick! shoulda tried a controlled experiement first! [14:00:44] :D [14:00:44] i betha oozie restart would ahve been enough [14:00:52] i doubt update sharelib would have done anythign [14:01:05] joal: i can work on that, are there more that need to be committed? [14:01:12] more changes in refinery oozie porperties? [14:01:21] ottomata: git st [14:01:24] oops [14:01:54] (03PS3) 10Joal: Correct oozie jobs after move to an-coord1001 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [14:01:57] ottomata: --^ [14:02:37] ottomata: Don't worry I'll restart the jobs manually with the settings in an hour or so - I have MWH-reduced to maonitor, so I'd arther be on it if ou don't mind [14:03:08] ok... [14:26:13] (03CR) 10Ottomata: [C: 031] "I like it." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465202 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [14:26:47] 10Analytics, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), 10Patch-For-Review: Improve Dashiki extension messaging - https://phabricator.wikimedia.org/T205644 (10Milimetric) No, this isn't deployed. There are two gerrit changes: 1. https://gerrit.wikimedia.org/r/463309 impl... [14:28:54] mforns: are you working on this? https://phabricator.wikimedia.org/T199693 [14:29:00] it's moved into kanban but with no assignee [14:29:31] milimetric, no! I moved it there yesterday, because I was sharing screen in groskin' meeting [14:29:42] we said that someone should grab it [14:29:57] hm... I have some nits on our process, will bring up today [14:29:59] that's why it's assigned to no one [14:30:07] hehe ok [14:36:54] (03CR) 10Ottomata: "Hm, so I'd love to be able to easily grasp all the parts here without needing a walkthrough. I mostly understand, but since it isn't obvi" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465206 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [14:46:41] hello!! [14:47:11] ottomata,joal - sorry I forgot to log in here, I tried to remove the spark 2.3.1 lib from the oozie hdfs dir and re-create it [14:47:22] but hive-site.xml was not added for some reason [14:47:36] (then I had to go afk and I forgot to ask sorry) [14:47:46] ottomata: how dod you fix it? Manually copied the file? [14:49:03] yeah [14:49:07] and restarted oozie? [14:49:18] nope I didn't [14:52:20] i did [14:52:23] that seemed to do it [14:52:26] but i'm not certain why [14:53:17] !log Restart clickstream oozie job to pick new spark-lib [14:53:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:53:44] ottomata: ah ack! [14:53:55] so now if I got it correctly we should be good right? 
[14:53:59] nothing exploding [14:54:58] about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465381/, let's decide where to put it [14:56:19] !log Restart check_denormalize oozie job [14:56:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:57:32] !log restart mediawiki-history denormalize oozie job [14:57:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:58:58] elukey: yeah it looks good now, joal is restarting some jobs to update the sharelib path (we were using an older spark2 version [14:58:58] ) [14:59:34] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (10Milimetric) [14:59:36] 10Analytics, 10Analytics-Wikistats: Read Dashiki annotations into Wikistats - https://phabricator.wikimedia.org/T194702 (10Milimetric) 05declined>03Open The task has more than just the title, and some of it still needs to get done. [15:00:12] !log restart wikidata-article-placeholder oozie job [15:00:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:41] nuria: yoohooo [15:04:07] (03CR) 10Milimetric: [C: 032] Change the label to the last day of the week [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/465152 (https://phabricator.wikimedia.org/T206456) (owner: 10Amire80) [15:06:00] ottomata: on friday we were chatting about this (eventlog1002) [15:06:01] /dev/mapper/eventlog1002--vg-data 870G 717G 110G 87% /srv [15:06:12] oo [15:06:21] yso much? 
oh because we have more events [15:06:22] hm [15:06:30] 10Analytics, 10Analytics-Kanban: Table view of timely results in wikistats 2 should be ordered in time descending - https://phabricator.wikimedia.org/T199693 (10Milimetric) a:03Milimetric [15:06:48] 10Analytics, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) a:05fdans>03Nuria [15:08:08] !log restart wikidata-specialentites oozie job [15:08:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:08:13] !log restart wikidata-coeditors oozie job [15:08:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:08:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) [15:10:30] (03CR) 10Ottomata: [C: 031] Update DataFrameToHive for dynamic partitions (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465202 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [15:10:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) [15:10:39] !log restart Mediawiki-history-reduced [15:10:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:10:58] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) According to [[ https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=now-7d&to=now-5m&var-datasource=eqiad%20prometheus%2Fops&var-topic=eventloggin... [15:11:45] (03CR) 10Milimetric: [C: 032] "I have marked the jobs to be rerun since the beginning of time. 
This should happen over the next few hours and the data will automaticall" [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/465152 (https://phabricator.wikimedia.org/T206456) (owner: 10Amire80) [15:13:46] 10Analytics: eventlogging logs taking a huge amount of space on eventlog1002 and stat1005 - https://phabricator.wikimedia.org/T206542 (10elukey) p:05Triage>03High [15:13:53] ottomata: opened --^ to track the issue [15:17:29] hm, interesting. we keep for 30 days. [15:17:33] not really that long... [15:33:33] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) Right, I get it now :) We discussed withbthe team and our plan is to change how we detect/filter pageviews from a domain pe... [15:36:51] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10bd808) Adding @harej and @srodlund as subscribers as I think they will be interested in the outcome here. [15:44:31] ottomata: are we doing this MEP meeting? [15:45:00] elukey, is it possible for me to test something in Turnilo's config.yaml file? How could I do it? [15:45:44] milimetric: yes [15:52:38] mforns: in theory we could test it live on the host, but if it is a quick thing.. otherwise I can try to set up something in labs [15:53:29] elukey, it would be adding a measure to a given datasource [15:53:49] with a formula [15:54:09] we can quickly try on the fly [15:54:28] ok, let me know when it's good for you, it doesn't neet to be today [15:55:30] mforns: now it is fine, can you give me the change? [15:55:40] elukey, yes, one minute [15:55:59] elukey, should I create a puppet patch? 
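Editor's aside: for the Turnilo change mforns is preparing — adding a derived measure with a formula to one datasource in `config.yaml` — the shape is roughly the fragment below. The datasource, measure, and field names here are purely hypothetical; Turnilo measures take a `formula` written as a Plywood expression over `$main`:

```yaml
dataCubes:
  - name: some_datasource        # hypothetical datasource
    measures:
      - name: error_rate         # hypothetical derived measure
        title: Error rate
        formula: $main.sum($errors) / $main.sum($requests)
```

A measure generally cannot be added in isolation without the rest of the dataCube definition, which is why familiarizing with the existing config first (as mforns decides to do) is the right call.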
[15:56:29] mforns: if you want to make it permanent afterwards yes [15:56:40] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Nuria) Clarifying: - wikitech pageviews can now be computed as now wikitech wiki is behind varnish, webrequest table gets all data (... [15:57:31] joal: added note to ticket: https://phabricator.wikimedia.org/T153821 [15:57:43] elukey, actually, I need more time to familiarize with the config syntax, not sure if I can just add one measure to a given datasource or I have to specify everything else for that datasource as well... [15:58:20] elukey, let me ping you tomorrow, and I'll have something ready-ish [15:58:29] ack! [15:58:44] thaaanks :] [15:59:27] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (10Nuria) @milimetric: i have transferred most dashiki annotations (there were not that many) by hand, is this task still needed? [16:02:14] ottomata: jumping in? https://meet.google.com/dcd-vvqb-dhd [16:02:50] OH OOPS wow how did 5 minutes go by [16:26:32] 10Analytics, 10User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (10AndyRussG) Hi!!! Many apologies for the delay here... I think it makes sense to build the realtime data consumer based on the EventLogging stream. The only drawback would be that initiall...
[16:36:09] really weird [16:36:10] Could not open client transport with JDBC Uri: jdbc:hive2://analytics1003.eqiad.wmnet:10000/default;user=yarn;password=: java.net.ConnectException: Connection timed out (Connection timed out)) [16:36:16] this is from 15mins ago [16:45:50] but it happens only for TestSearchSatisfaction2 and SearchSatisfaction [16:46:46] HMMMM [16:47:00] yeah saw those in refine alert, been meaning to check in after emails/lunch [16:47:05] strange it is only those too.... [16:47:06] hm [16:47:22] I am wondering if those are handled separately by say Erik [16:47:30] mmm ya [16:48:14] are they coming to kafka via EL javascript client or server side? ... although for refine that would not matter [16:52:11] elukey: i deleted those alarms cause i assumed it was the switch w/o noticing they were only for those schemas [16:53:23] yep yep, I noticed analytics1003 in those for a recent alert and I thought it was weird [16:54:47] ottomata: plis ping me when you are looking into it [16:57:03] elukey: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=SearchSatisfaction [16:57:08] elukey: it has no traffic [17:01:53] making lunch... [17:02:02] but yeah, even so, where would it be getting analytics 1003 from??? [17:18:49] ottomata: from a stale dns [17:18:54] naww [17:18:56] ottomata: jdbc connection [17:19:13] its a short lived cron job [17:20:25] ottomata: and we know for sure that it is generated from our cron right? [17:20:37] I know!~ [17:20:38] ottomata: mmmm.. the parameters of the connection it opens might not be so short lived as the cron [17:20:57] refinery source code has analytics1003 as a default param, and it isn't overridden in the properties [17:21:25] i will remove the default... [17:21:28] and manually set it [17:21:33] ah there you go :) [17:22:00] elukey: so i understand, the analytics1003 is no longer used for these jobs anymore, right?
[17:22:04] elukey: is there anywhere in puppet that hive server url is set...? looking [17:23:47] would like to use it from hiera rather than hardcoding in ::job::refine class [17:23:52] looks like not though [17:24:35] nuria: yep, nothing runs on it anymore [17:24:51] oh yes there is [17:24:54] i must have been grepping wrong [17:25:02] profile hive client has it [17:28:26] elukey: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465458/1/modules/profile/manifests/analytics/refinery/job/refine.pp [17:29:47] sure but not analytics1003 no? [17:29:57] ottomata: i still do not understand why it wouldn't fail for all schemas though. [17:30:32] (03PS1) 10Ottomata: Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) [17:30:44] nuria: i don't get that either...maybe it is failing too early [17:31:04] elukey: ? [17:31:19] sorry I didn't get that you were fixing it [17:31:26] going to review in a sec, I am merging another change [17:31:31] k [17:31:34] nuria: i'm going to try to rerun [17:31:37] want to do it with me and see? [17:31:39] to practice? [17:31:48] we can override the CLI opt and set manually when we try [17:31:59] bc? [17:32:06] ottomata: yess [17:32:08] k [17:32:16] ottomata: let me get headset [17:33:19] ottomata: on bc [17:36:06] 10Analytics, 10Operations, 10ops-eqiad: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey I am in conversation with DELL about the server, getting them the info they need.....nothing has been decided yet but as soon as they tell me what they're sending (should be a... [17:38:45] (03CR) 10Elukey: [C: 031] Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [17:39:23] ottomata: did you also send a cdh module change too?
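[Editor's sketch, not refinery's real API: the failure mode above — a stale hive server URL hardcoded as a code default, never overridden by the puppet-rendered properties — is a classic config-precedence bug. This hypothetical Python fragment illustrates the precedence chain (code default < properties file < CLI override) and why the dead host leaked through.]

```python
# Illustrative only: names and values mirror the chat, not Refine's actual code.
CODE_DEFAULT = {"hive_server_url": "analytics1003.eqiad.wmnet:10000"}  # stale host

def resolve(props: dict, cli: dict) -> dict:
    # Later dicts win on key collisions: code default < properties file < CLI.
    return {**CODE_DEFAULT, **props, **cli}

# The properties file never set the key and no CLI flag was passed,
# so the stale code default silently wins:
stale = resolve({"database": "event"}, {})
print(stale["hive_server_url"])  # analytics1003.eqiad.wmnet:10000

# The manual rerun in the chat overrides it explicitly on the command line:
fixed = resolve({}, {"hive_server_url": "an-coord1001.eqiad.wmnet:10000"})
print(fixed["hive_server_url"])  # an-coord1001.eqiad.wmnet:10000
```

Removing the code default entirely (as the gerrit change 465459 does) turns this silent fallback into a loud, early failure when the properties forget the key.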
[17:39:41] I can see in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465458/ the submodule updated [17:40:09] but maybe that was my last change [17:44:51] 10Analytics, 10Analytics-Kanban: is-yarn-app-running script should output the running application id - https://phabricator.wikimedia.org/T206555 (10Ottomata) [17:44:57] oops i didn't mean to elukey [17:45:02] fixing [17:45:15] fixing it now [17:45:18] should be ready [17:45:32] oh [17:45:33] oop [17:45:34] s [17:45:49] gr8 danke [17:48:07] ottomata: about Nuria's email for the ReadingDepth refine failure - can you give me some hints about how to check what made the refinement fail? Yarn logs and /var/log/refinery ones do not contain much afaict [17:48:11] (but I could be wrong!) [17:48:37] elukey: looking at that now with andrew on bc [17:48:48] elukey: i will follow up [17:50:48] ottomata: how come the default value in refine-params has affected only a single schema? I don't get it [17:51:36] nuria: ah great! Will wait for the mail update [17:51:49] joal: ya, we do not get it either [17:52:18] Ah ok :) I feel less alone - I'll stay with elukey, waiting for news from the frontline [17:53:07] I am curious to know if the other refinement problem (ReadingDepth) is due to the jvm's direct memory settings or something else [17:53:18] anyhow, dinner time, going offline team! [17:53:19] joal: i don't get that yet either [17:53:32] elukey: come to batcave and discuss! :) [17:53:35] bye elukey - We'll talk again about direct memory :) [17:53:40] ok nm byyeee [17:54:20] ottomata: The direct memory seems related to shuffle - might be interesting to see if no-dynamic-allocation helps [17:54:21] ottomata: I could but given the fact that it is 8 PM in here Marika might kill me :D [17:54:39] elukey: Please don't risk that :) [17:54:44] :D [17:54:47] byyeee [17:58:44] np!
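[Editor's note: the morning's failure was io.netty.util.internal.OutOfDirectMemoryError during a shuffle block fetch, with roughly 1 GiB of direct memory already in use. Netty's off-heap buffers are capped by the JVM's -XX:MaxDirectMemorySize, so the two experiments the chat floats — more direct memory, and joal's no-dynamic-allocation test — can be expressed as standard spark-submit --conf flags. The specific values below are illustrative, not the job's real settings.]

```python
# Assemble spark-submit --conf flags for the two experiments discussed above.
# All three property names are standard Spark/JVM options; the values (2g, 8)
# are placeholders, not tuned for the actual refine job.
confs = {
    # Raise the executors' direct (off-heap) memory cap used by netty buffers:
    "spark.executor.extraJavaOptions": "-XX:MaxDirectMemorySize=2g",
    # Test the hypothesis that dynamic allocation is involved in the shuffle issue:
    "spark.dynamicAllocation.enabled": "false",
    # With dynamic allocation off, pin the executor count explicitly:
    "spark.executor.instances": "8",
}
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(confs.items()))
print(flags)
```

The error's numbers (used: 1006632960, max: 1012924416) are consistent with a cap just under 1 GiB, which is why raising the direct-memory limit is one plausible fix to try alongside disabling dynamic allocation.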
byyee [18:24:25] !log adding Accept header to all varnishkafka generated webrequest logs [18:24:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:30:34] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) Just added the `accept` field to the varnishkafka generated webrequest logs. @JAllemandou I haven't done this in a while, I'll ping you in m... [18:31:30] (03CR) 10Ottomata: [C: 032] Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [18:56:01] (03PS1) 10Ottomata: Make is-yarn-application-running --verbose more informative [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465471 (https://phabricator.wikimedia.org/T206555) [19:04:57] ottomata: I observe weird behaviors in DataFrameToHive [19:05:02] Have a minute? [19:05:33] joal ya [19:05:37] bc? [19:05:56] OMW [19:42:27] ottomata: The alter is actually fired at every run [19:42:39] I think we need to find why and try to prevent :) [19:43:05] I have also found another bug - I don't even understand how it was not failing before [19:47:01] joal: like the regular refine is doing that every time too? [19:47:10] joal bc again? [19:47:11] ottomata: possibly yes ! [19:47:13] sure [19:47:21] joal i don't think it is...i think i would see it [19:47:27] when i run it manually [19:47:28] but maybe not! [20:06:30] 10Analytics-Kanban, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Nuria) @Tbayer, you do not need any special permissions to access any type of data, the datasources that were accessible through these permits have sinc... [20:21:14] ottomata: Interesting finding! 
two spark-sql types are different if they don't have the same metadata (comments for instance) [20:21:23] This is where my repetition comes from [20:24:43] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10mforns) @Tbayer > @mforns Great to hear that Druid already allows ingestion of array types! But just to clar... [20:26:19] OHHHH [20:26:22] interesting [20:26:34] So when we use Seq.diff --> it uses equal [20:26:36] joal why don't they have the same comments? oh because source data is from parquet files instead of table? [20:26:48] correct [20:26:51] hmmm [20:27:03] but the json data doesn't have comments... [20:27:10] Same in json for instance - Except that since refine created the tables, no comments (no problem) [20:27:16] ohhhhHHHHH [20:27:17] right. [20:27:26] huh [20:27:33] Looking for an elegant patch [20:27:43] so the alter is removing the comments every time [20:27:49] joal: couldn't you just let your job create the table? [20:28:11] ottomata: For sure, I could even create the table without comments :) [20:28:31] aye [20:28:47] Now an interesting part is that it only tries to alter the subobject field, not others [20:28:52] it sucks that we are losing the comments here, but maybe the convert to schema is doing the right thing here! [20:28:56] While others have comments too [20:28:57] oh that is interesting [20:29:22] if the incoming schema had comments, we'd want it to keep them on the output schema [20:29:25] I have no clue why [20:29:44] indeed ottomata - I'm gonna make sure this is what happens [20:30:01] Enough for tonight though :) [20:30:10] I'll keep on searching on that tomorrow [20:31:14] great find! [21:14:45] 10Analytics-Kanban, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Tbayer) >>!
In T178802#4653097, @Nuria wrote: > @Tbayer, you do not need any special permissions to access any type of data, the datasources that were a... [21:45:48] milimetric: yt? [21:46:30] no, out picking up steph, what’s up nuria [21:46:49] milimetric: that's fine, ping me if/when you get back online [22:23:08] regarding the above discussion about analytics1003: does this mean that the entire server is renamed now? [22:23:26] it is used in a lot of other contexts, cf. https://wikitech.wikimedia.org/w/index.php?search=analytics1003&title=Special%3ASearch&go=Go [22:25:18] e.g. groceryheist and i couldn't run hive queries from SWAP today as documented at https://wikitech.wikimedia.org/wiki/SWAP#Querying_data ... worked only after changing analytics1003 to an-coord1001 [23:08:28] (03PS3) 10Nuria: [WIP] Time dimension should be reseted to "1-Month" for top metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/465296 (https://phabricator.wikimedia.org/T206479) [23:17:36] (03PS1) 10Mforns: Refactor EventLoggingToDruid to use whitelists and ConfigHelper [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465532 (https://phabricator.wikimedia.org/T206342) [23:18:19] (03CR) 10Mforns: [C: 04-2] "Still need to figure out how to use properties file with ConfigHelper." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465532 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [23:31:37] HaeB: yes, all references need to be updated as that host no longer exists [23:33:57] ok, thanks for clarifying - this wasn't apparent from the announcement https://lists.wikimedia.org/pipermail/wiki-research-l/2018-October/006477.html (CC elukey neilpquinn ) [23:44:38] sorry HaeB that's a documentation update lag on our part [23:55:42] nuria: hi, back, what's up
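[Editor's sketch of the DataFrameToHive repetition joal diagnoses above, in plain Python rather than Spark's Scala API: when field equality includes metadata (column comments), a Seq.diff-style comparison of the Hive table schema against the comment-less schema read from parquet reports a "change" on every run, so the ALTER TABLE fires every time. The Column class below is a stand-in, not Spark's StructField.]

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    metadata: dict = field(default_factory=dict)  # comments would live here

# Hive table schema (created with a comment) vs schema read from parquet data
# (which carries no comments):
table = [Column("event", "struct", {"comment": "event payload"})]
data = [Column("event", "struct")]

# Naive diff, like Scala's Seq.diff: full equality, metadata included,
# so the comment-less field looks "new" on every single run.
naive = [c.name for c in data if c not in table]
print(naive)  # ['event']

# A metadata-insensitive diff compares only name and type: no spurious change.
known = {(c.name, c.dtype) for c in table}
strict = [c.name for c in data if (c.name, c.dtype) not in known]
print(strict)  # []
```

This also matches the observation that the alter strips the comments each time: the "new" comment-less field definition overwrites the commented one, and the next run diffs unequal again.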