[00:13:24] <wikibugs>	 (03PS24) 10Nuria: Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[00:16:07] <wikibugs>	 (03CR) 10Nuria: "Moved dependency to base pom to fix issue with iOS spark install. I think this code (or rather the oozie that calls it) should probably us" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[05:42:08] <wikibugs>	 10Analytics, 10New-Readers, 10Easy: Split opera mini in proxy or turbo mode - https://phabricator.wikimedia.org/T138505#2402369 (10ralgara) Some key items as I start reading up:  - [[ https://github.com/ua-parser/uap-core/blob/master/test_resources/opera_mini_user_agent_strings.yaml | Opera Mini UA strings ]...
[06:00:46] <wikibugs>	 10Analytics, 10New-Readers, 10Easy: Split opera mini in proxy or turbo mode - https://phabricator.wikimedia.org/T138505#3892194 (10ralgara) @Nuria adn @atgo: the breakdown I think we want is into standard, proxy and remote for both Opera Mini and UC Mini. Can you please confirm?.   Do we have a larger recent...
[08:37:55] <elukey>	 hello people, moving to the co-working, will be back in a few
[08:38:05] <joal>	 Hi elukey :) Have a good trip :)
[08:55:03] <wikibugs>	 (03CR) 10Joal: "This patch moves our usual way of dealing with time partitions. In other jobs, we used non-padded months, days and hours in partitions, wh" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403484 (https://phabricator.wikimedia.org/T170764) (owner: 10Milimetric)
[09:01:42] <elukey>	 aaand back
[09:04:35] <elukey>	 !log reboot analytics1051->1054 for kernel updates
[09:04:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:06:11] <elukey>	 so joal, whenever you have time shall we discuss what was "weird" in the hadoop coordinator in labs? If we resolve the outstanding issues the next step is test the update to java8!
[09:06:23] <joal>	 elukey: For sure !
[09:07:45] <joal>	 elukey: as of now (actually, was already yesterday), hive is broken (Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient --> Connection refused)
[09:07:54] <joal>	 Which prevents me from running oozie jobs
[09:08:20] <elukey>	 how does hive dare to return this error!
[09:08:24] <elukey>	 :D
[09:08:40] <elukey>	 so the metastore is not available, checking now!
[09:08:42] <joal>	 elukey: I've tried to threaten it, but didin't change anything
[09:09:26] <joal>	 elukey: I think my hammer is not powerfull enough to make tools fear it
[09:16:02] <elukey>	 so both hive daeamons are not binding their port
[09:16:11] <elukey>	 I think it is a residue of my experiments
[09:16:12] <elukey>	 lemme check
[09:20:23] <elukey>	 joal: all right all ports up
[09:20:26] <elukey>	 hive should be running
[09:34:44] <elukey>	 !log reboot analytics1055->1058 for kernel updates
[09:34:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:01:05] <elukey>	 !log reboot analytics1059-61 for kernel updates
[10:01:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:18:39] <elukey>	 really nice https://github.com/wikimedia/operations-software-druid_exporter/issues/1
[10:18:48] <elukey>	 first pull request, people are using it :)
[10:28:37] <joal>	 elukey: please excuse me, got an unexpected visitor
[10:28:43] <joal>	 hive works indeed :)
[10:28:58] <joal>	 elukey: trying to launch again oozie job
[10:29:27] <elukey>	 nice!
[10:43:48] <icinga-wm>	 PROBLEM - Number of banner_activity realtime events received by Druid over a 30 minutes period on tegmen is CRITICAL: CRITICAL - druid_realtime_banner_activity is 0 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[10:44:31] <elukey>	 ah!
[10:44:34] <elukey>	 it works!
[10:44:51] <joal>	 :D
[10:45:27] <elukey>	 today I'll refactor your code review joal, we have migrated everything related to refinery to profiles
[10:45:32] <elukey>	 so I am finally unblocked
[10:45:41] <joal>	 Yay elukey !
[10:45:42] <elukey>	 would you mind to restart it now?
[10:46:47] <joal>	 surely not elukey 
[10:46:59] <joal>	 surely not == I do NO mind :)
[10:47:51] <joal>	 !log Restarting banner-streaming job after hadoop nodes reboot
[10:47:58] <joal>	 Done elukey 
[10:48:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:49:32] <joal>	 elukey: succesful oozie jobs on labdoop
[10:50:25] <joal>	 elukey: I'm ready for us to test upgrading to j8 !
[10:50:48] <elukey>	 niceeee
[10:53:48] <icinga-wm>	 RECOVERY - Number of banner_activity realtime events received by Druid over a 30 minutes period on tegmen is OK: OK - druid_realtime_banner_activity is 397 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[10:54:02] <joal>	 elukey: <3
[10:54:04] <joal>	 --^
[10:54:07] <elukey>	 \o/
[10:54:32] <joal>	 elukey: I really feel we're almost done now on that streaming task :)
[10:54:48] <elukey>	 now I am trying to grab the etherpad that we used the last time for cdh upgrade
[10:54:58] <elukey>	 I don't remember the order to shutdown the cluster
[10:55:04] <joal>	 elukey: With a patch to automagically restart the job, I call it not only finished, but awesomely-finished 
[10:55:04] <elukey>	 not that it matters a lot in this use case
[10:55:36] <elukey>	 joal: let's do this - I am going to fix the patch now so we'll call it done done done, and after lunch we'll play with j8
[10:55:44] <joal>	 elukey: +1 1
[10:55:46] <joal>	 !
[10:55:57] <joal>	 many thanks for that elukey :)
[10:58:12] <joal>	 elukey: I have a change to ask for before you atually merge (in the command to launch the job)
[11:06:00] <elukey>	 sure
[11:06:21] <elukey>	 I am also thinking to rename the profile/role from banner_stream to streams_checker
[11:06:25] <elukey>	 what do you think?
[11:06:35] <elukey>	 so we'll be able to add say a checker for the netflow data
[11:06:45] <joal>	 elukey: I think it's a good idea if we plan to add more datasets
[11:06:48] <joal>	 +1
[11:09:42] <elukey>	 joal: going to send the code review, then you can amend it as you wish and then I'll merge ok?
[11:10:02] <joal>	 Yes elukey - grat
[11:13:33] <wikibugs>	 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284834 (10MoritzMuehlenhoff) Given that this task is stalled for a while now, we should reimage these servers with stretch before eventually putting...
[11:15:52] <wikibugs>	 10Analytics, 10EventBus, 10Services (next): EventBus rejecting events because of malformed characters in the comment - https://phabricator.wikimedia.org/T184698#3892747 (10mobrovac) p:05Triage>03Normal
[11:22:52] <elukey>	 joal: https://gerrit.wikimedia.org/r/#/c/395504/3
[11:28:13] <elukey>	 also found the shutdown procedure in https://etherpad.wikimedia.org/p/analytics-cdh5.10
[11:29:58] <joal>	 elukey: looking at the procedue
[11:33:16] <elukey>	 I'd say to follow it as it is, so safemode + shutdown all daemons
[11:33:27] <elukey>	 then my idea is to do the following
[11:33:34] <elukey>	 1) install openjdk-8 packages
[11:33:48] <elukey>	 2) run update-java-alternatives on all hosts to java 8
[11:34:01] <elukey>	 and then restart everything 
[11:34:20] <joal>	 elukey: no package change needed?
[11:34:25] <joal>	 elukey: REALLY?
[11:34:26] <elukey>	 in this way, if we have fire, we shutdown again, flip update-alternatives to java 7, restart
[11:35:17] <elukey>	 joal: do you think that we should have some?
[11:36:45] <joal>	 hm, shouldn't hadoop packages be updated to run with java8?
[11:37:11] <joal>	 elukey: if not, I think I had misunderstood the labs cluster you put together :)
[11:38:38] <elukey>	 so in theory, as far as I got, the packages that we have should run fine on java 8 (2.6.0+cdh5.10.0+2102-1.cdh5.10.0.p0.72~jessie-cdh5.10.0 don't mention any java version) plus the cloudera guidelines do not talk about any hadoop package update
[11:39:01] <elukey>	 the goal of the labs cluster is to make sure that this is true and that basic stuff works fine after the j8 migration
[11:39:07] <elukey>	 and also that the procedure is good
[11:39:41] <joal>	 !log rerun mediacounts-load-wf-2018-1-11-8
[11:39:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:40:20] <joal>	 elukey: hm, and all the packages for our hadoop sack are the same for j7 and j8? I'm imporessed :)
[11:40:49] <elukey>	 https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_cm_upgrading_to_jdk8.html
[11:41:19] <elukey>	 seems so from --^
[11:41:26] <joal>	 okey then :)
[11:41:59] <joal>	 elukey: I commented in the etherpad - I think we should review machines names and so - do we do that together? batcave?
[11:42:49] <elukey>	 ah yes of course, for the moment I just wanted to have a procedure for the labs cluster to test
[11:42:57] <joal>	 :)
[11:43:05] <elukey>	 once we are confident that in labs all works fine, we do a proper prod one
[11:43:13] <elukey>	 Notice: /Stage[main]/Profile::Analytics::Refinery::Job::Streams_check/Cron[refinery-relaunch-banner-streaming]/ensure: created
[11:43:33] <joal>	 elukey: Can I kill the job manually, to double check it gets relauncheD?
[11:43:56] <elukey>	 */5 * * * * PYTHONPATH=/python /bin/is-yarn-app-running BannerImpressionsStream || /usr/bin/spark2-submit --master yarn --deploy-mode cluster --queue production --conf spark.dynamicAllocation.enabled=false --driver-memory 2G --executor-memory 4G --executor-cores 3 --num-executors 4 --class org.wikimedia.analytics.refinery.job.druid.BannerImpressionsStream --name BannerImpressionsStream /artifac
[11:44:03] <elukey>	 ts/refinery-job.jar --druid-indexing-segment-granularity HOUR --druid-indexing-window-period PT10M --batch-duration-seconds 60 > /dev/null 2>&1
[11:44:07] <elukey>	 this is the job in the crontab --^
[11:44:12] <elukey>	 joal: ack, let's do it
[11:44:45] <joal>	 elukey: I'll let the thing bake some time first - checking that it doesn't do anything as of now - Then kill the job
[11:45:18] <elukey>	 !log re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots)
[11:45:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:45:34] <joal>	 Ah ! missed that one - Thanks elukey :)
[11:45:41] <elukey>	 joal: sure, mind if I go to lunch in the meantime?
[11:45:46] <joal>	 please go
[11:46:06] <elukey>	 ack! ttl! :)
[11:46:41] <joal>	 elukey: as planned, there is an error :)
[11:46:51] <joal>	 we'll see that after your lunch
[11:47:21] <joal>	 taking a break, back to recombine with elukey after
[12:32:43] <milimetric>	 joal: I got the padded uri-template from mediawiki jobs: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/datasets_raw.xml#L17
[12:32:52] <milimetric>	 and I was confused, because some use padding, some don't
[12:33:22] <milimetric>	 but now I'm more confused, because what happened before (when I didn't have the padding) is the SUCCESS flags were written to different directories
[12:33:37] <milimetric>	 so like 2018-1-2 for the flag, 2018-01-02 for the data
[12:33:45] <milimetric>	 which is clearly wrong, right?
[12:33:57] <milimetric>	 oh, sorry, you're on break, talk later
[12:50:21] <elukey>	 ahhh /bin/is-yarn-app-running not there
[12:50:23] <elukey>	 checking
[12:52:04] <elukey>	 joal: my fault, fixing it
[13:06:59] <elukey>	 no idea why refinery-dump-status-webrequest-partitions reports errors
[13:07:20] <elukey>	 but it is likely a false positive since everything is marked as not healthy
[13:15:10] <joal>	 elukey: only bizarre thing is that refinery-dump-status-webrequest-partitions reports while failing oozie jobs don't
[13:15:45] <elukey>	 I checked in /mnt/hdfs and data seems to be there
[13:16:05] <joal>	 elukey: the error is from labs I think :D
[13:16:50] <elukey>	 aaaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhh
[13:16:55] <elukey>	 you are right!!!
[13:16:59] <elukey>	 now I feel better
[13:17:05] <elukey>	 yes it is the labs cluster
[13:17:11] <elukey>	 lol
[13:17:22] <joal>	 elukey: wrong dump status from prod, you'd have seen me shouting all over the place this morning :D
[13:18:49] <elukey>	 I was trying /srv/deployment/analytics/refinery/bin/refinery-dump-status-webrequest-partitions --hdfs-mount /mnt/hdfs --datasets webrequest,raw_webrequest --quiet on an1003 and I got a lot of X, this is why I was confused
[13:19:42] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893184 (10Xqt)
[13:20:14] <joal>	 elukey: sudo -u hdfs /srv/deployment/analytics/refinery/bin/refinery-dump-status-webrequest-partitions  --datasets webrequest,raw_webrequest
[13:20:18] <joal>	 on an1003
[13:20:46] <elukey>	 fails because of perms, right
[13:20:51] <joal>	 elukey: also, I love you email about cron alerts :D
[13:21:06] <elukey>	 I clearly need some coffee though!
[13:21:14] <elukey>	 thanks for the explanation joal
[13:22:10] <elukey>	 we can test the stream check cron if you want, it should work
[13:22:19] <elukey>	 it is not spamming anymore
[13:22:24] <joal>	 elukey: meaning, me killing the job, right?
[13:22:29] <elukey>	 yep
[13:22:52] <joal>	 Doing !
[13:23:10] <joal>	 !log Killing banner-streaming job to have it auto-restarted from cron
[13:23:11] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:30:18] <joal>	 elukey: Sorry, needed to do some manual patches before actually testing - Job is now properly killed
[13:30:32] <joal>	 Well actually, just got restarted :)
[13:31:14] <joal>	 mwarf elukey - looks like it didn't restart correctly :(
[13:33:38] <elukey>	 joal: what is the issue? the cron or the command executed?
[13:35:24] <joal>	 elukey: worst: class not found !
[13:37:37] <joal>	 WAAAT? Looks like the patch about the banner-job didn't make it to v0.0.57????
[13:37:42] <joal>	 How is that even possible
[13:43:35] <joal>	 elukey: looks like code made it, but I can't find it :(
[13:44:41] <joal>	 rooh elukey - triple checking, but I htink I have understood
[13:44:56] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893309 (10Xqt) The `Event.__dict__` gives always:  `{'data': '', 'event': 'message', 'id': None, 'retry': None}`
[13:45:10] <wikibugs>	 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Explain decrease in number of patchset authors for same time span when accessed 3 months later - https://phabricator.wikimedia.org/T184427#3893310 (10Aklapper)
[13:45:33] <wikibugs>	 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Explain decrease in number of patchset authors for same time span when accessed 3 months later - https://phabricator.wikimedia.org/T184427#3882640 (10Aklapper)
[13:46:02] <joal>	 cd ..
[13:46:04] <joal>	 oops
[13:48:06] <elukey>	 need to got afk for a bit due to an issue in the coworking, brb!
[13:54:01] <wikibugs>	 (03PS1) 10Joal: Manually refinery-job-spark-2.1 jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403653 (https://phabricator.wikimedia.org/T176983)
[13:54:16] <joal>	 elukey: when you're back --^
[14:05:42] <icinga-wm>	 PROBLEM - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is CRITICAL: CRITICAL - druid_realtime_banner_activity is 0 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[14:07:24] <joal>	 !log Manually restarting banner streaming job to prevent alerting
[14:07:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:08:42] <icinga-wm>	 RECOVERY - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is OK: OK - druid_realtime_banner_activity is 659 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[14:13:54] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893376 (10Xqt) Here the sseclient dict itself:  ``` {'buf': u'\x00\ufffd\ufffdm\x08\x01E\ufffd\x11\\K&=w\'\ufffd?\ufffd\ufffdP \ufffd\x7f\x7fUZO^\u0717\x06l\ufffdw\u06e0\ufffd...
[14:15:32] <elukey>	 here I am sorry, problems with the heating system :D
[14:15:39] <joal>	 np elukey 
[14:15:53] <joal>	 elukey: sent 2 PRs to fix the cron - Completely my fault - I'm very sorry
[14:16:51] <wikibugs>	 (03CR) 10Ottomata: "Hm, yes, but this is also going with a different partition scheme than the usual year=2018/month=1/...  I think if we are doing the full d" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403484 (https://phabricator.wikimedia.org/T170764) (owner: 10Milimetric)
[14:18:07] <elukey>	 joal: one is https://gerrit.wikimedia.org/r/403653, don't find the second
[14:18:23] <elukey>	 ah https://gerrit.wikimedia.org/r/403655
[14:18:56] <wikibugs>	 (03PS2) 10Joal: Manually add refinery-job-spark-2.1 jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403653 (https://phabricator.wikimedia.org/T176983)
[14:19:07] <joal>	 elukey: just corrected the typo above
[14:19:13] <joal>	 Looks like you find the other one
[14:20:05] <elukey>	 and I guess we need to first deploy the other one right?
[14:20:34] <wikibugs>	 (03CR) 10Joal: "Ah! Sorry - Completely missed the partition-format change." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403484 (https://phabricator.wikimedia.org/T170764) (owner: 10Milimetric)
[14:20:41] <joal>	 elukey: correct
[14:20:55] <milimetric>	 thanks, joal, so should I deploy again then?
[14:21:07] <milimetric>	 or does anyone want anything else in -refinery before I deploy?
[14:21:22] <elukey>	 milimetric: probably https://gerrit.wikimedia.org/r/403653 would need to be deployed as well
[14:21:31] <joal>	 milimetric: I can also do it, I broke our plan with elukey (soooorry)
[14:22:08] <joal>	 milimetric: do you need to deploy for the interlanguage patch, or have you done it already?
[14:22:09] <elukey>	 joal: do we usually add jars manually in this way? Super ignorant about it, if you are sure about the procedure please +2 and go ahead
[14:22:21] <elukey>	 also, hello milimetric ! 
[14:22:23] <milimetric>	 uh... not sure what that jar thing's about, go ahead joal 
[14:22:37] <milimetric>	 hi :)
[14:22:53] <joal>	 elukey: manually adding jars is the old way - We now use jenkins to do that - But I don't know how to automate
[14:23:34] <joal>	 ottomata: would you mind confirming  https://gerrit.wikimedia.org/r/403653 is ok for you?
[14:24:06] <ottomata>	 hii
[14:24:28] <joal>	 Hello ottomata :)
[14:24:36] <ottomata>	 oo joal, so that is build from the refinery-job-spark subproject you made in that one commit?
[14:24:38] <joal>	 Sorry for the not-so-nice welcome :S
[14:24:42] <ottomata>	 it is nice!
[14:24:44] <joal>	 correct
[14:24:59] <ottomata>	 why is it manual?  i don't mind at all just wondering
[14:25:05] <ottomata>	 so you don't have to do the whole release?
[14:25:22] <joal>	 ottomata: because I don't know how to automate it - looking now
[14:25:42] <joal>	 And to prevent having to re-do the whole release thing as well
[14:26:00] <ottomata>	 hm, oh does the jenkins job have the sub projects built into the configs?
[14:26:19] <ottomata>	 i would  thikn it would be just adding the refinery-job-spark subproject to the parent pom, and then everything would work as normal
[14:26:21] <ottomata>	 buut anyway
[14:26:24] <ottomata>	 i def don't mind this at all
[14:26:31] <wikibugs>	 (03CR) 10Ottomata: [C: 031] Manually add refinery-job-spark-2.1 jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403653 (https://phabricator.wikimedia.org/T176983) (owner: 10Joal)
[14:26:52] <wikibugs>	 (03PS5) 10Mforns: [WIP] Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530)
[14:27:02] <joal>	 ottomata: jar is release to archive, but jenkins doesn't know about it when patching refinery
[14:27:15] <joal>	 ok, merging then deploying
[14:27:20] <joal>	 Thanks ottomata :)
[14:27:31] <ottomata>	 ohhh intersering ok
[14:27:44] <joal>	 ottomata: do you know where our jenkins code lives?
[14:27:52] <joal>	 I'll gladly provide a patch :)
[14:29:36] <wikibugs>	 (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/403653 (https://phabricator.wikimedia.org/T176983) (owner: 10Joal)
[14:30:11] <joal>	 milimetric: the patch I commented, have you dpeloyed it already?
[14:30:19] <elukey>	 joal,ottomata - if you are ok I'd follow the cluster shutdown in https://etherpad.wikimedia.org/p/analytics-cdh5.10 for labs and install openjdk 8
[14:30:22] <milimetric>	 no joal
[14:30:47] <milimetric>	 once refinery is deployed, I'll restart the job.  I'm dropping old data now
[14:31:03] <joal>	 milimetric: Ok, just so that you know: I deploy now :)
[14:31:15] <joal>	 and your patch is in (confirmed)
[14:31:34] <milimetric>	 k, great thx
[14:33:19] <joal>	 !log Deploy refinery with Scap
[14:33:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:37:39] <joal>	 ottomata: looks I broke something :(
[14:38:04] <joal>	 ottomata: deployment went fine, but I get an error when going to /srv/deployment/analytics/refinery on stat1004
[14:38:26] <ottomata>	 joal:  i think i broke eventstreams yesterday, (BUT IT WAS FINE), gonna focus on that real quick...
[14:38:48] <joal>	 sure ottomata 
[14:40:02] <joal>	 ottomata: I checked on an1003 and stat1005, everything looks fine - seems stat1004 related
[14:41:41] <joal>	 elukey: I confirm we can move forward with the puppet patch using new jar
[14:42:29] <elukey>	 ottomata: just found out something uber cool
[14:42:29] <elukey>	 elukey@labpuppetmaster1001:~$ sudo cumin "project:analytics name:hadoop" 'ls -l'
[14:42:33] <elukey>	 6 hosts will be targeted:
[14:42:35] <elukey>	 hadoop-coordinator-1.analytics.eqiad.wmflabs,hadoop-master-[1-2].analytics.eqiad.wmflabs,hadoop-worker-[1-3].analytics.eqiad.wmflabs
[14:43:05] <joal>	 roh elukey, that is super nice :)
[14:43:07] <elukey>	 joal: ack, shall we deploy it?
[14:43:11] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats Bug: Menu to select projects doesn't work (sometimes?) - https://phabricator.wikimedia.org/T179530#3893434 (10mforns) Fixed bugs/nits:  - The one mentioned in the descriprion of this task - If you're typing and click outside the control...
[14:43:12] <joal>	 please elukey :)
[14:43:25] <joal>	 elukey: I apologize again for the mess I made :(
[14:45:09] <elukey>	 joal: how do you dare making a mistake? I am really disappointed
[14:45:24] <elukey>	 from now on I am not going to talk with you anymore
[14:45:27] <elukey>	 :D
[14:45:46] <joal>	 elukey: I'll get out and throw myself against my nearby tree ten times
[14:46:06] * joal is looking his helmet
[14:46:14] <elukey>	 ahahahhaah
[14:46:36] <joal>	 !log Deploy refinery onto HDFS
[14:46:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:47:00] <joal>	 elukey: I think we have an issue with /srv/deploy/analytics/refinery in stat1004
[14:47:12] <joal>	 looks ok on stat1005 and an1003, but broken on stat1004
[14:47:23] <elukey>	 in the meantime, changes deployed to an1003
[14:47:28] <joal>	 elukey: nothing urgent, ottomata knows - just to let you know
[14:47:34] <joal>	 correct elukey 
[14:49:19] <elukey>	 joal: I didn't get what broke on stat1004, scap deploy-log from tin is not clear
[14:49:55] <joal>	 elukey: deploy went fine, but when cd-ing into /srv/, it's not happy
[14:50:20] <elukey>	 elukey@stat1004:/srv/deployment/analytics/refinery$ ls -l
[14:50:21] <elukey>	 total 40
[14:50:23] <elukey>	 etc..
[14:50:32] <elukey>	 git log also points to your change
[14:50:37] <elukey>	 is it missing the new jar?
[14:50:44] <joal>	 elukey: I think it's not
[14:51:12] <elukey>	 lrwxrwxrwx 1 analytics analytics   66 Jan 11 14:34 refinery-job-spark-2.1.jar -> org/wikimedia/analytics/refinery/refinery-job-spark-2.1-0.0.57.jar
[14:51:43] <joal>	 elukey: when I cd I get: https://gist.github.com/jobar/15e4e64a0b706bde44eed2e908853114
[14:52:38] <elukey>	 even if you close your ssh session and retry?
[14:53:57] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats Bug: Menu to select projects doesn't work (sometimes?) - https://phabricator.wikimedia.org/T179530#3893481 (10mforns) Improvements: - The user can choose language, project family or special wiki in any order - The user can choose dbname...
[14:54:39] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893484 (10Ottomata) Ah, I did deploy EventStreams yesterday for T171011.  I don't know exactly what caused this change, but I think the `event.data` is now utf-8 encoded.  I d...
[14:54:39] <joal>	 elukey: yessir
[14:56:19] <elukey>	 can't really explain it
[14:56:28] <joal>	 :(
[14:56:35] <wikibugs>	 (03PS6) 10Mforns: Improve WikiSelector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530)
[14:56:39] <ottomata>	 hmmi think dan saw this the other day too
[14:57:02] <ottomata>	 dunno why it would work for us though elukey
[14:57:17] <ottomata>	 i guess joal's git/git-fat wants to write temp files in .git/fat/objects/
[14:57:19] <ottomata>	 but dosen't have write perms
[14:57:33] <ottomata>	 we don't have write perms there either
[14:57:34] <ottomata>	 so dunno
[14:57:44] <elukey>	 yeah it is a bit weird
[15:00:25] <joal>	 mwarf
[15:02:01] <ottomata>	 joal:  things work for you there though, right? just annoying errors in that dir?
[15:03:56] <joal>	 ottomata: did not even try to make it work
[15:04:05] <joal>	 ottomata: use an1003 to deploy
[15:04:10] <ottomata>	 aye k
[15:04:23] <joal>	 milimetric: refinery deployed on HDFS - Feel free to test your new job
[15:04:33] <milimetric>	 thanks joal 
[15:04:34] <joal>	 team - need to drop to catch Lino
[15:04:40] <joal>	 Will be back for statndup
[15:11:19] <milimetric>	 ottomata: dr0ptp4kt wants superset access
[15:11:30] <ottomata>	 ok!
[15:11:40] <milimetric>	 is there an official way to do this?  This might get tedious :)
[15:12:18] <ottomata>	 you can do it too!
[15:13:27] <ottomata>	 dr0ptp4kt done
[15:13:29] <ottomata>	 that is your username
[15:13:34] <ottomata>	 milimetric: https://superset.wikimedia.org/users/list/
[15:13:40] <ottomata>	 click the little + button in the upper right
[15:13:54] <ottomata>	 they just need an ldap account and to be in either the wmf or nda ldap grou
[15:13:57] <ottomata>	 if they are, you can add them
[15:14:07] <milimetric>	 oh cool
[15:14:13] <milimetric>	 thx
[15:14:26] <ottomata>	 you just have to fill out the form with info from https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml
[15:14:34] <ottomata>	 email and username
[15:14:41] <ottomata>	 gotta be shell username, not wikitech ldap username
[15:14:51] <ottomata>	 e.g.
[15:14:53] <ottomata>	 https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L1071
[15:15:07] <ottomata>	 put people in the Alpha role
[15:15:12] <ottomata>	 and make them active :
[15:15:12] <ottomata>	 :)
[15:15:26] <ottomata>	 yeah, i'd prefer if  we didn't have to manually add users
[15:15:41] <ottomata>	 buuut, we are kinda waiting for  patches and releases in two upstreams
[15:16:16] <milimetric>	 yeah, it's ok for now
[15:18:29] <elukey>	 so update-java-alternatives returns this when I try to set j8
[15:18:30] <elukey>	 update-alternatives: error: no alternatives for mozilla-javaplugin.so
[15:18:33] <elukey>	 update-java-alternatives: plugin alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/IcedTeaPlugin.so
[15:18:59] <elukey>	 but if I execute it with --jre and --jre-headless works
[15:19:13] <elukey>	 it seems an issue with missing plugins, but probably completely not related ?
[15:24:42] * elukey reads https://en.wikipedia.org/wiki/IcedTea
[15:42:05] <elukey>	 it seems not important, also triple checked with Moritz
[15:51:26] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893739 (10zhuyifei1999) I can't reproduce this locally with python2: ``` $ PYWIKIBOT2_NO_USER_CONFIG=1 python pwb.py shell No handlers could be found for logger "pywiki" Welco...
[15:53:28] <moritzm>	 the plugin is pretty much dead anyway with Firefox 57
[15:57:55] <wikibugs>	 (03PS3) 10Fdans: Translate g according to the y-axis width [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138)
[15:58:12] <wikibugs>	 (03CR) 10Fdans: Translate g according to the y-axis width (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138) (owner: 10Fdans)
[16:04:27] <elukey>	 ah! It might not be possibile to have both jdk versions on the same box
[16:04:43] <elukey>	 in ps I am still seeing /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java etc..
[16:04:45] <fdans>	 milimetric: if you're ok with the margin fix, I'd like to merge and deploy everything we have pending :)
[16:05:05] <elukey>	 and from the init.d scripts, it seems that /usr/lib/bigtop-utils/bigtop-detect-javahome favor java 7 over 8
[16:05:28] <milimetric>	 fdans: I was in the middle of reviewing your map, want me to finish that and we can do that too?
[16:05:44] <elukey>	 but maybe there is a way to specify JAVA_HOME
[16:05:46] <fdans>	 cool!
[16:06:14] <milimetric>	 (I'm slow this week 'cause ops week turns me into a suspicious hamster)
[16:07:14] <elukey>	 milimetric: sorry there is a lot of spam coming from me, let me know if you go in a rabbit hole, it might have already been resolved :(
[16:08:07] <milimetric>	 elukey: no, it's ok, I just double-check everything anyway.  But we should probably talk about who's supposed to restart jobs, I'm happy to do it on my ops week but I'm always going to be slower than you
[16:08:27] <nuria_>	 elukey: the errors with dropping data  like " Error trying to check druid datasource webrequest. " were coming from labs then?
[16:08:32] <milimetric>	 hm, this is a weird non-error error when trying to fix hive:
[16:08:40] <milimetric>	 https://www.irccloud.com/pastebin/OkcCciVL/
[16:08:59] <milimetric>	 what's it mean "Tables not in metastore:	interlanguage_navigation" ?
[16:09:01] <wikibugs>	 (03CR) 10Mforns: [V: 032 C: 031] "LGTM! Although I +1'ed one of Fran's comments." (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402466 (https://phabricator.wikimedia.org/T183188) (owner: 10Milimetric)
[16:09:19] <elukey>	 nuria_: if in the subject of the email you see hadoop-coordinator-1 then it comes from labs
[16:09:38] <nuria_>	 elukey: k it does
[16:09:42] <elukey>	 milimetric: me and joseph usually log things in https://tools.wmflabs.org/sal/analytics
[16:09:58] <elukey>	 so if you are unsure you can check in there
[16:10:01] <nuria_>	 milimetric: i totally missed why are we repairing thta table
[16:10:06] <nuria_>	 *that table
[16:10:18] <milimetric>	 nuria_: the old table was going to interlanguage/navigation
[16:10:25] <milimetric>	 and it needs to go to interlanguage/navigation/daily
[16:10:54] <milimetric>	 otherwise the success flags and the data are in different directories, you commented on how I should've tested that in the patch last night (you're right, btw, I should've done that)
[16:11:10] <milimetric>	 so I dropped the old table, and created the new one
[16:11:25] <milimetric>	 but I had already run a few days of the new job, which were fine
[16:11:34] <nuria_>	 milimetric: you are executing that on "default" database though
[16:11:36] <milimetric>	 so I moved those directories, and ran the repair command
[16:11:44] <nuria_>	 milimetric: the "repair"
[16:11:46] <milimetric>	 hahahaha... wait, nooo, I tried it on wmf
[16:11:49] <milimetric>	 omg
[16:11:57] <nuria_>	 milimetric: that is why table is not defined
[16:12:03] <milimetric>	 oh, right FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
[16:12:10] <milimetric>	 wth...
[16:12:31] <milimetric>	 so when I run it in the _correct_ database (doh) it gives me this error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
[16:12:44] <nuria_>	 elukey (cc milimetric ) and how about connection errors to analytics1058.eqiad.wmnet those are the reboots right?
[16:13:55] <nuria_>	 milimetric: i do not think  i have run repair before , let me look what it does
[16:16:34] <nuria_>	 milimetric: redaing about repair it fixes partitions but not base table location right? (is that what we are trying to correct?)
[16:17:06] <milimetric>	 no, it knows about partitions from new data being inserted
[16:17:10] <milimetric>	 and the old data is in the right spot
[16:17:29] <elukey>	 nuria_: correct
[16:17:31] <milimetric>	 so it should just infer partitions from the old data
[16:18:29] <wikibugs>	 (03CR) 10Mforns: "LGTM overall :] One question though, is it possible that with very large numbers the charts are shifted so much that the chart overflows o" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138) (owner: 10Fdans)
[16:19:45] <wikibugs>	 (03CR) 10Fdans: "> LGTM overall :] One question though, is it possible that with very" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138) (owner: 10Fdans)
[16:22:12] <nuria_>	 milimetric: mmmm...from reading the repair command i got that it will add partitions like base-location/partition-1  base-location/partition-2 if they have been added by  a process not in hive but i do not think it can reconcile partitions that already exist in hive  
[16:22:19] <nuria_>	 milimetric: hard to tell though
[16:22:47] <milimetric>	 nuria_: right, these partitions were added manually by file copy
[16:23:01] <nuria_>	 ah i see
[16:23:03] <nuria_>	 ok
[16:23:07] <milimetric>	 it's weird because I've used this command a lot, and in exactly the same way
[16:23:18] <milimetric>	 so something deeper is different that I don't understand
[16:23:46] <nuria_>	 milimetric: and adding partitions to table by hand did not work?
[16:24:05] <milimetric>	 I'll try that now, just wondering if anyone else got this error before
[16:24:06] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893869 (10zhuyifei1999) >>! In T184713#3893484, @Ottomata wrote: > To fix on your side, you'd either: A. run in python3, where everything is utf-8 anyway, or B. change [[ http...
[16:24:31] <milimetric>	 there were a few similar bugs on stackoverflow and their issue tracker but nothing exactly the same
[16:24:41] <milimetric>	 and their proposed changes didn't work
[16:27:42] <nuria_>	 milimetric: repair just add partitions right , by checking partitions existent on meta store
[16:27:54] <milimetric>	 yes
[16:28:20] <nuria_>	 sorry by checking partitions  in hdfs 
[16:28:54] <nuria_>	 milimetric: so we should be able to add them by add partition command, maybe that one gives a more pertinent error
[16:28:59] <wikibugs>	 10Analytics, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3893881 (10Ottomata) > event.data in python 2 is an instance of unicode Hm, you are right.    I'm not sure what is going on then.  I can't really reproduce this either, but I'm...
[16:30:26] <milimetric>	 nuria_: yeah, I'll try that in a second, just gotta finish Fran's review
[16:32:21] <elukey>	 so tried with export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre in yarn-env.sh, it works
[16:33:16] <elukey>	 ottomata: would it be ok to make an explicit parameter of cdh? if not set (default) it doesn't get rendered leaving things as they are (auto-detect), meanwhile if set if forces a jvm version
[16:35:53] <ottomata>	 elukey:  +1
[16:38:15] <wikibugs>	 (03CR) 10Milimetric: [C: 032] Translate g according to the y-axis width [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/403184 (https://phabricator.wikimedia.org/T184138) (owner: 10Fdans)
[16:38:47] <milimetric>	 fdans: I have a few comments on the map change, maybe let me fix my docs change and we can deploy just those, do the map later
[16:43:12] <wikibugs>	 (03PS2) 10Milimetric: Add documentation links to each metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402466 (https://phabricator.wikimedia.org/T183188)
[16:43:23] <wikibugs>	 (03CR) 10Milimetric: Add documentation links to each metric (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/402466 (https://phabricator.wikimedia.org/T183188) (owner: 10Milimetric)
[16:43:47] <milimetric>	 k, fdans, ready for you to merge/deploy
[16:44:42] <wikibugs>	 10Analytics-Tech-community-metrics: Number of changeset submitters in "gerrit_main_numbers" widget differs from number of submitters in "gerrit_top_developers" widget - https://phabricator.wikimedia.org/T184741#3893944 (10Aklapper) p:05Triage>03Normal
[16:46:08] <wikibugs>	 10Analytics-Tech-community-metrics: Number of changeset submitters in "gerrit_main_numbers" widget differs from number of submitters in "gerrit_top_developers" widget - https://phabricator.wikimedia.org/T184741#3893944 (10Aklapper)
[16:47:04] <wikibugs>	 (03CR) 10Milimetric: "Some initial comments, I haven't played with the actual UI yet" (035 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans)
[16:56:39] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3894009 (10elukey) Tested in labs the procedure outlined above (install + update-java-alternatives to java8) and everything went fine. The following er...
[17:00:38] <nuria_>	 ping ottomata standddupp
[17:01:08] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3894034 (10Ottomata) Here's a Q:  In cergen, I'm generating EC keys using a [[ https://cryptography.io/en/latest/hazmat/primitives/asymmetric/ec...
[17:05:39] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3894039 (10Ottomata) > /usr/lib/bigtop-utils/bigtop-detect-javahome, that seems to favor java7 over java8. Strange that it favors Java 7 even if update...
[17:08:25] <ottomata>	 !log restart kafka on kafka-jumbo1001...something is not right with my certpath change yesterday
[17:08:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:11:00] <ottomata>	 !log restart kafka on kafka-jumbo1003
[17:11:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:28:29] <wikibugs>	 10Analytics-Kanban: Add ISO code to AQS data per country - https://phabricator.wikimedia.org/T184748#3894113 (10Nuria)
[17:28:53] <wikibugs>	 10Analytics-Kanban: Add ISO code to AQS data per country - https://phabricator.wikimedia.org/T184748#3894125 (10Nuria)
[17:28:56] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Analytics-Wikistats, 10RESTBase-API, 10Services (done): Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520#3894124 (10Nuria)
[17:30:13] <wikibugs>	 10Analytics-Kanban: Add ISO code to AQS data per country - https://phabricator.wikimedia.org/T184748#3894113 (10Nuria) This involves:  - adding a new column to cassandra - changing loading jobs
[17:33:09] <wikibugs>	 10Analytics-Kanban: Puppetize job that saves old versions of geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (10Nuria)
[17:37:14] <joal>	 !log Kill manual banner-streaming job to see it restarted by cron
[17:37:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:38:19] <wikibugs>	 10Analytics, 10Analytics-Dashiki: Enable nested on-wiki config pages in mediawiki-storage - https://phabricator.wikimedia.org/T163725#3894182 (10Nuria)
[17:38:27] <joal>	 Hi marlier - I notice you have a huge hive query running - I wonder if this kind of query could not benefit froim sampling (given uou use a full month of data)
[17:39:57] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Analytics Cluster to Java 8 - https://phabricator.wikimedia.org/T166248#3894189 (10elukey) it does the following (maybe I am reading the code in the wrong way):  ``` # Note that the JDK versions recommended for production u...
[17:40:11] <joal>	 marlier: Even if sampling is not an option, could you please follow the parttern describe here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Run_long_queries_in_a_screen_session_and_in_the_nice_queue in order to let smaller requests be executed faster?
[17:40:16] <joal>	 marlier: Thanks :)
[17:41:32] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: When displaying a graph include metric total not only average - https://phabricator.wikimedia.org/T184139#3894194 (10Nuria)
[17:43:41] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3894199 (10RobH) We don't have any spare hardware with SSDs, but do have spares with 1TB SATA.  wmf4750 - Dell PoweEdge R430 - Dual Intel Xeon E5-2640 v3 2.6GHz - 64GB RAM  Oxygen has only 79...
[17:43:47] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3894202 (10RobH) a:03faidon
[17:50:22] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats Bug: Menu to select projects doesn't work (sometimes?) - https://phabricator.wikimedia.org/T179530#3894217 (10mforns)
[17:50:24] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Wikistats Bug - https://phabricator.wikimedia.org/T184475#3894219 (10mforns)
[17:52:23] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10ORES, 10Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479#3884392 (10Milimetric) totally, put a meeting on our calendar or let's chat here.
[17:53:32] <wikibugs>	 10Analytics, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3894231 (10Nuria)
[17:53:43] <wikibugs>	 10Analytics, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3884526 (10Nuria) We will be killing that instance
[17:54:15] <wikibugs>	 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3894238 (10Nuria)
[17:55:31] <wikibugs>	 10Analytics-Kanban, 10RESTBase-API, 10Services (watching): Update AQS pageview-top definition - https://phabricator.wikimedia.org/T184541#3894241 (10Nuria)
[17:56:03] <robh>	 nuria_: haha, im working on that task you moved right now
[17:56:13] <robh>	 we have a spare we can likely allocate =]
[17:56:14] <wikibugs>	 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3894245 (10elukey) 05Open>03Resolved a:03elukey Instance deleted!
[17:56:28] <wikibugs>	 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3894248 (10elukey) 05Resolved>03Open
[17:56:59] <wikibugs>	 10Analytics-Kanban, 10Puppet, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3884526 (10elukey) Just seen that there are more instances to fix. Some of them are under experiment at the moment, will try to fix them asap though.
[17:57:45] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3894260 (10RobH) a:03Ottomata So we have a spare server that would actually meet this requirement without ordering more hardware:  wmf4751 - w...
[17:59:08] <wikibugs>	 10Analytics: Transform and Import Qualtrics Survey data - https://phabricator.wikimedia.org/T184626#3890422 (10Nuria) ideas: put data into mysql labs and have a superset labs instance ?
[17:59:46] <wikibugs>	 10Analytics: Transform and Import Qualtrics Survey data - https://phabricator.wikimedia.org/T184626#3890422 (10Nuria) Maybe put data into mysql on labs once transformed?
[18:01:39] <wikibugs>	 10Analytics-Kanban, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3894306 (10Nuria)
[18:04:26] <wikibugs>	 10Analytics: Investigate oozie suspended workflows - https://phabricator.wikimedia.org/T163933#3894310 (10Nuria) a:03JAllemandou
[18:04:46] <wikibugs>	 10Analytics: Investigate oozie suspended workflows - https://phabricator.wikimedia.org/T163933#3215250 (10Nuria) Can @JAllemandou please document this on our oncall docs?
[18:04:59] <wikibugs>	 10Analytics-Kanban: Investigate oozie suspended workflows - https://phabricator.wikimedia.org/T163933#3894312 (10Nuria)
[18:07:39] <wikibugs>	 10Analytics, 10User-Elukey: Secure hue and other private data access sites with 2FA - https://phabricator.wikimedia.org/T159584#3894317 (10Nuria)
[18:08:58] <chelsyx>	 Hello a-team! I'm working with the mobile apps session metrics tables(https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_session_metrics) and have some questions about how the data is generated, especially how this quantile function work (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L42-L54).
[18:10:00] <chelsyx>	 For the quantiles, instead of a number, there are lowbound and highbond for each quantiles. I don't understand what that means...
[18:10:28] <joal>	 Hi chelsyx - We're in meeting, but the code is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala
[18:10:55] <joal>	 chelsyx: The quantile function we use is from com.twitter.algebird
[18:11:00] <joal>	 marlier: ping?
[18:15:37] <chelsyx>	 joal: Sorry for interrupt your meeting. I will put my questions here, but feel free to reply after your meeting :)
[18:19:05] <chelsyx>	 joal: Yes, I looked into the quantile function, and found this doc for the Qtree: https://twitter.github.io/algebird/datatypes/approx/q_tree.html
[18:21:08] <chelsyx>	 joal: But it doesn't make sense to me... In the example they gave, the range of 50th percentile is (5.0,6.0) for data = List(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
[18:22:04] <chelsyx>	 joal: But the 50th quantiles of this list should be 4.5, which falls out of the range of (5, 6)
[18:22:24] <chelsyx>	 joal: Maybe you can help me understand it?
[18:23:09] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3894361 (10Ottomata) a:05Ottomata>03faidon Great, that'll do just fine!  Assigned to @faidon for approval.
[18:34:36] <marlier>	 joal: sorry, was afk, will be back at a terminal in about 10 and can kill my job then (if it hasn't already been) 
[18:34:40] <marlier>	 Sorry about that! 
[18:35:13] <elukey>	 ottomata: so the cluster in labs is now running with java 8, all daemons
[18:35:22] <ottomata>	 awesooome
[18:35:33] <ottomata>	 camus/refine stuff still working?
[18:35:48] <elukey>	 still not sure, will need to check :)
[18:36:13] <elukey>	 atm I am trying to get why hive (client) works fine 
[18:36:26] <elukey>	 so I straced it and it sources hadoop-env.sh :)
[18:36:30] <elukey>	 but not sure where
[18:36:45] <elukey>	 same thing for spark-shell I think
[18:38:40] <elukey>	 tailing also /var/log/camus/webrequest.log looks good
[18:39:15] <elukey>	 so still need to figure out a couple of details, buuuut overall looks very good
[18:39:16] <ottomata>	 gr8
[18:39:18] <ottomata>	 cooool
[18:43:17] * elukey off!! 
[18:43:54] <wikibugs>	 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3894446 (10Nuria) Data will be updated monthly (?) (maybe data needs to be updated more frequently?)  * scoping https://www.mediawiki.org/wiki/Extension:CheckUser/cu_changes_table  * join the data with user  and...
[18:45:51] <wikibugs>	 10Analytics: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#3894448 (10Nuria) We can get started in scooping the cu_changes_table
[18:48:02] <wikibugs>	 10Analytics-Kanban: Scoop https://www.mediawiki.org/wiki/Extension:CheckUser/cu_changes_table - https://phabricator.wikimedia.org/T184759#3894451 (10Nuria) p:05Triage>03High
[18:55:20] <marlier>	 joal: killed
[18:56:03] <joal>	 marlier: no big deal, it was more af a (more-or-less-gentle) ping on resource usage :)
[18:56:14] <joal>	 thanks a lot marlier 
[18:56:46] <joal>	 marlier: if you're after trends on big volumes of data, sampling is your friend on webrequest data
[18:57:01] <joal>	 marlier: you'll divide usage by 64, so that's really interesting
[18:59:49] <chelsyx>	 joal: I just found this: https://github.com/twitter/algebird/issues/517
[19:00:20] <chelsyx>	 joal: looks like the question I have is a problem of the Qtree function...
[19:01:17] <marlier>	 joal: it actually looks like the pageview_hourly table should have what I'm looking for already
[19:01:21] <marlier>	 I just didn't know it existed
[19:01:23] <chelsyx>	 joal: Could we change our implementation from Long to Double, as the issue suggested?
[19:01:32] <DarTar>	 joal, nuria, bearloga: just emailed you all about the (hopefully) final steps before publishing the blog post, take a look 
[19:02:15] <DarTar>	 I’m also sending a note to Ellery, he’ll be excited to hear this
[19:02:17] <joal>	 chelsyx: I guess updating the types is a solution - we need to check with other users that it works for them
[19:02:20] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3894549 (10BBlack) No, it's not a problem.  For certificates, `NIST P-256` (aka `secp256r1`, aka `prime256v1`, depending on who's talking) is re...
[19:02:26] <joal>	 chelsyx: an you please file a phab taks?
[19:02:35] <chelsyx>	 joal: Sure!
[19:03:01] <joal>	 marlier: if pageview_hourly has what you want, have a look at projectview_hourly as well: even ssaller for high level use cases
[19:03:12] <joal>	 many thanks for looking onto that chelsyx !!
[19:05:57] <chelsyx>	 joal: No problem!
[19:06:59] <chelsyx>	 joal:  Also, do you think it's doable to aggregate the data by day, instead of by week?
[19:07:30] <joal>	 chelsyx: the weekly request was from the reading tema that asked for the data
[19:07:37] <joal>	 I can't recall precisely who it was
[19:08:36] <chelsyx>	 joal: I see. I will put this in the ticket too, and see if it works for other folks
[19:08:58] <joal>	 chelsyx: Better to have 2 tickets  - 1 for the bug, the other for a feature request - Please :)
[19:09:23] <chelsyx>	 joal: Got you!
[19:12:57] <joal>	 Gone for diner a-team, back after
[19:13:07] <chelsyx>	 joal: Also, do you think it's doable to use some other function to get a single value (not a range) for the quantiles? I know this may require a lot more work and I obviously don't know what is workable in our implementation...
[19:13:37] <chelsyx>	 joal: But when talking about quantiles, most people would expect a single value, not a range...
[19:30:06] <nuria_>	 marlier: have you looked at pivot.wikimedia.org? you can visualy explore that data
[19:31:44] <marlier>	 I have.  I need to be able to do a bunch of different ad-hoc aggregations and stuff, it'll actually be simplest for me to just get it into a spreadsheet and work with it there.  (I'm trying to make sure that I don't do anything bad to event logging when we start performance oversampling for Singapore going live :-) )
[19:33:19] <nuria_>	 chelsyx: probabilistic quantiles have  arange because they are not exact
[19:33:54] <nuria_>	 chelsyx:  it is not always possible when  (big data-ing) to calculate quantiles precisely
[19:34:28] <nuria_>	 chelsyx: hopefully this makes sense, if things have not changed our qtree quantile calculation is a probabilistic one
[19:35:01] <nuria_>	 chelsyx: qtree is build for streaming data 
[19:35:16] <nuria_>	 chelsyx: for which the bounds of the data set are not known
[19:38:03] <nuria_>	 chelsyx:  providing a precise quantile for this data might be possible (app data is not super large) in general I think is probably the wrong way to look at (very) large datasets though
[19:41:31] <nuria_>	 chelsyx: see http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
[19:47:21] <chelsyx>	 nuria_: Got it. Thanks! 
[19:47:28] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3894725 (10Ottomata) > Oook, I've set this [restricted certpath algorithms] on all jumbo Kafka brokers.  Welp, something is totally crazy with P...
[19:47:41] <ottomata>	 elukey:  FYI, varnishakfak canary is broken due to my jdk.certpath.disabledAlgorithms i deployed yesterday
[19:47:49] <ottomata>	 it has something to do with the puppet signed certificates not working
[19:47:58] <ottomata>	 which is why I didn't catch it in labs; i was using self signed cert CA
[19:48:05] <ottomata>	 (beacuse getting puppet to sign certs in labs is a pain)
[19:48:16] <ottomata>	 am working on it, but just in case you happen to look on syslog on cp1008, that is why...
[19:49:55] <ottomata>	 afk for a bit
[19:57:44] <wikibugs>	 10Analytics, 10Discovery-Analysis, 10MobileApp, 10Wikipedia-Android-App-Backlog: Bug behavior of QTree[Long] for quantileBounds - https://phabricator.wikimedia.org/T184768#3894787 (10chelsyx)
[20:08:49] <joal>	 nuria_: do you have a minute for last-minute modif of blog poast?
[20:30:06] <wikibugs>	 10Analytics-Kanban, 10EventBus, 10Pywikibot-core: EventStreams doesnt find any messages anymore - https://phabricator.wikimedia.org/T184713#3894927 (10Xqt) I have no glue what is failing there. Our pagegenerators_tests on Travis ci works fine for all python versions. On the other hand I have two scripts runn...
[20:38:39] <nuria_>	 joal: yes, here
[20:39:21] <joal>	 Heya nuria_ - I modified in the middle of the article  - Do you mind having a look, we can discuss after
[20:39:29] <nuria_>	 joal: ya looking
[20:41:00] <nuria_>	 joal: looks good, corrected  grammar ( i think)
[20:41:15] <joal>	 Thanks :)
[20:43:15] <joal>	 nuria_: I promise it was hte last one I had :)
[20:43:33] <nuria_>	 joal: no worries, i think is 99% done
[20:44:06] <joal>	 Yay :)
[20:45:59] <joal>	 nuria_: While thinking about it - Do we keep or delete that one https://phabricator.wikimedia.org/T183951 ?
[20:46:37] <nuria_>	 joal: we can delete if it only exists temporarily as part of the druid ingestion, is that the case?
[20:47:27] <joal>	 It is nuria_ 
[20:47:33] <joal>	 I'll mark it as invalid
[20:47:38] <nuria_>	 joal: then let's decline
[20:47:55] <wikibugs>	 10Analytics-Kanban: Document mediawiki history reduced table - https://phabricator.wikimedia.org/T183951#3894968 (10JAllemandou) 05Open>03declined
[20:47:59] <joal>	 Done - Thanks
[20:48:22] <nuria_>	 joal: one question i have (nothing to do with this)
[20:48:26] <joal>	 sure
[20:48:50] <nuria_>	 joal: on mforns to move data from spark to druid he creates a "master" ingestion spec that is used.
[20:48:57] <nuria_>	 joal: let me sjow you
[20:49:00] <nuria_>	 *show 
[20:49:32] <nuria_>	 joal: https://gerrit.wikimedia.org/r/#/c/386882/24/refinery-core/src/main/scala/org/wikimedia/analytics/refinery/core/DataFrameToDruid.scala
[20:49:51] <joal>	 nuria_: could be called a generic template
[20:50:13] <chelsyx>	 joal nuria_ : I have another question about the count variable in mobile_apps_session_metrics (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_session_metrics)
[20:50:14] <nuria_>	 joal: i think we can probably  put that in an external file, have this code fill in its values and pass all that to the python druid loader 
[20:50:21] <chelsyx>	 joal nuria_ : so when type="SessionsPerUser", count is the number of users
[20:50:27] <chelsyx>	 joal nuria_ : when type="PageviewsPerSession", count is the number of sessions
[20:50:30] <nuria_>	 joal: so the loading logic is not duplicated 
[20:50:40] <joal>	 very feasible nuria
[20:50:50] <chelsyx>	 joal nuria_ : is that correct?
[20:50:55] <chelsyx>	 joal nuria_ : what is the count when type="SessionLength"?
[20:51:00] <nuria_>	 joal: see on gerrit change "getDruidTaskStatus()"
[20:51:24] <nuria_>	 joal: ok, it seems feasible , just triple checking myself here cc mforns 
[20:51:35] <joal>	 nuria_: I thought it wsas easier originally to have more templates and small pieces of data that changed - but I don't mind moving to smaller template and more moving pieces
[20:52:23] <joal>	 nuria_: only not-so easy part will be to be able to access the same "template" from both scala in refinery-source and python in refinery
[20:52:45] <nuria_>	 joal: right right (again cc mforns ) 
[20:52:59] <nuria_>	 joal: right now this is done "in memory" on teh spark job
[20:53:14] <joal>	 correct - can easily be loaded from path
[20:53:22] <nuria_>	 joal: so maybe the python job can accept a template passed in as a string, does thi seem horrible
[20:53:31] <nuria_>	 joal: cause template on disk will not have values 
[20:53:42] <joal>	 This would mean storing the file in refinery, having it deployed, and reference the deployed path for spark job - very feasible
[20:54:06] <nuria_>	 joal: mforns code needs to run to "fill" in values for template 
[20:54:32] <joal>	 nuria_: not sure I understand then - batcave for a minute?
[20:54:41] <nuria_>	 chelsyx: one sec
[20:54:45] <nuria_>	 joal: yes, omw
[20:57:15] <ottomata>	 !log restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/
[20:57:28] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:09:37] <wikibugs>	 (03CR) 10Nuria: "Taking it back after taking joseph, he does not think that python should be called from spark. Let's talk more about this." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[21:11:04] <nuria_>	 chelsyx: i imagine it is milliseconds or seconds, do those not sound right?
[21:16:37] <nuria_>	 joal: what version of scala do we use, do you know?
[21:16:46] <joal>	 nuria_: as of now we use 2.190
[21:16:55] <joal>	 nuria_: 2.10 sorry
[21:19:27] <joal>	 chelsyx: from the code, sessionLength is is second
[21:24:08] <chelsyx>	 nuria_ joal: Yes, sessionLength is second. But I'm asking about the count variable. In the documentation, it only says count is "Value of count for given metric". So I'm wondering if it is the number of sessions
[21:25:42] <joal>	 I think you're right chelsyx 
[21:27:54] <wikibugs>	 10Analytics, 10Fr-tech-archived-from-FY-14/15, 10Fundraising Tech Backlog, 10Wikimedia-Fundraising, 10Fundraising Sprint Enya: Strategy banner impressions - https://phabricator.wikimedia.org/T90635#3895118 (10DStrine)
[21:29:15] <joal>	 Gone for tonight a-team - see you tomorrow
[21:29:47] <nuria_>	 chelsyx: count is 0 in sessionLength right?
[21:32:53] <chelsyx>	 nuria_: it's not 0
[21:33:25] <nuria_>	 chelsyx: ah sorry, yes you are right
[21:34:03] <chelsyx>	 joal nuria_: but what does count means when type="SessionsPerUser", or type="PageviewsPerSession"? 
[21:34:26] <chelsyx>	 joal nuria_: The counts for these three metrics are different
[21:34:58] <chelsyx>	 https://usercontent.irccloud-cdn.com/file/EicswDgz/Screen%20Shot%202018-01-11%20at%201.34.31%20PM.png
[21:35:15] <nuria_>	 chelsyx: i need to look at code, i have no idea
[21:38:55] <nuria_>	 chelsyx: looks like is the cardinality of the series from whichquantiles are calculated: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L98
[21:39:53] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Fr-tech-archived-from-FY-2015/16, and 4 others: Beta Cluster EventLogging data is disappearing? - https://phabricator.wikimedia.org/T112926#3895273 (10DStrine)
[21:40:07] <wikibugs>	 10Analytics, 10Analytics-Backlog, 10Analytics-EventLogging, 10Fr-tech-archived-from-FY-2015/16, and 4 others: Promise returned from LogEvent should resolve when logging is complete - https://phabricator.wikimedia.org/T112788#3895281 (10DStrine)
[21:43:53] <chelsyx>	 nuria: I'm still confused. so when type="SessionsPerUser", the count is number of sessions, or number of users?
[21:48:16] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Fr-tech-archived-from-FY-2015/16, 10Fundraising Sprint Vengaboys, and 3 others: Impression log parsers should get sample rate from filenames - https://phabricator.wikimedia.org/T116800#3895569 (10DStrine)
[21:50:48] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Fr-tech-archived-from-FY-2015/16, 10Fundraising Tech Backlog, and 6 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#3895652 (10DStrine)
[21:58:16] <nuria_>	 chelsyx: sorry, per every metric teh count refers to the series that was used to calculate quantiles if  metric is "Sessionslength" the count is the number of sessions
[21:59:22] <nuria_>	 chelsyx: if it is  SessionsPerUser "count" should also be the  number of sessions 
[22:00:26] <nuria_>	 chelsyx: I would think 
[22:00:46] <nuria_>	 chelsyx: but seeing that those two numbers are not teh same
[22:04:03] <ottomata>	 !log restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/
[22:04:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:05:17] <nuria_>	 chelsyx: naming not so good
[22:06:10] <nuria_>	 chelsyx: SessionsPerUser is keyed by user, thus count must be number of users: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L137
[22:07:00] <nuria_>	 chelsyx: sessions is a flattened version of sessions per user: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala#L144
[22:07:54] <nuria_>	 chelsyx: thus count of SessionLength must be number of sessions
[22:08:16] <chelsyx>	 nuria_: then when type is PageviewsPerSession, the count must be number of sessions. But the number is different when type is SessionLength
[22:09:09] <nuria_>	 chelsyx: cause sessions with 1 pageview are not counted as such i think
[22:10:39] <chelsyx>	 nuria_: I see
[22:10:48] <chelsyx>	 nuria_: Thank you very much!
[22:10:59] <nuria_>	 chelsyx: please update docs
[22:11:06] <nuria_>	 chelsyx: that would be awesome
[22:11:11] <chelsyx>	 nuria_: Will do!
[22:24:11] <wikibugs>	 10Analytics, 10Research: Formal announcement of productized clickstream dataset - https://phabricator.wikimedia.org/T183097#3895740 (10DarTar)
[22:25:40] <wikibugs>	 10Analytics, 10Research: Formal announcement of productized clickstream dataset - https://phabricator.wikimedia.org/T183097#3843291 (10DarTar) draft completed, will be posted on Monday 1/15.
[22:35:41] <ottomata>	 !log restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774
[22:35:55] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:41:03] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3895798 (10Ottomata) Current status:  kafka-jumbo running with Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:E...
[22:53:46] <wikibugs>	 (03PS25) 10Nuria: Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)