[00:24:21] Analytics, Community-Tech: Data missing in page creation datasets - https://phabricator.wikimedia.org/T185019#3904866 (kaldari) @Milimetric: Any idea why there might be data missing here?
[01:25:45] Analytics, Community-Tech: Data missing in page creation datasets - https://phabricator.wikimedia.org/T185019#3905124 (Milimetric) So, I'm not sure, but my bet is that there was some outage that caused the data to land there after the reports were run. When that happens, you can re-run the reports: htt...
[01:26:35] Analytics, Community-Tech: Data missing in page creation datasets - https://phabricator.wikimedia.org/T185019#3905125 (Milimetric) and if you're nervous about what that did, you can check the .reruns folder: ``` milimetric@stat1006:/srv/reportupdater$ cat /srv/reportupdater/jobs/reportupdater-queries/p...
[03:21:21] (CR) Milimetric: "couple of follow-ups on stuff Fran said and a couple of things of my own. But you and Fran feel free to merge and deploy tomorrow morning" (12 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: Mforns)
[03:28:24] (CR) Milimetric: [C: -1] "couldn't build/run because you forgot some dependencies, check the jenkins error logs, they're the same ones I get" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[07:01:17] mforns: eventlogging cleaner on db1107 completed! \o/
[07:01:43] now I want to see another run today (it will happen in a few hours) and also triple check the tables
[07:01:51] buuuut looking good
[07:01:58] this time no errors reported
[07:05:20] * elukey brb
[08:01:54] so I briefly checked min(timestamp) of all tables, it is consistent with the whitelist
[08:02:04] will do other checks but everything looks good
[08:16:08] * elukey errand for a bit!
[08:44:21] rebooting stat boxes
[08:47:38] \o/
[08:53:31] stat boxes rebooted!
[08:53:47] !log disabled camus as prep step for analytics1003 reboot
[08:53:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:05:58] rebooting an1003
[09:11:41] all right, an1003 back in service
[09:11:48] let's see if everything looks right
[09:11:55] moritzm: Hadoop reboots completed :)
[09:15:38] \o/
[09:31:52] (CR) Mforns: "Thanks for the reviews guys, will take care of comments today." (10 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/402387 (https://phabricator.wikimedia.org/T179530) (owner: Mforns)
[09:46:31] !log removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
[09:46:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
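A minimal sketch of the kind of min(timestamp) check described at 08:01 above, assuming the EventLogging tables live in a `log` database reachable from the stat boxes and that mysql client credentials are already configured; the host, database name and column name are assumptions based on the conversation:
```
# Hypothetical spot check: print the oldest timestamp per EventLogging table so it
# can be eyeballed against the purging whitelist. Host, database name and the
# credentials setup are assumptions.
for t in $(mysql -h db1107.eqiad.wmnet -N -e 'SHOW TABLES FROM log'); do
  printf '%s ' "$t"
  mysql -h db1107.eqiad.wmnet -N -e "SELECT MIN(timestamp) FROM log.\`$t\`"
done
```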
[09:46:57] !log rebooted analytics1003
[09:46:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:49:45] mforns o/
[09:50:31] I am currently trying to figure out what happened in
[09:50:32] https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=1516112999752&to=1516115881761
[09:53:28] !log disable druid middlemanager on druid1002 as prep step for reboot
[09:53:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:22:07] !log reboot druid1002 for kernel upgrades
[10:22:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:24:51] last host standing is druid1001
[10:37:11] !log re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)
[10:37:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:38:54] !log stopped all crons on hadoop-coordinator-1
[10:38:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:08] this should hopefully stop the spam --^
[10:59:11] so webrequest-druid-hourly-wf-2018-1-17-8 seems consistently failing
[10:59:15] I've re-run it two times
[10:59:27] and the logs on druid1002's overlord don't make a lot of sense
[11:25:38] it keeps failing, lovely
[11:27:48] hellooooo
[11:44:14] !log restart druid middlemanager on druid1002
[11:44:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:41] !log re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)
[11:44:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:49] mforns: --^
[11:45:53] elukey, thanks! the hour 9 just failed, should I rerun it?
[11:46:15] check the log :)
[11:46:43] just re-run it, it should be completed in a minute
[11:47:08] it is unfortunate that druid sometimes gets into this state when either zookeeper or the indexer service gets changed
[11:47:09] elukey, ok ok, thanks a lot
[11:47:19] aha
[11:53:28] it took me a bit to figure out that the middlemanager was the issue :(
[11:53:46] mforns: did you read what I posted about the el cleaner earlier on?
[11:53:57] elukey, no, wait
[11:54:06] no, no, was off
[11:54:11] what was it?
[11:54:42] it completed! No issues registered, I'll let today's run go before saying victory but we are close
[11:54:58] moreover I checked all the min(timestamps) across the tables
[11:55:06] \\\\\\o//////
[11:55:07] comparing them with the whitelist
[11:55:12] and they look sane
[11:55:34] \o/ \o/
[11:57:37] also mforns not sure what happened in https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=1516112099034&to=1516116301963
[11:57:45] yesterday burrow alarmed
[11:57:49] and I noticed the gap
[11:58:01] but it is not clear to me from the logs what happened
[11:58:57] whenever you have time (not urgent) could you review it?
[11:59:07] I'll try to do it as well after lunch
[12:17:59] going to lunch!
[12:18:02] * elukey lunch!
[12:31:52] elukey, no no, I will do it! thanks for the heads up
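A rough sketch of how failed coordinator actions like the druid-hourly ones re-run above could also be re-run from the Oozie CLI instead of Hue; the coordinator job id is a placeholder, and OOZIE_URL is assumed to already be configured on the host:
```
# Hypothetical rerun of a failed coordinator action; <coord_job_id> is a placeholder.
oozie job -info <coord_job_id>             # list actions and find the failed one
oozie job -rerun <coord_job_id> -action 8  # re-run that action (e.g. hour 8)
```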
[13:23:50] Analytics-Kanban: Make banner-activiy success file cleaner not fail when there's nothing to be cleaned - https://phabricator.wikimedia.org/T185100#3905982 (mforns)
[13:24:58] (PS1) Mforns: Make banner-actvity cleaner not fail when there's nothing to drop [analytics/refinery] - https://gerrit.wikimedia.org/r/404662 (https://phabricator.wikimedia.org/T185100)
[13:25:54] Analytics-Kanban, Patch-For-Review: Make banner-activiy success file cleaner not fail when there's nothing to be cleaned - https://phabricator.wikimedia.org/T185100#3905982 (mforns)
[13:39:12] (PS10) Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[13:41:57] (CR) jerkins-bot: [V: -1] Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[13:47:27] so I got an alert for banner impression stream not pushing data for the past 30 mins
[13:47:59] https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[13:48:28] but https://yarn.wikimedia.org/proxy/application_1515441536446_8509/ doesn't look blocked
[13:51:13] elukey, hm
[13:53:14] elukey, no idea on what to do here
[13:57:27] but indeed pivot is showing no data since 1 hour ago, actually flapping since 2 hours ago
[13:58:40] and also you can see a huge drop in activity since Jan 15th at 7pm
[13:58:56] https://tinyurl.com/yal7a2ba
[14:01:21] hm, it has been going on for a couple days now, and the daily job ran already and confirmed the data for Jan 15th and Jan 16th
[14:01:32] so I'd say it's not an issue with the streaming job
[14:01:36] rather a data issue
[14:05:20] the banner data query is pretty straightforward...
[14:07:25] yeah..
[14:07:39] mforns: what is the filter/source for kafka data?
[14:07:51] we can easily check if data is there or not
[14:08:08] elukey, no the issue should be already in webrequest table
[14:08:32] I suspect the data comes corrupt or simply does not come
[14:09:11] it makes sense though that banner impression data decreased a lot over the past weeks
[14:10:26] elukey, yes, the number of webrequests with uri_path = '/beacon/impression' was in the order of millions in Jan 12th, and since 15th is in the order of thousands.
[14:11:06] let's tail the webrequest log and see :)
[14:11:20] sure :]
[14:15:09] ( ^_^)/
[14:15:34] o/
[14:15:40] mforns: kafkacat shows some activity
[14:16:13] iirc spark is pulling directly from webrequest right?
[14:16:16] elukey, I think even with the low volume since the 15th, there is actually a problem today
[14:16:22] yeah
[14:16:26] elukey, no
[14:16:31] I think it pulls from kafka
[14:16:47] through tranquility
[14:16:47] yeah sorry I wanted to say webrequest data on kafka
[14:16:53] ah! sorry my bad
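A minimal sketch of the Hive check mentioned above (counting '/beacon/impression' webrequests for a given hour), assuming the usual wmf.webrequest partition layout; the partition values and the webrequest_source filter are illustrative assumptions:
```
# Hypothetical Hive query from a stat box; partition values are just an example.
hive -e "
  SELECT COUNT(*)
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2018 AND month = 1 AND day = 17 AND hour = 13
    AND uri_path = '/beacon/impression';
"
```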
[14:17:02] yes, I think so
[14:17:29] no wait tranquillity handles the realtime indexers on druid
[14:17:32] I checked, and there is a regular number of webrequests with uri_path = '/beacon/impression' during the hours of the alerts in hive webrequest
[14:17:40] oh ok
[14:18:07] (PS11) Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[14:18:13] so the overlord console is not showing any indexer
[14:18:17] for realtime data
[14:18:24] aha
[14:19:37] ok now I am going to brutally kill the spark job
[14:20:00] don't hurt yourself :]
[14:20:31] (CR) jerkins-bot: [V: -1] Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[14:21:12] !log forced kill of banner impression data streaming job to get it restarted
[14:21:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:21:43] now the cron on an1003 should figure out that nothing is there in ~5m and restart the job
[14:22:46] ok
[14:27:37] job respawned, no indexer on druid
[14:29:42] (PS12) Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[14:32:19] (CR) Fdans: "@Milimetric sorry, my package.json got bamboozled in one of the many rebases and of course I hadn't erased my node_modules so everything w" (3 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[14:34:32] (CR) jerkins-bot: [V: -1] Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[14:37:11] now spark+tranquillity handle the realtime indexing jobs
[14:38:21] via the indexing service (overlord+middlemanager+peons)
[14:38:42] (using its apis and zookeeper directly)
[14:38:58] the peons acting as real time indexers get data via POST
[14:39:03] and then druid indexes that data
[14:39:30] at the moment all the regular "batch" indexing jobs, that need to contact the overlord, are working
[14:40:01] so I am wondering if rebooting druid1002 caused some issues for tranquillity
[14:40:04] maybe in zookeeper?
[14:41:30] 2018-01-17T14:02:59,126 INFO io.druid.indexing.overlord.RemoteTaskRunner: Tried to delete status path[/druid/analytics-eqiad/indexer/status/druid1003.eqiad.wmnet:8091/index_realtime_banner_activity_minutely_2018-01-17T14:00:00.000Z_1_0] that didn't exist! Must've gone away already?
[14:41:55] maybe the middlemanager is acting weirdly?
[14:42:47] !log restart druid middlemanager on druid1003 as attempt to unblock realtime streaming
[14:42:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:45:22] Analytics, ChangeProp, EventBus, Reading-Infrastructure-Team-Backlog, and 3 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3906116 (mobrovac) >>! In T176126#3904165, @Ottomata wrote: > We've also got librdkafka 0.11 backported for Jessie in our apt repo no...
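A small sketch of how the "no realtime indexer" symptom above could be checked against the overlord HTTP API rather than the console; the host and port are assumptions (8090 is the usual overlord port), while the endpoints are standard Druid overlord API paths:
```
# Hypothetical check: list running indexing tasks (realtime banner_activity peons
# should show up here) and see which overlord currently thinks it is the leader.
curl -s http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/runningTasks | python -m json.tool
curl -s http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/leader
```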
[14:49:14] sorry elukey got distracted
[14:49:55] at least at the end of the day, the daily job will replace the missing data with good data
[14:50:04] Analytics, ChangeProp, EventBus, Reading-Infrastructure-Team-Backlog, and 3 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3906135 (Ottomata) I'd prefer to pin the version in puppet, then restrict it everywhere for SCB. If we fix up that patch to somethin...
[14:50:27] mforns: sure but it bothers me that everything is so fragile to reboots
[14:50:32] aha
[14:52:02] Analytics, ChangeProp, EventBus, Reading-Infrastructure-Team-Backlog, and 3 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3906136 (mobrovac) I would prefer having a stable and known-to-work version everywhere rather than confining the problem to SCB
[14:55:45] INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20171019110001
[14:55:51] oh yesssss
[14:56:45] Analytics-Kanban, DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3906141 (elukey)
[14:58:11] Analytics-Kanban, DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (elukey) The first run completed without any errors, and then another one (cleaning up only daily data) ran as well setting the following: ``` INFO: line 617: Update /var/run/eventloggi...
[14:58:29] \o/
[15:00:02] !!!!!!!!!!!!!!
[15:00:26] :D
[15:00:31] * elukey hugs mforns
[15:00:31] :'D
[15:00:36] :''''''''D
[15:00:46] * mforns hugs elukey
[15:00:50] xD
[15:01:45] elukey, gonna run grab some food and be back in a while
[15:02:20] sure!
[15:03:23] gooood next hour kicked in and we now have realtime indexers again
[15:53:19] mforns: I keep seeing Warning: Duplicate entry 'blablabla' for key etc.. in the eventlogging logs
[15:53:33] elukey, is it super frequent?
[15:53:45] not super frequent but it seems to be happening regularly
[15:53:52] looking
[15:54:34] I am checking now eventlogging_consumer-mysql-m4-master-00.log.1
[15:57:27] Analytics, TCB-Team: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111#3906321 (WMDE-Fisch)
[15:58:34] elukey, duplicates may happen when there are restarts
[15:58:50] but shouldn't happen theoretically otherwise
[15:59:36] I'm tailing eventlogging_consumer-mysql-m4-master-00.log for 5 minutes now and didn't see any duplicates
[16:01:09] mforns_brb: try to open it with say less
[16:01:30] there are some from this morning
[16:01:39] and afaict we haven't done anything at that time
[16:01:44] elukey, yes, maybe there was a restart this morning?
[16:01:48] oh ok
[16:02:54] maybe those align with kafka consumer group changes, but I believe we'd need to investigate
[16:03:08] not now, but maybe open a task?
[16:05:00] yes sure
[16:07:05] (CR) Ottomata: [C: 1] Make banner-actvity cleaner not fail when there's nothing to drop [analytics/refinery] - https://gerrit.wikimedia.org/r/404662 (https://phabricator.wikimedia.org/T185100) (owner: Mforns)
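A tiny sketch of how the duplicate-key warnings discussed above could be pulled out of the consumer logs to see whether they line up with restarts; the log directory on eventlog1001 is an assumption:
```
# Hypothetical: show the "Duplicate entry" warnings with their timestamps
# (log directory is an assumption, the file name is the one mentioned above).
grep "Duplicate entry" /srv/log/eventlogging/eventlogging_consumer-mysql-m4-master-00.log* | less
```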
[16:13:22] ottomata: when you add something to the blacklist, I was thinking we should re-check those schemas and data every once in a while to see if they've fixed the problems. Also, probably we should ping whoever's generating the data?
[16:16:15] milimetric: those job ones are special
[16:16:20] in that they aren't eventlogging schemas
[16:16:23] they are from mediawiki job queue
[16:16:25] so who knows
[16:16:35] they don't really have a schema
[16:17:00] for eventlogging analytics stuff, definitely.
[16:17:04] ok. Do the job queue / future job queue people know about this and the blacklist?
[16:17:28] like Giuseppe
[16:17:46] also, I think I'm gonna try calling him Juice from now on, see how that goes :)
[16:17:49] haha
[16:17:56] yes, but they've never looked at the refined data
[16:18:00] i keep meaning to give them a tour
[16:18:04] maybe this friday during ops hangouts...
[16:18:05] :)
[16:18:40] cool
[16:20:45] Analytics, TCB-Team: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111#3906399 (WMDE-Fisch)
[16:24:01] !log re-run all the pageview-druid-hourly failed jobs via Hue
[16:24:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:29:57] Analytics, TCB-Team: Where should keys used for stats in an extension be documented? - https://phabricator.wikimedia.org/T185111#3906421 (WMDE-Fisch)
[16:30:08] joal: ops sync wanna?
[16:44:20] heh, fdans is it a known problem that the countries are all wrong?
[16:44:37] like, it thinks Antarctica is Algeria and Australia is Bahamas
[16:45:43] what the hell
[16:46:07] milimetric: this will all be correct when we switch to ISO codes
[16:46:12] right, I figured
[16:46:16] but it's funny
[16:46:22] also, Argentina and a couple of others aren't right
[16:46:27] but oddly enough everything else seems ok
[16:47:55] milimetric: it does make me wonder if it's the shoddy provisional table I made (isoLookup.js) or if it's the map geometries
[16:48:12] if it's the former there's no problem
[16:49:08] fdans: there are some more failed tests too: https://integration.wikimedia.org/ci/job/analytics-wikistats2-npm-browser-node-6-docker/96/console
[16:49:19] oh, and it doesn't recognize Brazil!
[16:50:35] milimetric: yeah I'm investigating those, they aren't failing for me locally
[16:50:45] oh wait they are the merge tests
[16:50:49] ok that makes sense
[16:54:38] there are still jobs failing for druid, this time page view indexing
[16:54:42] checking it now
[16:59:32] :/
[17:01:37] ping fdans
[17:11:01] Analytics-Kanban, Patch-For-Review: Make banner-activity success file cleaner not fail when there's nothing to be cleaned - https://phabricator.wikimedia.org/T185100#3906566 (Reedy)
[17:17:34] hey folks. Did stat1005 get rebooted yesterday?
[17:17:45] I seem to have lost some work (or lost my mind)
[17:18:26] halfak: hey! This morning EU time, I've sent multiple emails to announce it
[17:18:45] (engineering@, analytics@ and personal emails to people holding a screen/tmux session)
[17:19:23] elukey, ahh I lost my email
[17:19:42] :(
[17:20:18] it was for the meltdown upgrades, needed to deploy the new kernel
[17:20:27] No worries. You did well. It's good to know when the reboot happened. That'll help me recover :)
[17:20:43] We should have a keyword to put in emails so I can set up a trigger to highlight the emails.
[17:20:51] Maybe [Reboot] or something like that.
[17:20:59] Or [Downtime]
[17:21:11] yeah I was about to ask what would be best to pop up in people's email lists
[17:21:42] I don't have a strong opinion. Any keyword I can flag is great ^_^
[17:21:56] Either way, thanks for the info and thanks for making sure to send that email
[17:22:28] It's my own fault I missed it and it doesn't matter that much. It's nice to know I'm not losing my mind -- or my screen sessions :D
[17:29:42] !log restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)
[17:29:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:30:17] halfak: it happens, if I can make the whole alert email spam more effective I am all for it
[17:30:45] so let's try to do this - next time, in the personal emails that I'll send for tmux/screen sessions, I'll put [Reboot]
[17:31:00] Awesome. I'll set up the filter with that keyword in brackets
[17:33:17] !log killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)
[17:33:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:52:23] nuria_: aqs per country vs wikistats looks mostly legit https://usercontent.irccloud-cdn.com/file/ZF2053kz/Screen%20Shot%202018-01-17%20at%2018.49.32.png
[18:04:54] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Audit users and account expiry dates for stat boxes - https://phabricator.wikimedia.org/T170878#3906791 (Ottomata) FYI: @Samwalton9 and @Samtar, your access expired on 2018-01-01 and your accounts have been removed. Thanks! :)...
[18:05:04] streaming data back
[18:05:43] ottomata: I had to restart all the druid overlords, they were "confused" about who was the leader
[18:06:07] will see tomorrow what explodes with druid1001
[18:06:29] * elukey off!
[18:08:01] huh that is weiiird
[18:08:03] and not good
[18:08:08] cmon druid you've been so wonderful!
[18:23:41] hey joal. did you see Gerard's question on wiki-research-l re clickstream?
[18:24:03] joal: I think the question boils down to, did what Ellery said at https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream#Not_found get done for this recurrent release?
[18:48:52] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3907003 (egalvezwmf)
[18:50:50] Analytics: Estimate how long a new Dashiki Layout for Qualtrics Survey data would take - https://phabricator.wikimedia.org/T184627#3890442 (egalvezwmf) @mcruzWMF Can you upload the screenshots? I was going to upload to commons but I wasn't sure if there are attributions. Thanks!
[19:00:23] (CR) Milimetric: Map component and Pageviews by Country metric (12 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[19:07:16] Analytics, Community-Tech: Data missing in page creation datasets - https://phabricator.wikimedia.org/T185019#3907074 (Nettrom) Open>Resolved a:Nettrom I checked the dashboard for enwiki and spot-checked a dataset, and the data appears to be in working order. Thanks for helping take care of t...
[19:09:49] thanks milimetric :)
[19:56:19] ottomata: where can i run a spark shell? anywhere? 1005 works after giving some errors
[19:56:43] nuria_: stat1005 or stat1004
[19:56:43] either
[19:56:47] the errors you see are probably normal
[19:56:52] about trying to pick a port
[19:56:58] it tries a few til it finds an open one
[19:57:25] ah i see, ok
[19:57:35] I have not used this in ages -> me hacker now
[20:29:00] (PS13) Fdans: Map component and Pageviews by Country metric [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529)
[20:29:42] (CR) Fdans: Map component and Pageviews by Country metric (10 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: Fdans)
[20:30:11] nuria_: can we make a joint goal with services to upgrade main kafka clusters next quarter, if all goes well and gets finished with jumbo this quarter?
[20:31:54] ottomata: sounds great
[20:32:52] yeehaw
[20:34:28] ottomata: added line 53 https://etherpad.wikimedia.org/p/analytics-goals
[20:36:14] haha nuria_ FYI, you've said this several times, the name of the big project is not event streams
[20:36:25] that's a different thing
[20:36:26] http://wikitech.wikimedia.org/wiki/EventStreams
[20:36:34] ottomata: yes, argh, so true
[20:36:58] Stream Data Platform is working name
[20:37:00] ya?
[20:37:06] we are talking about renaming
[20:37:07] it
[20:37:09] but that is it for now
[20:37:09] k
[20:37:48] yes
[20:37:51] yes
[20:38:00] gr8
[20:40:36] ottomata: for the spark shell to have access to algebird and refinery i have to pass it like --jar /srv/deployment/analytics/refinery/artifacts/refinery-core.jar
[20:41:43] ottomata: but where do i find third party deps (algebird) in the box, do i need to rsync them there?
[20:42:50] Analytics-Cluster, Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3907314 (Ottomata) p:Triage>Normal
[20:44:01] algebird?
[20:44:05] nuria_: what is depending on that?
[20:44:07] something you want to do?
[20:44:10] if so, then ya
[20:44:13] you just rsync or download them there
[20:44:33] if you want them productionized/deployed with refinery, we'll have to get them into archiva and then added to refinery/artifacts somewhere
[20:44:36] or find a .deb package
[20:45:18] or, if you are writing refinery-source code that will be deployed that depends on algebird, then you'll need to add the dependency in a pom.xml file
[20:47:03] Analytics-Cluster, Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3907347 (Ottomata)
[20:55:09] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#3907367 (Ottomata) Yeehaw, FYI, all Kafka clients have been ported from analytics to jumbo in deployment-prep in Cloud VPS. EventLogging was a breeze there. The...
[20:55:32] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#3907368 (Ottomata) If all is still well tomorrow, I will delete the analytics instances in deployment-prep.
[20:56:43] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3907372 (Ottomata)
[20:56:53] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3840663 (Ottomata) a:Ottomata
[21:00:03] Analytics, Discovery: Send Mediawiki Kafka logs to Kafka jumbo cluster with TLS encryption - https://phabricator.wikimedia.org/T126494#3907389 (Ottomata)
[21:08:41] Analytics-Cluster, Analytics-Kanban: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3907424 (Ottomata) @Jgreen FYI, we'll need to coordinate this soon :)
[22:44:00] Quarry: Quarry runs thousands times slower in last months - https://phabricator.wikimedia.org/T160188#3907835 (zhuyifei1999) Query 4835 works in around 6 seconds for me. What's wrong?
[22:52:22] Analytics: Create dashboard for interlanguage navigation stats - https://phabricator.wikimedia.org/T185156#3907880 (Milimetric)
[23:28:26] Quarry: Quarry runs thousands times slower in last months - https://phabricator.wikimedia.org/T160188#3908049 (IKhitron) See five top queries on my profile. They usually run 1-5 second. Now half of them run dozens of seconds.
[23:45:41] ottomata: i was getting some compilation errors but i guess we need to use spark2-shell? that seems to work best with the spark we have
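A small sketch of what the spark2-shell setup discussed above might look like on a stat box, assuming the algebird jar is fetched manually first; the proxy host and the algebird version are assumptions, while the refinery jar path is the one mentioned at 20:40:
```
# Hypothetical: fetch an algebird jar from Maven Central (proxy and version are
# assumptions), then start spark2-shell with both jars on the classpath.
# Note the Spark flag is --jars (plural), taking a comma-separated list.
https_proxy=http://webproxy.eqiad.wmnet:8080 \
  wget https://repo1.maven.org/maven2/com/twitter/algebird-core_2.11/0.13.0/algebird-core_2.11-0.13.0.jar

spark2-shell \
  --master yarn \
  --jars /srv/deployment/analytics/refinery/artifacts/refinery-core.jar,algebird-core_2.11-0.13.0.jar
```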