[07:19:53] * elukey is back from vacation
[07:19:58] * elukey looks for joal :)
[07:20:08] <elukey>	 hello team! :)
[08:20:52] <joal>	 Hi a-team :)
[08:21:06] <joal>	 elukey, my morning friend!! How are you ?
[08:23:06] <elukey>	 joal helloooooo
[08:23:18] <joal>	 HiiiiiIIIIIIIIiiii :)
[08:23:29] <joal>	 How were holidays elukey ?
[08:23:54] <elukey>	 Corsica is beautiful
[08:24:03] <joal>	 That's true :)
[08:24:46] <elukey>	 I hope to go back there with a motorbike
[08:25:13] <elukey>	 how's the new family going??
[08:25:38] <joal>	 You're right, motorbike would be very nice over there :)
[08:27:21] <joal>	 Family is doing great, with still some lack of sleep, but great :)
[08:31:25] <elukey>	 :)
[08:52:05] <joal>	 elukey: I'm proud of myself, I managed to break webrequest in less than three days after coming back from holidays ;)
[09:01:49] <elukey>	 joal: I've read the mail! What happened?
[09:02:45] <joal>	 elukey: I deployed a change I made before going on holidays without triple-testing it
[09:02:57] <joal>	 elukey: I didn't notice, or even think, that stuff would break
[09:03:15] <joal>	 elukey: When changing a struct in hive, only ADD fields
[09:03:49] <joal>	 elukey: and only add those fields at the end of the struct
[09:04:16] <joal>	 When deploying, I updated the normalized_host struct, adding a field in the middle of the struct
[09:04:26] <joal>	 New data was working fine, but old data would break
[09:05:04] <joal>	 It's because Hive relies on field ordering (indices) for structs - therefore modifying the order breaks old/new compatibility
[09:06:03] <elukey>	 wow
[09:06:36] <joal>	 not nice
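The point above about structs being resolved by field position can be illustrated with a small toy model. This is only a sketch, not Hive code: `read_struct` and the field names are hypothetical (the log does not list the real fields of normalized_host), and real Hive may fail outright rather than silently shift values.

```python
# Toy model only: NOT Hive code, just positional decoding of stored values.
def read_struct(stored_values, schema_fields):
    """Map stored values onto field names purely by position (index)."""
    # Fields that older data does not contain are read back as None.
    missing = len(schema_fields) - len(stored_values)
    padded = list(stored_values) + [None] * missing
    return dict(zip(schema_fields, padded))


# Hypothetical field names, chosen for illustration only.
old_row = ["wikipedia", "en", "org"]  # written with the old 3-field struct

# Variant 1: new field inserted in the middle of the struct.
middle_schema = ["project_class", "new_field", "project", "tld"]
# Variant 2: new field appended at the end of the struct.
end_schema = ["project_class", "project", "tld", "new_field"]

broken = read_struct(old_row, middle_schema)
safe = read_struct(old_row, end_schema)

print(broken)  # old values land under the wrong field names
print(safe)    # old values keep their names, new field is empty
```

With the field in the middle, every value after the insertion point shifts by one position for old data; with the field at the end, old data still lines up and only the new field is empty.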
[09:07:12] <joal>	 something else to remember when fixing stuff about hive schema
[09:07:33] <joal>	 When overwriting a partition in a table, its schema is not updated
[09:07:37] <joal>	 elukey: --^
[09:08:04] <joal>	 When we fixed the thing with andrew, we reran jobs with the field moved to the end of the struct
[09:08:33] <joal>	 But the table partitions had been created with the schema of the field-in-the-middle struct
[09:08:41] <joal>	 Meeeh
[09:09:01] <joal>	 So we had to manually drop / recreate the partitions in order for hive to apply the correct schema
[09:09:08] <joal>	 Yay !
[09:09:16] <joal>	 What a wonderful first deploy :)
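The partition behaviour described above (overwriting a partition keeps the schema it was created with, so a drop / recreate was needed) can be sketched with a toy in-memory model. Everything here is hypothetical: `ToyTable`, its methods, the partition name and the field names are illustrations, not Hive's metastore API.

```python
# Toy model only: a hypothetical in-memory "metastore", not Hive's API.
class ToyTable:
    def __init__(self, schema):
        self.schema = list(schema)
        self.partitions = {}  # partition name -> {"schema": [...], "rows": [...]}

    def add_partition(self, name, rows):
        # A new partition takes a snapshot of the table schema at creation time.
        self.partitions[name] = {"schema": list(self.schema), "rows": rows}

    def overwrite_partition(self, name, rows):
        # Overwriting replaces the data but leaves the partition schema untouched.
        self.partitions[name]["rows"] = rows

    def drop_partition(self, name):
        del self.partitions[name]


# Table deployed with the new field in the middle (the broken layout).
table = ToyTable(["a", "new_field", "b"])
table.add_partition("hour=1", rows=[["x", "n", "y"]])

# Fix the table: move the new field to the end, then rerun the job.
table.schema = ["a", "b", "new_field"]
table.overwrite_partition("hour=1", rows=[["x", "y", "n"]])
stale = table.partitions["hour=1"]["schema"]

# Only dropping and recreating the partition picks up the fixed schema.
table.drop_partition("hour=1")
table.add_partition("hour=1", rows=[["x", "y", "n"]])
fixed = table.partitions["hour=1"]["schema"]

print(stale)
print(fixed)
```

In this model the rerun alone leaves the partition reading fixed data through the old layout, which matches the manual drop / recreate step mentioned in the conversation.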
[09:10:00] <elukey>	 argh :(
[09:10:04] <elukey>	 how long did it take?
[09:10:50] <joal>	 elukey: by chance bearloga noticed the error after a few hours
[09:11:15] <joal>	 Once noticed, it took us about one hour to troubleshoot / fix
[09:11:21] <joal>	 Then we needed to rerun
[09:11:51] <joal>	 hm, I say one hour, it's probably 2 or 3 :)
[09:14:05] <elukey>	 two hours of Joseph+Andrew would mean a day for me and somebody else, if not more :D
[09:15:54] <joal>	 :) I'm not sure of that :)
[09:16:22] <elukey>	 joal: did you see the Kafka timeout issue?
[09:16:32] <elukey>	 that one was a sneaky bastard
[09:17:23] <joal>	 elukey: Nope, I have not - I have noticed alarm emails from burrow but didn't follow what was going on
[09:17:26] <joal>	 elukey: 
[09:17:35] <joal>	 elukey --verbose?
[09:21:35] <elukey>	 joal: https://phabricator.wikimedia.org/T172681
[09:38:48] <joal>	 wow elukey - good work
[09:39:15] <joal>	 elukey: Kafka is difficult given how many systems talk to it
[09:39:31] <joal>	 Ah elukey, I really missed our morning talks :)
[09:41:31] <elukey>	 me too! 
[09:41:48] <elukey>	 it was a nightmare, and in the end it was an unknown host doing a DOS basically
[09:42:04] <elukey>	 this made me think that we don't know exactly what the producers/consumers of kafka are
[09:42:13] <elukey>	 and the logs are not that helpful in this regard
[09:43:00] <elukey>	 ah completely unrelated note, we need to restart all the java daemons for updates :)
[09:43:12] <joal>	 Yay, a java update :)
[09:43:54] <joal>	 elukey: It'd be great to have a page listing producers / consumers for kafka in our System folder, don't you think ?
[09:48:02] <elukey>	 joal: definitely, the only way that allowed me to find the offending host/vm was to enable network trace logs on one of the brokers :(
[09:48:11] <joal>	 right
[09:48:14] <joal>	 not cool
[09:48:23] <elukey>	 even better would be to tune network logs in kafka to list ips connecting
[09:48:29] <elukey>	 so in case of fire you just check there
[09:51:04] <joal>	 hm
[09:51:14] <joal>	 elukey: should not prevent us from documenting :)
[09:55:52] <elukey>	 agreed, but from the ops perspective docs that aren't up to date as of 10 seconds ago don't help when a fire starts :)
[09:56:07] <joal>	 right
[10:34:12] <elukey>	 !log restart yarn and hdfs on analytics1030 for jvm updates (canary)
[10:34:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:30] <joal>	 elukey: I'll take care of oozie jobs during the restart (if any)
[10:36:48] <elukey>	 joal: super, I am planning to do it this afternoon if an1030 is fine.. yarn is not a concern with the seamless restart, hdfs might give us some oozie noise :(
[10:55:42] * elukey lunch!
[11:03:14] <wikibugs_>	 (PS5) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[11:03:16] <wikibugs_>	 (PS3) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030
[11:32:13] <wikibugs_>	 Analytics, DBA, Data-Services, Research, cloud-services-team (Kanban): Implement technical details and process for "datasets_p" on wikireplica hosts - https://phabricator.wikimedia.org/T173511#3558277 (jcrespo) Duplicate of T156869? We should fix it for all users, not only datasets_p?
[12:36:16] <wikibugs_>	 (CR) Ottomata: [C: 1] Clear connections between report executions [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[12:37:06] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Clear connections between report executions [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[12:42:39] <elukey>	 ottomata: o/
[12:42:41] <mforns>	 hellooooo
[12:42:45] <elukey>	 mforns: o/
[12:42:48] <ottomata>	 hiiiii
[12:43:05] <mforns>	 \o/
[12:54:58] <elukey>	 the "HDFS capacity used percentage" seems to be working fine, there is a warning in Icinga that HDFS used space is 70%
[12:55:13] <elukey>	 the critical is at 80%, maybe I can +10 both
[12:55:25] <elukey>	 in any case, wouldn't it make sense to delete some old data?
[12:55:31] <elukey>	 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=25&fullscreen&orgId=1&from=now-30d&to=now
[13:28:39] <wikibugs_>	 (CR) Mforns: [V: 2 C: 2] "Self-merging after +1 to unbreak production." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[13:31:20] <wikibugs_>	 Analytics, Discovery-Analysis: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#3558525 (GoranSMilovanovic) @mpopov I've tried out two different approaches to connect to Spark from {sparklyr} on stat1005, including yours, and failed. Please take a look and let me know if yo...
[13:35:35] <elukey>	 mforns: any news with eventlogging?
[13:36:06] <mforns>	 elukey, no......... so sorry, I've been the whole week fighting with Wikimetrics :'(
[13:36:21] <mforns>	 elukey, If you want, we can pair today after standup?
[13:36:34] <mforns>	 to alter/write
[13:36:36] <mforns>	 tests
[13:38:07] <elukey>	 mforns: I have the ops meeting, buuut what do you think about tomorrow??
[13:38:34] <mforns>	 elukey, sure! then maybe I can start today, and we kill it tomorrow
[13:38:51] <mforns>	 tomorrow I will start the day earlier
[13:39:07] <mforns>	 about midday
[13:53:02] <elukey>	 super
[13:57:41] <elukey>	 !log restart kafka* on kafka1012 for openjdk security updates (canary) 
[13:57:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:13:02] <elukey>	 mforns: ok to drop PageContentSaveComplete_5588433_15423246 from m4-master?
[14:13:16] <elukey>	 to complete https://phabricator.wikimedia.org/T170720
[14:15:09] <elukey>	 we could also drop MobileWebUIClickTracking_10742159_15423246 after some sanity checking (only from slaves)
[14:18:41] <mforns>	 elukey, reading
[14:20:15] <mforns>	 elukey, I think it's safe to remove PageContentSaveComplete from m4-master :]
[14:27:57] <mforns>	 elukey, re. MobileWebUIClickTracking, I checked that the data in hive looks good. So I think we can go ahead and drop
[14:28:35] <mforns>	 the last comment in this ticket is also confirming: 
[14:28:36] <mforns>	 https://phabricator.wikimedia.org/T172322
[14:28:56] <elukey>	 yep :)
[14:32:06] <elukey>	 ok dropped PageContentSaveComplete
[14:33:53] <mforns>	 \\\o///
[14:40:45] <elukey>	 eventlogging_sync on dbstore1002 complained
[14:40:54] <elukey>	 had to restart it, thing to remember for the next drops
[14:42:05] <mforns>	 aha
[14:42:57] <elukey>	 ok mforns I'll drop MobileWebUIClickTracking_10742159_15423246 tomorrow after lunch when you're online, ok?
[14:46:01] <wikibugs_>	 Analytics-Kanban, Patch-For-Review: Troubleshoot Wikimetrics "magic button" - https://phabricator.wikimedia.org/T173585#3558751 (mforns) It's deployed to production. We should not see the same problem. Please, reopen the task otherwise!
[15:00:28] <ottomata>	 joal:  yoohoo
[15:00:31] <joal>	 taoops
[15:16:12] <wikibugs_>	 Analytics-Kanban, Discovery, Discovery-Analysis (Current work), Patch-For-Review: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3558833 (JAllemandou)
[15:18:38] <wikibugs_>	 Analytics-Kanban, Discovery, Discovery-Analysis (Current work), Patch-For-Review: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3558835 (JAllemandou) a:Ottomata
[15:25:06] <wikibugs_>	 Analytics-Kanban: Meta-statistics on MediaWiki history reconstruction process - https://phabricator.wikimedia.org/T155507#3558844 (JAllemandou)
[15:28:35] <wikibugs_>	 Analytics, DBA, Data-Services, Research, cloud-services-team (Kanban): Implement technical details and process for "datasets_p" on wikireplica hosts - https://phabricator.wikimedia.org/T173511#3558849 (bd808) >>! In T173511#3558277, @jcrespo wrote: > Duplicate of T156869? We should fix it for...
[15:34:17] <wikibugs_>	 Analytics: Reinstate a subset of reports removed from the reportcard until WikiStats 2.0 is back - https://phabricator.wikimedia.org/T166679#3558854 (JAllemandou) Open>declined
[15:35:20] <wikibugs_>	 Analytics: Reinstate a subset of reports removed from the reportcard until WikiStats 2.0 is back - https://phabricator.wikimedia.org/T166679#3304049 (JAllemandou) It makes a long time this task has been open. Wikistats is not far from ready - We've decided to put effort into implementing those important metr...
[15:40:58] <wikibugs_>	 Analytics-Cluster, Analytics-Kanban: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337#3558916 (JAllemandou)
[15:46:05] <wikibugs_>	 Analytics: Weird performance of sqoop job on Edit Reconstruction - https://phabricator.wikimedia.org/T172579#3502684 (JAllemandou) Could be due to hadoop overhead. To investigate.
[15:52:22] <wikibugs_>	 Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Try to make tranquility work with Spark - https://phabricator.wikimedia.org/T168550#3558974 (JAllemandou) a:JAllemandou
[15:55:47] <wikibugs_>	 Analytics, Performance-Team: Explore NavigationTiming by faceted properties - EventLogging refine - https://phabricator.wikimedia.org/T166414#3558988 (JAllemandou)
[15:55:49] <wikibugs_>	 Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610#3558989 (JAllemandou)
[16:59:09] <wikibugs_>	 Analytics-Kanban, Discovery, Discovery-Analysis: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3559341 (mpopov)
[17:23:26] <dsaez>	 Hi all!
[17:23:26] <dsaez>	 quick question, does anyone know whether there is an (enwiki) dump mirror on the stats machines?
[17:23:26] <dsaez>	 I was downloading the dumps, but maybe it doesn't make sense that everyone has a local copy on their own account
[17:28:08] <ottomata>	 joal:  i liked that article :)
[17:28:13] <ottomata>	 geb is a favorite book of mine :)
[17:28:51] <ottomata>	 dsaez:  there is an nfs mount on the stat boxes
[17:28:56] <ottomata>	 i cannot vouch for its reliability :)
[17:29:11] <dsaez>	 cool! I'll check
[17:29:14] <ottomata>	  /mnt/data/xmldatadumps/public
[17:31:06] <dsaez>	 ottomata: looks good, thanks. Let's see how fast this nfs mount is :)
[17:34:02] <joal>	 ottomata: I don't know Geb
[17:34:23] <joal>	 Ahhhh ! Goedel Escher Bach :) Got it :)
[17:40:43] <elukey>	 nice book :)
[17:40:48] * elukey goes offline!
[18:35:42] <wikibugs_>	 Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Investigate use-cases for delayed job executions - https://phabricator.wikimedia.org/T172832#3559604 (GWicke) >>! In T172832#3540031, @Mattflaschen-WMF wrote: > There are three considerations relevant to Echo: > 1. Delayed notific...
[18:45:22] <wikibugs_>	 Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3559643 (mforns) @leila  Sorry for the inactivity period after having asked you for a quick response. The purging script has been very slow the last weeks (it w...
[18:48:09] <wikibugs_>	 (PS4) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030
[18:54:17] <wikibugs_>	 Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (Ottomata) Ya, pretty sure this will need `analytics-privatedata-users`.  I'm on clinic duty now, this has already been appr...
[18:57:27] <joal>	 ottomata: not yet a full success with spark 2.1.1, but at least the thing runs
[18:58:03] <ottomata>	 oh ya?
[18:58:12] <ottomata>	 somehow around dep issues because spark 2.1.1 includes different jars?
[18:59:36] <joal>	 I guess so - I changed many things, so it's difficult to be sure
[19:00:30] <joal>	 jackson in spark 2.1.1 is 2.6.5, above the one needed by tranquility
[19:00:37] <ottomata>	 hm
[19:01:57] <joal>	 ottomata: it's one more reason to add to the pile for upgrading ;)
[19:05:10] <joal>	 ottomata: now something else to find: Why do I have no indexation going on in druid :)
[19:07:48] <ottomata>	 hha
[19:08:20] <joal>	 so as originally said: partially fixed ;)
[19:35:07] <wikibugs_>	 Analytics: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#3560006 (mforns)
[19:36:28] <wikibugs_>	 Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3560021 (mforns) Here's the diff of the new change. We'll merge it in short. https://gerrit.wikimedia.org/r/#/c/368769/1..2/modules/role/files/mariadb/eventlogg...
[19:50:54] <ottomata>	 mforns:  got a few minutes for a brain bounce?
[20:13:28] <ottomata>	 holy mooolyyyyyy
[20:13:29] <ottomata>	 https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
[20:30:50] <joal>	 ottomata: Man, confluent is ZE place to be ;)
[20:31:24] <ottomata>	 haha
[20:42:02] <wikibugs_>	 Analytics, Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3560280 (Shilad)
[20:42:07] <wikibugs_>	 Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3560278 (Shilad) Open>Resolved Everything looks good now! Thanks for your quick help, @Ottomata! I'm going to close this tic...
[22:02:35] <wikibugs_>	 Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3560562 (leila) Thanks @mforns.