[07:19:53] * elukey is back from vacation
[07:19:58] * elukey looks for joal :)
[07:20:08] hello team! :)
[08:20:52] Hi a-team :)
[08:21:06] elukey, my morning friend!! How are you ?
[08:23:06] joal helloooooo
[08:23:18] HiiiiiIIIIIIIIiiii :)
[08:23:29] How were holidays elukey ?
[08:23:54] Corsica is beautiful
[08:24:03] That's true :)
[08:24:46] I hope I'll go back there with a motorbike
[08:25:13] how's the new family going??
[08:25:38] You're right, motorbike would be very nice over there :)
[08:27:21] Family is doing great, with still some lack of sleep, but great :)
[08:31:25] :)
[08:52:05] elukey: I'm proud of me, I managed to break webrequest in less than three days when coming back from holidays ;)
[09:01:49] joal: I've read the mail! What happened?
[09:02:45] elukey: I deployed a change I did before going to holidays without triple testing
[09:02:57] elukey: I didn't notice nor think that stuff would break
[09:03:15] elukey: When changing a struct in hive, only ADD fields
[09:03:49] elukey: and only add those fields at the end of the struct
[09:04:16] When deploying, I updated the normalized_host struct, adding a field in the middle of the struct
[09:04:26] New data was working fine, but old would break
[09:05:04] It's because hive relies on field ordering (indices) for structs - therefore modifying the order breaks old/new compatibility
[09:06:03] wow
[09:06:36] not nice
[09:07:12] something else to remember when fixing stuff about hive schema
[09:07:33] When overwriting a partition in a table, its schema is not updated
[09:07:37] elukey: --^
[09:08:04] When we fixed the thing with andrew, we reran jobs with the field moved to the end of the struct
[09:08:33] But the table partitions had been created with the schema of the field-in-the-middle struct
[09:08:41] Meeeh
[09:09:01] So we had to manually drop / recreate the partitions in order for hive to apply the correct schema
[09:09:08] Yay !
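A minimal sketch of the two pitfalls just described, assuming a toy model in which struct values are matched to field names purely by position and each partition keeps its own snapshot of the schema. This is not Hive code: `read_struct`, `Table`, and the field names `project`, `tld`, and `family` are hypothetical and only illustrate the behaviour reported in the conversation.

```python
from typing import Any, Dict, List, Optional


def read_struct(values: List[Any], fields: List[str]) -> Dict[str, Optional[Any]]:
    """Pair stored values with schema field names purely by position."""
    return {
        name: values[i] if i < len(values) else None
        for i, name in enumerate(fields)
    }


class Table:
    """Toy table whose partitions each keep their own copy of the schema."""

    def __init__(self, fields: List[str]) -> None:
        self.fields = list(fields)
        self.partitions: Dict[str, Dict[str, Any]] = {}

    def add_partition(self, key: str, values: List[Any]) -> None:
        # The partition snapshots the table schema at creation time.
        self.partitions[key] = {"fields": list(self.fields), "values": values}

    def overwrite_partition(self, key: str, values: List[Any]) -> None:
        # Only the data is replaced; the partition schema stays as it was.
        self.partitions[key]["values"] = values

    def drop_partition(self, key: str) -> None:
        del self.partitions[key]

    def read(self, key: str) -> Dict[str, Optional[Any]]:
        part = self.partitions[key]
        return read_struct(part["values"], part["fields"])


# Pitfall 1: old data was written as (project, tld).
old_values = ["en", "org"]
appended = ["project", "tld", "family"]   # new field at the end
inserted = ["project", "family", "tld"]   # new field in the middle

appended_view = read_struct(old_values, appended)
inserted_view = read_struct(old_values, inserted)
print(appended_view)  # old fields keep their meaning, new field is empty
print(inserted_view)  # "org" is now read as family, tld is lost

# Pitfall 2: a partition created under the bad schema keeps that schema
# even after the table is fixed and the data is rewritten.
table = Table(inserted)
table.add_partition("hour=9", ["en", "wikipedia", "org"])
table.fields = list(appended)                       # schema fixed on the table
table.overwrite_partition("hour=9", ["en", "org", "wikipedia"])
stale_view = table.read("hour=9")                   # still read with the old order

table.drop_partition("hour=9")
table.add_partition("hour=9", ["en", "org", "wikipedia"])
fresh_view = table.read("hour=9")                   # read with the fixed order
print(stale_view)
print(fresh_view)
```

In this model, appending a field leaves old rows readable, while inserting one shifts every later value onto the wrong name, and a rewritten partition only picks up the fixed schema after being dropped and recreated.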
[09:09:16] What a wonderful first deploy :)
[09:10:00] argh :(
[09:10:04] how long did it take?
[09:10:50] elukey: by chance bearloga noticed the error after a few hours
[09:11:15] Once noticed, it took us about one hour to troubleshoot / fix
[09:11:21] Then we needed to rerun
[09:11:51] hm, I say one hour, it's probably 2 or 3 :)
[09:14:05] two hours of Joseph+Andrew would mean a day for me and somebody else, if not more :D
[09:15:54] :) I'm not sure of that :)
[09:16:22] joal: did you see the Kafka timeout issue?
[09:16:32] that one was a sneaky bastard
[09:17:23] elukey: Nope, I have not - I have noticed alarm emails from burrow but didn't follow what was going on
[09:17:26] elukey:
[09:17:35] elukey --verbose?
[09:21:35] joal: https://phabricator.wikimedia.org/T172681
[09:38:48] wow elukey - good work
[09:39:15] elukey: Kafka is difficult given how many systems talk to it
[09:39:31] Ah elukey, I really missed our morning talks :)
[09:41:31] me too!
[09:41:48] it was a nightmare, and in the end it was an unknown host doing a DOS basically
[09:42:04] this made me think that we don't know exactly what the producers/consumers of kafka are
[09:42:13] and the logs are not that helpful in this regard
[09:43:00] ah completely unrelated note, we need to restart all the java daemons for updates :)
[09:43:12] Yay, a java update :)
[09:43:54] elukey: It'd be great to have a page listing producers / consumers for kafka in our System folder, don't you think ?
[09:48:02] joal: definitely, the only way that allowed me to find the offending host/vm was to enable network trace logs on one of the brokers :(
[09:48:11] right
[09:48:14] not cool
[09:48:23] even better would be to tune network logs in kafka to list ips connecting
[09:48:29] so in case of fire you just check there
[09:51:04] hm
[09:51:14] elukey: should not prevent us from documenting :)
[09:55:52] agreed, but from the ops perspective docs that are not updated to 10 seconds ago don't help when a fire starts :)
[09:56:07] right
[10:34:12] !log restart yarn and hdfs on analytics1030 for jvm updates (canary)
[10:34:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:30] elukey: I'll take care of oozie jobs during the restart (if any)
[10:36:48] joal: super, I am planning to do it this afternoon if an1030 will be fine.. yarn is not a concern with the seamless restart, hdfs might give us some oozie noise :(
[10:55:42] * elukey lunch!
[11:03:14] (PS5) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[11:03:16] (PS3) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030
[11:32:13] Analytics, DBA, Data-Services, Research, cloud-services-team (Kanban): Implement technical details and process for "datasets_p" on wikireplica hosts - https://phabricator.wikimedia.org/T173511#3558277 (jcrespo) Duplicate of T156869? We should fix it for all users, not only datasets_p?
[12:36:16] (CR) Ottomata: [C: 1] Clear connections between report executions [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[12:37:06] (CR) jerkins-bot: [V: -1] Clear connections between report executions [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[12:42:39] ottomata: o/
[12:42:41] hellooooo
[12:42:45] mforns: o/
[12:42:48] hiiiii
[12:43:05] \o/
[12:54:58] the "HDFS capacity used percentage" seems to be working fine, there is a warning in Icinga that HDFS used space is 70%
[12:55:13] the critical is at 80%, maybe I can +10 both
[12:55:25] in any case, wouldn't it make sense to delete some old data?
[12:55:31] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=25&fullscreen&orgId=1&from=now-30d&to=now
[13:28:39] (CR) Mforns: [V: 2 C: 2] "Self-merging after +1 to unbreak production." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/373967 (https://phabricator.wikimedia.org/T173585) (owner: Mforns)
[13:31:20] Analytics, Discovery-Analysis: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#3558525 (GoranSMilovanovic) @mpopov I've tried out two different approaches to connect to Spark from {sparklyr} on stat1005, including yours, and failed. Please take a look and let me know if yo...
[13:35:35] mforns: any news with eventlogging?
[13:36:06] elukey, no......... so sorry, I've been the whole week fighting with Wikimetrics :'(
[13:36:21] elukey, If you want, we can pair today after standup?
[13:36:34] to alter/write
[13:36:36] tests
[13:38:07] mforns: I have the ops meeting, buuut what do you think about tomorrow??
[13:38:34] elukey, sure! then I maybe can start today, and we kill it tomorrow
[13:38:51] tomorrow I will start the day earlier
[13:39:07] about midday
[13:53:02] super
[13:57:41] !log restart kafka* on kafka1012 for openjdk security updates (canary)
[13:57:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:13:02] mforns: ok to drop PageContentSaveComplete_5588433_15423246 from m4-master?
[14:13:16] to complete https://phabricator.wikimedia.org/T170720
[14:15:09] we could also drop MobileWebUIClickTracking_10742159_15423246 after some sanity checking (only from slaves)
[14:18:41] elukey, reading
[14:20:15] elukey, I think it's safe to remove PageContentSaveComplete from m4-master :]
[14:27:57] elukey, re. MobileWebUIClickTracking, I checked that the data in hive looks good. So I think we can go ahead and drop
[14:28:35] the last comment in this ticket is also confirming:
[14:28:36] https://phabricator.wikimedia.org/T172322
[14:28:56] yep :)
[14:32:06] ok dropped PageContentSaveComplete
[14:33:53] \\\o///
[14:40:45] eventlogging_sync on dbstore1002 complained
[14:40:54] had to restart it, thing to remember for the next drops
[14:42:05] aha
[14:42:57] ok mforns I'll drop MobileWebUIClickTracking_10742159_15423246 tomorrow after lunch when you'll be online ok?
[14:46:01] Analytics-Kanban, Patch-For-Review: Troubleshoot Wikimetrics "magic button" - https://phabricator.wikimedia.org/T173585#3558751 (mforns) It's deployed to production. We should not see the same problem. Please, reopen the task otherwise!
[15:00:28] joal: yoohoo
[15:00:31] taoops
[15:16:12] Analytics-Kanban, Discovery, Discovery-Analysis (Current work), Patch-For-Review: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3558833 (JAllemandou)
[15:18:38] Analytics-Kanban, Discovery, Discovery-Analysis (Current work), Patch-For-Review: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3558835 (JAllemandou) a:Ottomata
[15:25:06] Analytics-Kanban: Meta-statistics on MediaWiki history reconstruction process - https://phabricator.wikimedia.org/T155507#3558844 (JAllemandou)
[15:28:35] Analytics, DBA, Data-Services, Research, cloud-services-team (Kanban): Implement technical details and process for "datasets_p" on wikireplica hosts - https://phabricator.wikimedia.org/T173511#3558849 (bd808) >>! In T173511#3558277, @jcrespo wrote: > Duplicate of T156869? We should fix it for...
[15:34:17] Analytics: Reinstate a subset of reports removed from the reportcard until WikiStats 2.0 is back - https://phabricator.wikimedia.org/T166679#3558854 (JAllemandou) Open>declined
[15:35:20] Analytics: Reinstate a subset of reports removed from the reportcard until WikiStats 2.0 is back - https://phabricator.wikimedia.org/T166679#3304049 (JAllemandou) This task has been open for a long time. Wikistats is not far from ready - We've decided to put effort into implementing those important metr...
[15:40:58] Analytics-Cluster, Analytics-Kanban: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337#3558916 (JAllemandou)
[15:46:05] Analytics: Weird performance of sqoop job on Edit Reconstruction - https://phabricator.wikimedia.org/T172579#3502684 (JAllemandou) Could be due to hadoop overhead. To investigate.
[15:52:22] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Try to make tranquility work with Spark - https://phabricator.wikimedia.org/T168550#3558974 (JAllemandou) a:JAllemandou
[15:55:47] Analytics, Performance-Team: Explore NavigationTiming by faceted properties - EventLogging refine - https://phabricator.wikimedia.org/T166414#3558988 (JAllemandou)
[15:55:49] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610#3558989 (JAllemandou)
[16:59:09] Analytics-Kanban, Discovery, Discovery-Analysis: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3559341 (mpopov)
[17:23:26] Hi all!
[17:23:26] quick question, does anyone know whether there is an (enwiki) dump mirror on the stats machines?
[17:23:26] I was downloading the dumps, but maybe it doesn't make sense that everyone has a local copy on their own account
[17:28:08] joal: i liked that article :)
[17:28:13] geb is a favorite book of mine :)
[17:28:51] dsaez: there is an nfs mount on the stat boxes
[17:28:56] i cannot vouch for its reliability :)
[17:29:11] cool! I'll check
[17:29:14] /mnt/data/xmldatadumps/public
[17:31:06] ottomata: looks good, thanks. Let's see how fast this nfs mount is :)
[17:34:02] ottomata: I don't know Geb
[17:34:23] Ahhhh ! Goedel Escher Bach :) Got it :)
[17:40:43] nice book :)
[17:40:48] * elukey goes offline!
[18:35:42] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Investigate use-cases for delayed job executions - https://phabricator.wikimedia.org/T172832#3559604 (GWicke) >>! In T172832#3540031, @Mattflaschen-WMF wrote: > There are three considerations relevant to Echo: > 1. Delayed notific...
[18:45:22] Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3559643 (mforns) @leila Sorry for the inactivity period after having asked you for a quick response. The purging script has been very slow the last weeks (it w...
[18:48:09] (PS4) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030
[18:54:17] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (Ottomata) Ya, pretty sure this will need `analytics-privatedata-users`. I'm on clinic duty now, this has already been appr...
[18:57:27] ottomata: not yet a full success with spark 2.1.1, but at least the thing runs
[18:58:03] oh ya?
[18:58:12] somehow around dep issues because spark 2.1.1 includes different jars?
[18:59:36] I guess so - I changed many things, so it's difficult to be sure
[19:00:30] jackson in spark 2.1.1 is 2.6.5, above the one needed by tranquility
[19:00:37] hm
[19:01:57] ottomata: it's one more reason to add to the pile for upgrading ;)
[19:05:10] ottomata: now something else to find: Why do I have no indexation going on in druid :)
[19:07:48] haha
[19:08:20] so as originally said: partially fixed ;)
[19:35:07] Analytics: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#3560006 (mforns)
[19:36:28] Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3560021 (mforns) Here's the diff of the new change. We'll merge it shortly. https://gerrit.wikimedia.org/r/#/c/368769/1..2/modules/role/files/mariadb/eventlogg...
[19:50:54] mforns: got a few minutes for a brain bounce?
[20:13:28] holy mooolyyyyyy
[20:13:29] https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
[20:30:50] ottomata: Man, confluent is ZE place to be ;)
[20:31:24] haha
[20:42:02] Analytics, Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3560280 (Shilad)
[20:42:07] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3560278 (Shilad) Open>Resolved Everything looks good now! Thanks for your quick help, @Ottomata! I'm going to close this tic...
[22:02:35] Analytics-Kanban, Research, Patch-For-Review: Add QuickSurvey schemas to EventLogging white-list - https://phabricator.wikimedia.org/T172112#3560562 (leila) Thanks @mforns.