[00:01:51] (CR) Nuria: "Capitalization now looks good, still, on the "AllMetrics" page there are no "wikistats1" metrics what makes the message box a bit confusin" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[00:03:41] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Beta: Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") - https://phabricator.wikimedia.org/T187806 (Nuria) {F28490083} See screenshot, info box looks a bit strange cause there...
[00:04:10] (CR) Nuria: "Added snapshot to phab ticket https://phabricator.wikimedia.org/T187806" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[02:46:44] nuria: the time selector change depends on this refactor being merged:
[02:46:45] https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/498002/
[02:47:09] I think it's better to merge this and then send the time selector change, to avoid conflicts
[03:33:32] (PS4) Fdans: Create metrics matrix component [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806)
[06:17:43] Analytics, ChangeProp, Community-Tech, EventBus, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Joe) FTR, I don't think kubernetes is an option for running mediawiki jobs for now. Because we're still deployi...
[06:56:15] morning!
[07:03:09] helloooo elukey
[07:04:14] hey! what's up my friends under the sun!
[07:05:04] I'm just finishing up my tech com meeting, see yall when I wake up :)
[07:06:08] btw, fdans / elukey: I pushed the change to wikistats so fdans you can make any changes you want to either the templates (and attempt running the perl) or to the generated docs themselves
[07:06:22] but the repository is clean now, has the latest
[07:07:36] milimetric: since you're here, look! :D
[07:07:45] https://usercontent.irccloud-cdn.com/file/fD32kkbq/phone.gif
[07:08:05] it finally works with touch dragging :D
[07:08:13] psh, showoff
[07:08:13] :)
[07:08:18] that looks really nice
[07:08:21] lol, i know right
[07:08:31] (about the showoff thing)
[07:08:39] hm... maybe make that calendar a little bit bigger?
[07:08:48] yeah probably
[07:08:58] sorry to nitpick
[07:09:08] nono, that's why I'm showing ya
[07:09:10] it just feels like too nice a feature to not make more prominent
[07:09:14] other than showing off, that is
[07:09:17] :)
[07:09:42] ok, gonna go sleep now
[07:10:07] milimetric: see you later!
[07:10:13] * fdans lives in tomorrow
[07:10:37] what the hell, it's 3am there
[07:14:24] milimetric: o/ - shall I re-enable the drop job for geowiki? I don't recall if I needed to do it asap or not
[07:14:34] ah already gone
[07:14:36] :)
[07:18:42] Analytics, Analytics-Kanban, Operations, Wikimedia-Mailing-lists, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (mforns) Thanks @ema!
[07:18:47] Analytics, Analytics-Kanban, Operations, Wikimedia-Mailing-lists, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (mforns)
[07:23:09] (CR) Fdans: "@Nuria done. Feel free to merge whenever." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[07:24:45] (CR) Fdans: [C: +1] "@Nuria we could use the simpler fade in/fade out like at the beginning?
I do think that rotating metrics provides better discovery, and we" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/494241 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[07:46:57] joal: bonjour!
[07:47:12] I'd like to switch back yarn to zookeeper if you agree
[07:47:36] I don't feel comfortable anymore with the rmstore on hdfs
[07:47:42] I created https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/499715/1/hieradata/common.yaml for the testing cluster
[07:48:02] (I am limiting the rmstore to 1000 in there since there is no point in keeping more)
[07:48:23] in zk we currently have
[07:48:24] ls /
[07:48:24] [burrow, zookeeper, yarn-leader-election, hadoop-ha, hive_zookeeper_namespace, kafka]
[07:48:34] so I thought to add 'yarn-rmstore'
[07:48:57] and as child znode the name of the cluster
[07:49:11] (happens the same for yarn-leader-election)
[07:51:59] follow up patch is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/499716
[07:52:12] my idea is to
[07:52:52] 1) apply the change to the zookeeper standby and restart it
[07:53:06] 2) restart the primary
[07:53:22] 3) finally restart again the secondary to restore things
[07:53:54] (on the testing cluster first, then prod of course)
[08:03:44] Morning elukey :)
[08:04:08] elukey: I'm assuming this action means we're not gonna have zk nodes soon :)
[08:09:31] (PS6) Joal: Correct names in mediawiki-history sql package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/498861
[08:11:17] joal: o/
[08:11:35] so I still haven't received any answer
[08:12:11] but, I thought that to move away from the main zk cluster we'll need to stop yarn to properly migrate the yarn leader election over
[08:12:20] so in that moment, the rmstore can be moved as well
[08:12:37] but I am not comfortable in say upgrading to 5.16.1 etc..
[08:12:43] without the zk rm store
[08:12:57] best case scenario the hosts will come mid q4 I'd say
[08:13:00] Analytics-Kanban: Fix mediawiki-history-checker after field rename - https://phabricator.wikimedia.org/T219484 (JAllemandou)
[08:13:31] so I'd prefer to close this chapter (my fault if we tried the hdfs rmstore sigh)
[08:13:37] makes sense elukey - upgrade and all may require a bunch of restarts (in worst case scenario), not facilitated by HDFS rm-store
[08:14:28] elukey: please don't blame yourself - We all could have spent more time investigating, and it seemed a good solution at the time
[08:14:41] I am happy that we know a lot more now
[08:14:49] elukey: let's proceed with the switch back - I'm here to help :)
[08:14:58] ack then, proceeding with test
[08:15:04] is the plan outlined good for you?
[08:15:06] elukey: learn it the hard way, you'll remember it forever :)
[08:15:09] secondary first etc.. ?
[08:15:54] elukey: There is something that is not clear for me with the plan: should restarting secondary with zk-store copy the HDFS data to zk?
[08:16:08] elukey: I'm inclined to think it won't
[08:16:43] Analytics-Kanban: Fix mediawiki-history-checker after field rename - https://phabricator.wikimedia.org/T219484 (JAllemandou) a:JAllemandou
[08:17:28] (PS3) Joal: Fix mediawiki-history-checker after field renamed [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499527 (https://phabricator.wikimedia.org/T219484)
[08:18:02] joal: nono IIRC it will start from scratch
[08:18:23] right, makes sense elukey - We'll lose the data stored in hdfs
[08:18:27] yeah
[08:18:42] no problem, better now than at beginning of month
[08:18:52] yep good point :)
[08:26:29] (CR) Joal: [C: +1] "One last nit for me: Can you please add the length-check in commit message?
Then merging :) Thanks again" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/498702 (https://phabricator.wikimedia.org/T144100) (owner: Awight)
[08:27:42] (CR) Joal: "Thanks for review - patch 5.1 needed before this one can go - Data-validation is on the way." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/498861 (owner: Joal)
[08:28:37] (CR) Joal: Fix mediawiki-history-checker after field renamed (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499527 (https://phabricator.wikimedia.org/T219484) (owner: Joal)
[08:29:37] [zk: localhost:2181(CONNECTED) 4] ls /yarn-rmstore
[08:29:37] [analytics-test-hadoop]
[08:29:44] \o/ :)
[08:30:02] beware Zookeeper, the elephant is back :D
[08:31:32] confirmed that:
[08:31:39] 1) rmstore starts from scratch
[08:31:49] 2) acls are set only for /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot
[08:31:59] so /yarn-rmstore can be reused
[08:33:10] all good even after the switchback of the master
[08:33:21] proceeding with prod
[08:34:57] elukey: one thing though
[08:35:18] elukey: ongoing job might die at RM-store change I assume
[08:36:07] joal: it didn't happen for testing though
[08:36:11] maybe not (yarn might be able to pick up state from ApplicationMasters), but maybe - Let's pick a time when number of ongoing jobs is small?
[08:36:20] yep yep
[08:36:23] I can stop camus
[08:36:48] elukey: I think restarting master just before the hour is enough in our case - Less burden
[08:38:47] already stopped, just in case :)
[08:39:01] super easy with timers :P
[08:39:09] :)
[08:39:11] systemctl stop camus-*.timer
[08:39:13] that's it
[08:39:17] (and puppet disabled)
[08:48:00] ok all set for restart
[08:48:17] will wait for the webrequest oozie job to finish
[09:05:30] elukey: GO GO GO :)
[09:05:33] yep!
[09:07:09] [zk: localhost:2181(CONNECTED) 14] ls /yarn-leader-election
[09:07:10] [analytics-hadoop, analytics-test-hadoop]
[09:07:31] all good, 1002 is the active
[09:07:33] here we go
[09:07:43] going to switch back in a minute
[09:11:09] we are back on track
[09:11:13] done :)
[09:11:35] \o/
[09:14:41] Analytics, Analytics-Kanban, Patch-For-Review: Improve speed and reliability of Yarn's Resource Manager failover - https://phabricator.wikimedia.org/T218758 (elukey)
[09:14:57] Analytics, Analytics-Kanban, Patch-For-Review: Improve speed and reliability of Yarn's Resource Manager failover - https://phabricator.wikimedia.org/T218758 (elukey)
[09:16:20] Analytics, Analytics-Kanban, Patch-For-Review: Improve speed and reliability of Yarn's Resource Manager failover - https://phabricator.wikimedia.org/T218758 (elukey) Decided to switch back to zookeeper as precautionary measure, I wasn't comfortable in using the hdfs rmstore anymore. The next step wil...
[09:21:59] Analytics, Operations, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (elukey) Open→Stalled Pending hardware procurement in https://phabricator.wikimedia.org/T217668
[09:22:01] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey)
[09:23:33] Analytics, Discovery-Search, Multimedia, Reading-Admin, and 3 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (elukey)
[09:23:37] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Open→Stalled All the info tracked in T216226. We are g...
[10:22:57] elukey: have you restarted camus?
[10:31:12] ouch no
[10:31:18] I suspected that :)
[10:31:22] thanks for the reminder, sorry
[10:31:45] no prob elukey - The cluster seemed too at ease :)
[10:33:32] all restored!
[10:33:46] Thanks elukey :)
[11:42:48] ah lovely email from oozie
[11:43:07] :(
[11:43:45] so nothing exploding in upload that I can see
[11:43:52] but we are running ATS
[11:43:58] for some hosts
[11:44:01] Interesting :)
[11:44:08] https://phabricator.wikimedia.org/phame/post/view/115/switching_production_traffic_to_apache_traffic_server/
[11:45:03] need to check on some hosts alarming if there was an issue
[11:45:33] elukey: if switch happened this morning, it might explain it!
[11:45:56] not that I know, it has been going on for a while
[11:47:00] from the vk metrics nothing weird
[11:47:34] elukey: I'm gonna try to rerun it, just in case - It feels weird the thing breaks just at the time we migrate yarn
[11:48:55] sure
[12:17:52] same error elukey :(
[12:19:40] lemme run your script to see what it says
[12:19:48] elukey: I'm doing it ;)
[12:20:24] as side note, we should start moving people off hive if possible
[12:20:31] right
[12:21:01] this is not the case of course but I was thinking about it when checking stuff
[12:21:10] :)
[12:21:26] there's also another thing to discuss when you have time about the hdfs user
[12:21:30] not super urgent :)
[12:21:58] anyway, back to the problem, let me know if there are any hosts in particular showing up
[12:22:08] it might be what dan kept seeing during the past days
[12:26:31] possible
[12:26:47] elukey: running an updated version of the script as data has not been refined
[12:27:58] ack
[12:34:31] elukey: it seems to be real errors: https://gist.github.com/jobar/395aae750ff9a695721fa035136cb81e
[12:39:58] very strange
[12:41:13] if they were real loss, I'd expect some varnishkafka issues no?
[12:41:24] I am trying to think about failure scenarios
[12:46:08] elukey: I have a bug in my one-off script - fixing and rerunning
[12:56:20] elukey: looking at errors, they seem to come (whether real or false positive, still checking) only from esams and eqsin
[12:57:14] joal: is it ok if I take a 1h break before starting to investigate?
[12:57:15] elukey: corrected version of the script says we have experienced almost only false positives
[12:58:54] elukey: Let's recombine tonight then (I'll be gone for kids and all later) - I think we can rerun the job with higher error threshold, but prefer to have the team opinion
[12:59:15] joal: can you update the gist so I can see which hosts are at fault?
[12:59:17] And it'd be great to understand why so many rows have ended in the next hour for those hosts
[12:59:20] I'll be able to investigate a bit
[12:59:50] elukey: updated
[12:59:54] thanks! :)
[12:59:58] going afk for a bit then, ttl!
[13:00:03] later
[13:11:23] Analytics, EventBus, Release Pipeline, serviceops, Services (watching): Modern Event Platform: Stream Intake Service: Documentation - https://phabricator.wikimedia.org/T219332 (akosiaris) >>! In T219332#5060344, @Ottomata wrote: > > And most specifically: > https://wikitech.wikimedia.org/wik...
[13:38:17] Analytics, Analytics-Kanban, EventBus, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: EventBus extension should never log unserialized events - https://phabricator.wikimedia.org/T218254 (mobrovac)
[13:59:15] Analytics, EventBus, Release Pipeline, serviceops, Services (watching): Modern Event Platform: Stream Intake Service: Documentation - https://phabricator.wikimedia.org/T219332 (Ottomata) Thanks, responded and made a couple of changes.
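Editorial sketch of the false-positive idea discussed above (rows that look lost in one hour but actually landed in the next hour). This is not the one-off script from the gist, which is not reproduced in this log. It assumes, hypothetically, that each host stamps its rows with an increasing sequence number; the function name `loss_report` and the host name `host-a` are made up for illustration.

```python
from collections import defaultdict


def loss_report(hour_rows, next_hour_rows):
    """Classify apparent per-host loss as false positive or real.

    Both arguments are iterables of (hostname, sequence) pairs.
    Assumption: sequence numbers increase by one per row on each host.
    """
    seen = defaultdict(set)
    for host, seq in hour_rows:
        seen[host].add(seq)

    late = defaultdict(set)
    for host, seq in next_hour_rows:
        late[host].add(seq)

    report = {}
    for host, seqs in seen.items():
        # Every number between the lowest and highest one seen is expected.
        expected = set(range(min(seqs), max(seqs) + 1))
        missing = expected - seqs
        # A missing row that shows up in the next hour was not lost.
        arrived_late = missing & late[host]
        report[host] = {
            "missing": len(missing),
            "false_positive": len(arrived_late),
            "real_loss": len(missing - arrived_late),
        }
    return report


if __name__ == "__main__":
    this_hour = [("host-a", 1), ("host-a", 2), ("host-a", 4), ("host-a", 6)]
    next_hour = [("host-a", 3), ("host-a", 7)]
    print(loss_report(this_hour, next_hour))
```

A report like this only counts gaps between the lowest and highest sequence seen in the hour, so rows missing at the very edges of the hour would need separate handling.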
[14:44:17] Analytics, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (elukey)
[14:48:57] Analytics, EventBus, WMF-JobQueue, Core Platform Team Kanban (Done with CPT), Services (done): Partition htmlCacheUpdate job topic - https://phabricator.wikimedia.org/T219159 (Pchelolo) Open→Resolved We have deployed the partitioner for the htmlCacheUpdate job and it's not running in...
[15:01:07] (PS1) Milimetric: Add hyw.wikipedia to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/499787
[15:02:38] (CR) Milimetric: [V: +2 C: +2] Add hyw.wikipedia to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/499787 (owner: Milimetric)
[15:05:55] oh, yall already looked at the upload job failures
[15:06:29] yeah I think it is a false positive event according to Joseph's investigation
[15:06:38] main question is what are those requests doing
[15:06:48] morning milimetric btw :)
[15:07:10] hi :)
[15:07:29] well, 28-9 was false positives, but 28-8 was killed and restarted by yall?
[15:07:39] or did you not look into that one, trying to figure it out from the chat above
[15:08:37] so this one just failed: https://hue.wikimedia.org/oozie/list_oozie_workflow/0164157-181112144035577-oozie-oozi-W/?coordinator_job_id=0070355-181112144035577-oozie-oozi-C&bundle_job_id=0070353-181112144035577-oozie-oozi-B
[15:08:56] elukey: if you're ok, I'll just hit rerun
[15:09:53] yep!
[15:10:06] the 28-8 was erroring so it needs a higher threshold
[15:10:14] joseph wanted to ask to the team before proceeding
[15:11:31] oh I see, it hits the data loss threshold and dies, yeah, if it's all false positives, that's fine
[15:11:57] Analytics, EventBus, Services (watching): EventGate should extract event time from events and produce to kafka with timestamp - https://phabricator.wikimedia.org/T219513 (Ottomata)
[15:12:04] maybe now that the false positive logic is pretty well vetted we can just incorporate it in the loss computation itself and report only when something is real loss
[15:15:27] milimetric: soooo in other news what are we doin about the animation?
[15:15:51] nobody seems to have strong opinions on it, but you said it looked bad on your screen so we can change it to fade?
[15:37:29] Analytics, Discovery, Operations, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria) Also, https://etherpad.wikimedia.org/p/moving-data-analytics-prod
[15:42:02] milimetric: don't know if you saw the latest comments on the patch, but I'd go with fade for now as uncontroversial
[15:42:18] I know you find it uneasy, but for now I think we can push fade and work a bit more on the animation in the meantime
[15:42:58] "you find it uneasy" => it makes you feel uneasy (don't know how to express myself, damn)
[15:44:16] fdans: you wanna take a quick look, I was playing with the icon
[15:44:38] brb
[15:44:44] milimetric: to the patch? or in the bc?
[15:45:51] fdans: cave
[15:46:02] omw
[15:55:25] milimetric: I'm back if you want to show me :)
[15:56:49] fdans: sure I'm in the cave
[15:59:02] (PS10) Milimetric: Add concept of metric groups, rotate in dashboard [analytics/wikistats2] - https://gerrit.wikimedia.org/r/494241 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[16:04:06] Analytics, Operations, ops-eqiad: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (RobH) p:Triage→Normal
[16:05:37] Analytics, Operations, ops-eqiad: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (RobH) So the main concern about this is if it will physically fit. Task T216226 - GPU upgrade for stat1005, has a sub-task T216528 where Chris took measurements of the inside of the chassis,...
[16:17:05] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 5 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 (mobrovac) Open→Resolved This has been completed very early in FY 18/19, yay!
[16:18:18] Analytics, Operations, ops-eqiad, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (elukey)
[16:45:45] Analytics: Replace current time range selector on Wikistats to allow for arbitrary time selections - https://phabricator.wikimedia.org/T219112 (Nuria)
[16:51:42] (PS11) Milimetric: Add concept of metric groups, rotate in dashboard [analytics/wikistats2] - https://gerrit.wikimedia.org/r/494241 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[16:52:38] fdans: that ^ is the 8 second delay, I guess merge that, rebase the matrix on it, and let me know if you all came to a conclusion after I left (sorry, today's just nuts)
[16:57:04] heya, sorry forgot to join!
[16:57:45] (CR) Nuria: [C: -1] Add ExternalGuidance event logging table to whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[17:04:38] (CR) Ottomata: Oozie: add article recommender (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[17:07:50] (CR) Ottomata: Oozie: add article recommender (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[17:14:41] Analytics, CirrusSearch, Discovery, Discovery-Search: Ingest cirrussearchrequest data into druid - https://phabricator.wikimedia.org/T218347 (debt)
[17:27:01] (PS2) Chelsyx: Add ExternalGuidance event logging table to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838)
[17:33:59] Analytics, ChangeProp, Community-Tech, EventBus, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (Krinkle) >>! In T218812#5062591, @Mooeypoo wrote: > [..] there are quite a number of other products [..] that r...
[17:38:40] (CR) Chelsyx: Add ExternalGuidance event logging table to whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[18:02:04] Analytics-Kanban, Product-Analytics: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (Neil_P._Quinn_WMF) >>! In T131280#5053679, @Nuria wrote: > @Yair_rand the public guidelines as to data retention are public in the privacy policy: http...
[18:28:36] * elukey off!
[18:33:12] nuria: i think the video on front page of https://www.hops.io/ is pretty good
[18:33:13] it's long
[18:33:17] i've only watched 20+mins of it
[18:33:30] ottomata: ok!
[18:33:36] but it explains some of the ML stuff, how it'd work with a notebook, distribute, use gpus, tensorflow + pyspark, etc.
[18:33:55] milimetric: we are good merging wikistats patches, want me to do a last sanity check and merge?
[18:34:16] ottomata: will watch today
[18:34:31] nuria: what did you all decide on for the "All metrics..." link
[18:34:57] milimetric: "All Metrics (no ellipsis), blue in color"
[18:34:58] nuria: this one can be merged now: https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/494241/
[18:35:17] nuria: ok, I'll do that on the next patch, rebase, and push that too for merging
[18:35:18] milimetric: let me build it
[18:35:19] then I'll deploy
[18:37:06] (CR) Nuria: [C: +2] Add concept of metric groups, rotate in dashboard [analytics/wikistats2] - https://gerrit.wikimedia.org/r/494241 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[18:37:19] milimetric: merging, sanity checking looks good
[18:37:27] milimetric: should i look at the other one?
[18:37:50] milimetric: this one: https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/498748/
[18:38:01] milimetric: doing that
[18:38:08] nuria: lemme rebase and fix the link first
[18:38:17] milimetric: i can do that too
[18:38:25] nuria: no worries, I was already in it
[18:38:37] milimetric: k, will check after next patch
[18:41:17] nuria: there's a regression with routing I just realized
[18:41:31] I'll try to fix it
[18:41:35] milimetric: k, i will check later today, no rush
[18:41:43] yeah, this might take a little bit
[18:41:55] ottomata: what is chris (danis) user?
[18:44:08] cdanis i think
[18:44:10] irc?
[18:44:50] Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Nuria)
[18:45:02] ottomata: ok, here is the ticket, please amend as needed: https://phabricator.wikimedia.org/T219544
[18:48:46] Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Ottomata) Oh ho hoooo https://hadoop.apache.org/docs/current/hadoop-openstack/index.html
[18:51:42] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Stream Intake Service: EventGate security review - https://phabricator.wikimedia.org/T208251 (sbassett)
[18:56:00] Analytics: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (ssingh)
[19:04:58] !log Manually rerun webrequest-load-wf-upload-2019-3-28-8 with higher error threshold (a lot of false positives!)
[19:05:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:07:14] (CR) Ottomata: Add SparkSchemaLoader capabilities to Refine and RefineTarget (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) (owner: Ottomata)
[19:24:05] NURIA!
[19:24:11] here i am working on sparkschemaloader
[19:24:15] and something isn't working!
[19:24:25] java.lang.ClassCastException: com.fasterxml.jackson.databind.node.TextNode cannot be cast to com.fasterxml.jackson.databind.node.ObjectNode
[19:25:09] so i fire up a spark shell
[19:25:11] to run my code
[19:25:21] and indeed, via the public methods i have
[19:25:22] it fails
[19:25:27] ok cool, but what is the value it is getting?
[19:25:42] ok i can call load() still, it is public, but i need to give it the direct URI to eventlogging schema in meta
[19:25:50] BUT
[19:25:58] we made the function that constructs the URI protected
[19:25:59] so I can't call it.
[19:26:09] so, i'm either going to manually construct the uri by hand now
[19:26:10] or
[19:26:22] or i'm going to copy/paste the function into the repl so I can call it
[19:29:42] ottomata: yessir
[19:30:19] wouldn't it be nice if i could just call the code directly? :)
[19:30:27] ottomata: i see, wait in what line do you get that exception
[19:30:39] nuria: i think this might be that format=json thing
[19:30:47] that we also ran into with eventlogging python
[19:30:55] i'll figure that out, but
[19:31:06] i'm just complaining that we made these methods non public
[19:31:08] now i can't use them!
[19:31:12] for debugging
[19:31:53] nuria: I don't think I'm being very productive, I think I need some more sleep
[19:32:03] I'll push what I have and -1 it, there's a problem with the routing
[19:32:11] ottomata: this seems a problem that a unit test should catch rather than repl
[19:32:15] milimetric: k
[19:32:24] the unit tests don't fetch from meta.wm.org
[19:32:25] ottomata: batcave?
[19:32:48] ottomata: ya, but this is just a text response we can feed to test no?
[19:33:21] i'm trying to find out what the response is
[19:33:23] via the repl
[19:33:25] by running code
[19:33:42] i think its probably returning html like it did for el the other day
[19:34:09] but, i can't isolate the offending function directly to see what it returns
[19:34:16] so i have to write up the code myself somehow in the repl
[19:34:21] ottomata: which would mean that error is just it is not valid json, only that because is java we get 30 line stacktrace
[19:34:24] rather than use helper functions that we protected
[19:34:45] it looks like it parsed the html as a single JSON text node
[19:34:47] ottomata: if they are protected a subclass can access them
[19:34:53] parsed the html
[19:35:03] nuria: so I should subclass the ELloader?
[19:35:10] i could do that I suppose
[19:35:12] so
[19:35:43] class debugloader extends elloader { public publicGetUrl(name) { return super.getUrl(name); } }
[19:36:19] val debugL = new debugloader();
[19:36:19] val schemaUrl = debugL.publicGetUrl(name);
[19:36:29] might as well just type it in or copy/paste the function into the repl at that point
[19:36:36] (btw, i'm just complaining here)
[19:37:00] i'll figure it out, but i'm just giving a real example of why making helper functions non public is annoying in practice
[19:37:21] btw, my guess was correct, adding format=json to the URI worked
[19:37:55] ottomata: k, i still think is worth testing on unit tests totally non valid json
[19:38:21] aye, it should just throw an exception sooner?
[19:38:25] if it doesn't get an object?
[19:39:18] ottomata: if the retrieved schema is grabage/empty/non valid it should throw an exception and we catch it and log an error yes
[19:39:25] ottomata: seems ok?
[19:39:30] *garbage
[19:39:35] ya
[19:39:37] that would have caught this
[19:39:49] but my point is... we aren't going to catch everything!
[19:39:55] with logging/guarding
[19:40:06] when we don't, figuring out what is going on is easier to do in the repl
[19:40:27] and if i can't call the code that is actually being run to inspect return values, etc., it is harder to debug stuff in prod
[19:40:38] e.g. nuria, the other day when this happened to eventlogging mysql consumer
[19:40:43] you and I both did the same thing
[19:40:48] we hot-fixed/edited the live code
[19:40:51] we can't do that with java
[19:40:51] ottomata: agreed but this is a very vanilla error for this code no?
[19:40:59] ottomata: we added logging
[19:41:10] ya, I agree we should guard and log stuff as much as possible.
[19:41:40] my point is that we won't catch it all, and when we don't being able to figure out what is going on in prod (so that we can then guard for it in code) is easier to do if functions are not locked away
[19:42:19] since we can't just edit code in prod to figure out what is going on
[19:42:27] repl is next best (better?) thing
[19:42:30] ottomata: i think proper logging would catch the majority of issues, for the few others that remain you are right we might need do what you just did
[19:43:06] ottomata: i have to say in the past what i would have done is set logging to debug, gotten a ton of text (basically step by step) and figure it out from there
[19:43:20] ottomata: not different than how we figure it out in hadoop no?
[19:43:27] ottomata: but repl is MORE useful
[19:43:32] that is true
[19:44:13] (PS5) Milimetric: Create metrics matrix component [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[19:44:43] IF your logging catches it
[19:45:02] and figuring out stuff in just hadoop from logging can be very hard
[19:45:37] (CR) Milimetric: "fdans - fixed two tricky issues with scoped styles and routing. Take a look at the diff of the last patch so you know what I mean.
The s" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[19:46:31] nuria: never mind, I figured it out
[19:46:47] nuria: so that change, https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/498748/ is ready for review
[19:47:03] as I mentioned there I fixed two somewhat bigger issues, take a look and if you +2 I'll deploy
[19:48:35] (PS1) Ottomata: chmod 755 some bin/ executables [analytics/refinery] - https://gerrit.wikimedia.org/r/499880
[19:49:04] (PS2) Ottomata: chmod 755 bin/yarn-logs [analytics/refinery] - https://gerrit.wikimedia.org/r/499880
[19:49:11] milimetric: looking
[19:49:18] (CR) Ottomata: [V: +2 C: +2] chmod 755 bin/yarn-logs [analytics/refinery] - https://gerrit.wikimedia.org/r/499880 (owner: Ottomata)
[19:53:10] milimetric: let me triple safe and try the prod build
[20:01:55] milimetric: I have the checker-error found :) Is now a good time to discuss it or next week?
[20:02:31] sure, omw cave
[20:02:31] (CR) Nuria: [C: +2] Create metrics matrix component [analytics/wikistats2] - https://gerrit.wikimedia.org/r/498748 (https://phabricator.wikimedia.org/T187806) (owner: Fdans)
[20:03:20] thx nuria
[20:07:30] Analytics, EventBus, Core Platform Team Backlog (Watching / External): Schema Registry HTTP Service - https://phabricator.wikimedia.org/T219552 (Ottomata)
[20:08:01] Analytics, EventBus, Core Platform Team Backlog (Watching / External): Schema Registry HTTP Service - https://phabricator.wikimedia.org/T219552 (Ottomata) I'd like to set this up sooner rather than later (Q4 hopefully) so that I can more easily use remotely hosted schemas for ingestion into Hive in {...
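Editorial sketch of the guard agreed on in the 19:37 to 19:39 exchange (fail early when the fetched schema is garbage, empty, or not a JSON object). The code under discussion is Java/Scala with Jackson; this Python version only illustrates the idea, and the names `parse_schema_response` and `InvalidSchemaError` are hypothetical, not part of refinery.

```python
import json


class InvalidSchemaError(ValueError):
    """Raised when a fetched schema body is not a usable JSON object."""


def parse_schema_response(body):
    """Parse a schema response body, failing early on anything but an object.

    A JSON string such as '"<html>..."' parses fine but is not a schema,
    which mirrors the TextNode-vs-ObjectNode cast error seen in the log.
    """
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError as exc:
        raise InvalidSchemaError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise InvalidSchemaError(
            f"expected a JSON object, got {type(parsed).__name__}"
        )
    if not parsed:
        raise InvalidSchemaError("schema object is empty")
    return parsed
```

A unit test can then feed plain text responses (HTML, an empty body, a bare JSON string) to the parser without fetching anything over the network, which is the kind of test suggested at 19:32:48.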
[20:08:29] Analytics, EventBus, Services, Core Platform Team Backlog (Watching / External): Schema Registry HTTP Service - https://phabricator.wikimedia.org/T219552 (Ottomata) [20:16:05] (PS1) Milimetric: Release 2.5.6 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499885 [20:17:03] (CR) Milimetric: [V: +2 C: +2] Release 2.5.6 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499885 (owner: Milimetric) [20:20:40] (Merged) jenkins-bot: Release 2.5.6 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499885 (owner: Milimetric) [20:29:27] :D [20:29:34] Converting JSONSchema field `test_map` to Spark dataType MapType(StringType,StringType,false) [20:29:41] |-- test_map: map (nullable = true) [20:29:41] | |-- key: string [20:29:41] | |-- value: string (valueContainsNull = false) [20:29:45] \o/ :) [20:33:28] Analytics, EventBus, Operations, Services, vm-requests: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) [20:35:59] (PS1) Ottomata: Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 [20:37:04] (CR) jerkins-bot: [V: -1] Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 (owner: Ottomata) [20:37:56] (PS9) Ottomata: Add SparkSchemaLoader capabilities to Refine and RefineTarget [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) [20:39:05] (CR) jerkins-bot: [V: -1] Add SparkSchemaLoader capabilities to Refine and RefineTarget [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) (owner: Ottomata) [20:39:33] (PS2) Ottomata: Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 [20:40:21] 
(PS3) Ottomata: Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 [20:41:25] (CR) jerkins-bot: [V: -1] Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 (owner: Ottomata) [20:43:15] (PS4) Ottomata: Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 [20:48:24] (CR) Ottomata: [C: +2] Fix EventLogging schema URI to include format=json [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499906 (owner: Ottomata) [20:49:28] (PS10) Ottomata: Add SparkSchemaLoader capabilities to Refine and RefineTarget [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) [20:56:59] joal: i think https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/494831/ can be merged [20:57:25] (CR) Joal: [C: +2] "Merging :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) (owner: Ottomata) [20:58:38] (PS7) Joal: Correct names in mediawiki-history sql package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/498861 [20:59:06] (PS4) Joal: Fix mediawiki-history-checker after field renamed [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499527 (https://phabricator.wikimedia.org/T219484) [20:59:06] yeeehaw danke [21:02:39] (Merged) jenkins-bot: Add SparkSchemaLoader capabilities to Refine and RefineTarget [analytics/refinery/source] - https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) (owner: Ottomata) [21:04:13] (PS1) Joal: [WIP] Fix for null-timestamps in checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499914 [21:04:32] milimetric: I wrote that, to keep it in mind --^ [21:05:23] great [21:05:23] And with that, I'm 
gone for a few days - See you team! [21:05:37] Have a really nice time! [21:12:01] Analytics, Operations, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (RobH) a:RobH→elukey As the hardware order is pending, and T219522 is setup for the installation, I'm reassigning this to @elukey as there is nothing more pending for me to do at thi... [21:12:51] Analytics, Operations, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (RobH) Basically this can be resolved as soon as @elukey is happy with it. It can stay open until after the new hardware is installed if preferred. [21:26:23] (PS1) Mforns: Modify mediawiki/history/druid job to ingest a simpler data set to druid [analytics/refinery] - https://gerrit.wikimedia.org/r/499917 (https://phabricator.wikimedia.org/T211173) [21:59:04] o/ all tty in a week+ [22:24:56] (CR) Mforns: [C: -1] "Still testing" [analytics/refinery] - https://gerrit.wikimedia.org/r/499917 (https://phabricator.wikimedia.org/T211173) (owner: Mforns) [22:25:10] (CR) Mforns: [C: -2] "Still testing" [analytics/refinery] - https://gerrit.wikimedia.org/r/499917 (https://phabricator.wikimedia.org/T211173) (owner: Mforns) [22:27:53] Is there somewhere i can view an ordered list of our projects by how many page views they get? (e.g. What's the 100 most popular wikis)? [22:36:32] (CR) Nuria: [C: +2] Fix mediawiki-history-checker after field renamed [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499527 (https://phabricator.wikimedia.org/T219484) (owner: Joal) [22:52:08] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Beta: Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") - https://phabricator.wikimedia.org/T187806 (Nuria) First wave of changes on this regard are now live, see, for example, a... 
[22:53:51] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Beta: Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") - https://phabricator.wikimedia.org/T187806 (Nuria) Open→Resolved