[03:10:43] Analytics, Chinese-Sites: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3285365 (Shizhao)
[10:30:12] (PS1) Filippo Giunchedi: Merge branch 'master' into debian [analytics/kafkatee] - https://gerrit.wikimedia.org/r/355194
[10:30:14] (PS1) Filippo Giunchedi: Unblock signals in children processes, fixes cleanup of shell pipelines. [analytics/kafkatee] - https://gerrit.wikimedia.org/r/355195
[10:54:14] (CR) Filippo Giunchedi: [V: 2 C: 2] Merge branch 'master' into debian [analytics/kafkatee] - https://gerrit.wikimedia.org/r/355194 (owner: Filippo Giunchedi)
[10:54:25] (CR) Filippo Giunchedi: [V: 2 C: 2] Unblock signals in children processes, fixes cleanup of shell pipelines. [analytics/kafkatee] - https://gerrit.wikimedia.org/r/355195 (owner: Filippo Giunchedi)
[11:10:19] Thanks HaeB for the doc improvement :)
[11:10:34] Hi a-team, I'm back in (almost) normal mode
[11:12:09] o/
[11:12:22] elukey: Shall we go for a cluster deploy?
[11:12:59] Analytics-Kanban, Patch-For-Review: Correct uniques computation to not exclude countries that don't have either underestimates or offset - https://phabricator.wikimedia.org/T165661#3285871 (JAllemandou) a:JAllemandou
[11:15:19] joal: sure no objections.. I'll need to restart the jvms later on for upgrades :)
[11:15:25] k elukey
[11:15:44] joal: I am going out for lunch, if you could wait ~30 mins would be super great
[11:15:49] will be a long post-deploy time: many stuff to restart / rerun / care
[11:15:49] otherwise I'll stay a bit more
[11:16:10] elukey: sure, prepping the job now to be ready for when you're back
[11:17:09] super :)
[11:17:11] * elukey lunch!
[11:17:40] Analytics-Kanban: Provide uniques estimate/offset breakdowns in AQS - https://phabricator.wikimedia.org/T164593#3285877 (JAllemandou)
[11:48:17] * elukey back
[11:48:31] elukey: \o
[11:49:49] elukey: let me know when you feel ready ;)
[11:50:09] you can go Joseph
[11:51:11] elukey: phone, will start just after
[11:51:54] sure!
[11:52:58] I am going to set vm.dirty_background_bytes = 25165824 on aqs1004 as a test for https://gerrit.wikimedia.org/r/#/c/354107/
[11:53:20] atm we have it set to zero and vm.dirty_background_ratio = 10 (default)
[11:53:27] https://www.kernel.org/doc/Documentation/sysctl/vm.txt
[11:53:41] this is part of the efforts to merge the cassandra puppet code for restbase/aqs
[11:54:01] it shouldn't be a problem for us
[11:54:08] but I want to test it first
[12:00:43] ok just set it :)
[12:07:22] back elukey - I start deploying refinery
[12:10:12] !log Start refinery deployment
[12:10:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:24] Arf elukey - Forgot to add a message :(
[12:10:35] elukey: Will try to think about it next time sorry
[12:12:38] not a problem :)
[12:16:21] let me know if you need help with restarting
[12:17:00] elukey: I have my stuff planned, will do it thanks :)
[12:18:30] Arf, a patch was +2ed but not merged
[12:18:43] merging and redeploying, sorry for the noise
[12:24:27] elukey: deploy mess - no space left on stat1022 :(
[12:24:34] elukey: stat1002 sorry
[12:25:35] fixing it!
[12:25:40] sorry elukey :(
[12:26:01] if this is the worst that happens during a deployment I am really happy :)
[12:26:13] :)
[12:27:29] should be ok now!
[12:30:57] thanks elukey - dumb question: have you reset the diffids and so on in scap folders?
[12:32:56] joal: sorry I didn't get it
[12:33:05] I deleted the old revs dirs in the rev-cache
[12:33:27] elukey: when trying to deploy after a failure, sometimes there are folder issues in scap IIRC
[12:35:05] elukey: still makes no sense what I say, hu?
[12:35:25] not a lot :)
[12:35:30] hehe
[12:35:32] let's try to deploy and see if anything comes up :)
[12:36:06] elukey: I recall that scap copies the new copy of the deployed code to an uuid-based internal folder
[12:36:29] elukey: And, after a rollback and a problem, folder is there but content is not correct, and might cause problem
[12:36:36] elukey: But I might be dreaming all that
[12:37:38] joal: nono that's the rev-cache, the one that I've cleaned.. and the problem could be that the symlink to the last rev is not the one that you expect, we had a similar issue a while ago
[12:37:43] but it shouldn't be the case
[12:37:51] ok elukey, deploying then :)
[12:37:53] thanks again
[12:38:38] super
[12:43:57] elukey: I'm sorry I think we have an issue related to the thing we just discussed :(
[12:44:11] elukey: git fat jars not downloaded on stat1002
[12:44:23] elukey: the deploy process did not try to redeploy on stat1002
[12:44:25] :(
[12:45:17] can I try to deploy?
[12:45:21] please
[12:46:10] ottomata: helloooo whenever you're here we could deploy eventlogging?
[12:47:33] joal: better now?
[12:48:00] Yay !
[12:48:29] elukey: What have you done to fix? Redeploy?
[12:48:55] scap deploy --limit stat1002.eqiad.wmnet --force "Updated stat1002 with the last refinery deployment"
[12:48:58] brutal kick
[12:49:00] :D
[12:49:10] :D
[12:49:29] * joal loves when elukey uses his ops-hammer
[12:49:53] fdans: POZOR!
[12:50:06] OH NO
[12:53:42] aahahha
[12:54:15] how was the flight from Rome? Mr Trump is visiting and everything is super blocked now afaik
[12:56:57] !log Deploying refinery to HDFS
[12:56:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:29:41] fdans: sorry! with you shortly, am doing a buncha expense stuff
[13:30:50] elukey: delayed, but uneventful
[13:30:56] slept almost the whole flight
[13:31:56] (PS1) Joal: Correct mediawiki-history SLA bug [analytics/refinery] - https://gerrit.wikimedia.org/r/355221
[13:32:34] (PS2) Joal: Correct mediawiki-history SLA bug [analytics/refinery] - https://gerrit.wikimedia.org/r/355221 (https://phabricator.wikimedia.org/T164713)
[13:33:16] Analytics: Code Review Needed: New data produced on https://analytics.wikimedia.org/datasets/ - https://phabricator.wikimedia.org/T165944#3286179 (Addshore) As Tobi said code should be published to a public repo. If this is running regularly it should be added to puppet, for this the repo will have to be in...
[13:46:16] !log Restarted oozie druid uniques job after deploy
[13:46:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:47:01] !log Restarted oozie druid hourly pageview job after deploy
[13:47:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:50:16] !log Restarted oozie last_access_uniques jobs (daily + monthly) after deploy
[13:50:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:52:39] (PS3) Joal: Correct mediawiki-history SLA bug [analytics/refinery] - https://gerrit.wikimedia.org/r/355221 (https://phabricator.wikimedia.org/T164713)
[14:19:26] fdans: OOOok!
[14:19:36] ottomata: o/
[14:19:47] have you visited the church ???
[14:19:49] ottomata: BATICUEVA
[14:19:51] the bone church
[14:19:52] oh yeahhhhh
[14:20:07] haha "bone church"
[14:20:41] https://www.instagram.com/p/BUVPOiFjCet/?taken-by=ottomata
[14:20:56] ottomata: I'm at the batcave
[14:32:49] !log Start 1-off oozie jobs adding underestimate and offset values in historical archived uniques datasets
[14:32:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:37:21] ottomata: do you recall why we are using virtual disk with raid0 and one disk only on hadoop workers?
[14:37:33] I know that it should be a sort of JBOD
[14:37:35] elukey: i think we had to?
[14:37:53] i think it was the only way we could get it to be 'jbod'
[14:37:58] I thought so but kafka shows some adapter info with no virtual disks
[14:38:03] (the kafka hosts)
[14:38:27] hm, then no i don't know, maybe some historical reason? like earlier boxes we had made us do that
[14:38:34] probably...
[14:38:52] I am asking because Jaime is adding some alarms if the virtual disks are not configured with Write back
[14:39:03] and I found out that analytics1033 for some reason is configured with Write Through
[14:39:32] plus we have other inconsistencies across the worker nodes
[14:39:34] like WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
[14:40:04] err, this one is good
[14:40:13] but we have other little different configs
[14:40:26] might be only a matter of reviewing them and apply some sanitization
[14:40:50] (PS24) Ottomata: EventLogging JSON -> Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[14:41:19] elukey: +1
[14:41:51] i don't have a lot of context for you, sorry, usually they are just set up by chris and then we provision, so anything that happens (unless there is an issue) before boot/install i don't know much about
[14:42:58] no problem! I asked just in case, these settings might predate me and you :)
[14:44:40] * elukey opens a task
[14:48:13] Analytics, Operations, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3286412 (elukey)
[14:48:21] there you go --^
[14:59:36] Analytics, DBA: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (Marostegui)
[15:00:01] Analytics, DBA, Operations: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286450 (Marostegui)
[15:04:17] they don't predate me, but i def don't know much :)
[15:25:02] ottomata: sorry for having to leave, they just HAAAD to come in the middle of standup
[15:26:14] haha
[15:26:15] np
[15:26:20] fdans: i'm going to figure out what's wrong with this deploy
[15:26:26] but, can you follow up either on ticket or with uria
[15:26:27] nuria
[15:26:35] and figure out if we should deploy this thing now, or if we should send an announcement first?
[15:26:41] We have some trending on Roger Moore !
[15:27:06] ottomata: ok!
[15:28:18] Number of pageviews for last hour got multiplied by more than 20000 !!
[15:28:43] musikanimal: Hello !
[15:28:55] hey!
[15:29:20] fdans: do you have a link to the pastebin i sent you last week when we were hacking on this find function import * thing?
[15:29:27] wooo
[15:29:28] ah
[15:29:37] i do
[15:30:31] oh, hm, i found it
[15:30:34] sorry, in my history
[15:30:55] fdans: so the thing we merged today isn't using what I pasted there, did you get from module import * to work somehow?
[15:30:57] i thought that wasn't working
[15:31:04] https://pastebin.com/rMnmVtSR
[15:31:32] omg you're right, I don't think I committed those
[15:31:57] because we were having those issues with the internet I don't think I sent the changes
[15:32:09] ottomata ^
[15:32:34] ahh! :)
[15:32:37] I completely forgot about them
[15:32:39] sorry
[15:34:16] ottomata: I'll add that now 🙈
[15:36:21] k danke
[15:36:34] dunno how it passed in beta though, weird
[15:37:02] yeah strange
[15:52:34] ottomata: ok done, gonna test in beta a bit more thoroughly
[15:53:01] k
[16:27:46] joal: how often is an indexing/load task launched for druid?
[16:27:54] right now?
[16:28:09] ottomata: depends on jobs/datasource
[16:28:38] ya, but what is the most frequent?
[16:28:39] hourly?
[16:28:44] or daily?
[16:28:49] ottomata: I am currently indexing a lot on uniques, to recompute some, but it's a 1-off
[16:28:59] ottomata: most frequent is daily I think
[16:29:03] ok cool
[16:29:14] then i think when we upgrade, we can just make sure no task is running currently
[16:29:20] and restart middle managers
[16:29:26] sounds good ottomata
[16:29:39] ottomata: also some monthly, but should be good ;)
[16:29:42] aye
[16:29:53] elukey: i think this will be very easy.
[16:30:00] should we do this tomorrow my morning?
[16:30:22] Analytics, ChangeProp, EventBus, Services (later), User-mobrovac: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3286702 (Pchelolo)
[16:30:25] Analytics, ChangeProp, EventBus, Services (done): Create schema for Job event - https://phabricator.wikimedia.org/T157094#3286699 (Pchelolo) Open>Resolved The schema is merged and live in the EventBus
[16:33:10] ottomata: sure thing!
[16:33:31] I'll be out most of my morning (going to send an email) but I'll be there when you'll be online
[16:34:21] ok
[16:34:33] actually elukey i just realized we need to meet with service folks about kafka asap
[16:34:40] so i just scheduled a meeting for my time 10am tomorrow
[16:35:05] i guess we could do thursday morning
[16:35:56] +1 for me
[16:36:08] going afk, byeeee!
[16:38:49] oook, laters!
[17:12:06] ottomata: tested in beta 👌🏼 https://gerrit.wikimedia.org/r/#/c/355240/
[17:20:17] fdans: cool, 2 comments
[17:33:20] ottomata: fixed both :)
[17:51:30] Analytics, RESTBase, Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3286996 (Pchelolo) On the discussion on the hackathon with @JAllemandou we've decided to reuse spark infrastructure for this. @Pchelolo will develop a Scala-base...
[17:58:30] Analytics, Analytics-Cluster: Update puppet for new Kafka cluster and version - https://phabricator.wikimedia.org/T166162#3287005 (Ottomata)
[18:09:24] Analytics, RESTBase, Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3287043 (Ottomata) > which version of Scala should I use? @JAllemandou can answer better than I, but https://github.com/wikimedia/analytics-refinery-source/blob...
[18:45:59] joal, qq: in mediawiki hive-data, we should also keep the last 2 snapshots of private version of the data?
[18:46:13] Hey mforns
[18:46:17] hey :]
[18:47:09] mforns: private snapshots are not regular, and we have no real convention about names nor frequency - I'd say let's remove them
[18:47:24] joal, K thanks!
[18:47:28] I'd say sorry, let's not touch them automatically
[18:47:31] mforns: --^
[18:47:40] oh! too late, I've removed them
[18:47:40] mforns: Maaaaaan - Difficult evening
[18:47:46] xD no, just kidding
[18:47:56] O.o
[18:47:58] :D
[18:48:01] ok, ok, will leave them to be handled manually
[18:48:08] sorry mforns
[18:48:13] hehe thx!
[19:01:10] (PS1) Joal: Upgrade jar version for restbase job [analytics/refinery] - https://gerrit.wikimedia.org/r/355266 (https://phabricator.wikimedia.org/T163479)
[19:13:49] joal: o/ - restbase-wf-2017-5-23-15 emails are due to the deployment?
[19:14:03] correct elukey
[19:14:21] my bad, was willing to send an email and got sucked into debugging
[19:14:26] elukey: --^
[19:16:40] elukey: just replied to the email
[19:16:57] nono sorry didn't want to rush you just help :
[19:16:58] :(
[19:17:03] fdans: merged, lemme know if you find out anything about announcement being necessary
[19:17:28] no prob elukey, it's better with an email, I just forgot :)
[19:17:45] thanks for all the work, really sorry that you are still working :(
[19:17:57] elukey: I'm on uniques now, don't worry
[19:18:27] elukey: there were 2 bugs in that deploy, and I authored them, so I'm actually glad I released it and found them myself (less ashamed)
[19:18:56] :)
[19:20:03] Analytics-Kanban, Easy, Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3287226 (fdans) @Nuria @Tbayer is there anything we should announce before deploying this change?
[19:20:17] Gone for now a-team - Bugs found and CR provided on deploy, but still no luck in explaining my last-access-uniques problem :(
[19:20:22] See you tomorrow
[19:20:29] have a good evening Joseph!
[19:20:34] bye joal !
[19:31:48] Analytics, Analytics-Cluster: Genericize ca-manager script - https://phabricator.wikimedia.org/T166167#3287261 (Ottomata)
[19:32:10] Analytics, Analytics-Cluster: Genericize ca-manager script - https://phabricator.wikimedia.org/T166167#3287261 (Ottomata)
[22:49:54] Analytics-Kanban, Easy, Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3287832 (Tbayer) @fdans: Yes, since this is going to affect the results of various queries (even though it's by improving their accuracy), people working with th...