[00:13:07] 10Quarry, 10Operations, 10cloud-services-team: let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3782981 (10Dzahn)
[00:13:20] 10Quarry, 10Operations, 10cloud-services-team: let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3782996 (10Dzahn)
[02:19:53] 10Quarry, 10Operations, 10cloud-services-team (Kanban): let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3783068 (10bd808)
[09:33:16] Hi elukey :)
[09:33:20] morning!
[09:33:41] elukey: Shall I start an AQS big loading job with LOCAL_QUORUM?
[09:33:42] I am running puppet on the hadoop cluster for a huge refactoring, for the moment nothing has broken :P
[09:33:47] Yay !
[09:34:07] if you could wait 30m it would be great
[09:34:14] elukey: no problem at all
[09:34:21] \o/
[09:34:22] elukey: You'll tell me when you're ready?
[09:34:24] sure!
[09:53:17] of course I broke one thing on an1002
[09:53:18] uff
[10:00:34] fixed :)
[10:02:28] no-op on the rest
[10:02:33] joal: you are free to go :)
[10:07:06] elukey: proceeding !
[10:10:17] just noticed that /srv/geowiki/scripts/scripts/check_web_page.sh's alert is not the usual one
[10:10:24] I think it has stopped for some reason
[10:10:54] :(
[10:11:11] data in the private repo is up to 2017-11-12
[10:11:14] * elukey sigh
[10:11:53] elukey: this geowiki thing causes us a lot of trouble - I have no knowledge about it, but I think we should invest in fixing it
[10:13:35] ah I think it is the problem that I fixed yesterday with the analytics-slave domain
[10:13:55] it runs every day at 12 UTC
[10:14:04] so in a bit we should be able to see if it works or not
[10:19:15] elukey: Launching a loading job now
[10:19:51] elukey: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0148004-170829140538136-oozie-oozi-C/
[10:19:59] So far it prepares data
[10:20:26] niceeee
[10:23:22] joal: n00b question as always - you just created a new coordinator for a specific time window, right? (for already processed data)
[10:24:03] correct elukey :)
[10:25:43] \o/
[10:29:53] elukey: now cassandra loading starts
[10:35:27] INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20170824110001
[10:35:30] yessss
[10:35:31] it works
[10:35:50] hehe ! another great success for elukey!
[10:36:32] I'll call it closed on Monday if the weekend passes without any issue
[10:36:47] so looking forward to closing that task :D
[10:36:54] (03CR) 10Joal: Update cassandra load jobs to local quorum write (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/392624 (owner: 10Joal)
[10:37:42] elukey: https://gerrit.wikimedia.org/r/#/c/392624/ <-- can you please triple check that my response to Nuria's comment is ok?
[10:38:24] already did, it is super good (as always)
[10:39:18] \me blushes
[10:54:13] elukey: so far, I can't see any diff in the cassandra machines from using local quorum - only thing is, I am not sure it is applied !!!
[10:54:23] elukey: is there a way for us to check?
[10:58:13] 10Analytics, 10EventBus, 10Services (next): Support multiple partitions per topic in EventBus - https://phabricator.wikimedia.org/T157822#3783511 (10mobrovac)
[10:59:38] joal: not sure
[11:03:49] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3783530 (10GoranSMilovanovic) @Ottomata Thanks! Testing today, providing feedback here ASAP.
[11:06:34] 10Analytics-EventLogging, 10Analytics-Kanban, 10Readers-Web-Backlog (Tracking): Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3783534 (10elukey) 05Open>03Resolved
[11:17:42] 10Analytics-Kanban, 10Operations, 10hardware-requests: eqiad: (2) hardware request for jupyter notebook refresh (SWAP) - https://phabricator.wikimedia.org/T175603#3783540 (10faidon)
[12:26:51] * elukey lunch!
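[editor's note] The LOCAL_QUORUM consistency level discussed above requires acknowledgements from a quorum of replicas in the local datacenter. A minimal sketch of the replica math, assuming the common replication factor of 3 (the actual RF of the AQS keyspace is not stated in this log):

```python
def local_quorum_acks(replication_factor):
    """Replicas in the local DC that must ack a Cassandra LOCAL_QUORUM write."""
    return replication_factor // 2 + 1

# With RF=3, LOCAL_QUORUM waits for 2 local replicas instead of the 1
# required by consistency ONE, which is consistent with the small observed
# difference between the two levels mentioned later in the log.
```

With RF=3 this yields 2 acknowledgements, one more than consistency ONE requires.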
[12:42:32] 10Analytics-Kanban: Fix mediawiki history page reconstruction bug (similar timestamps) - https://phabricator.wikimedia.org/T179074#3783713 (10JAllemandou) Link not made to CR: https://gerrit.wikimedia.org/r/#/c/388267/
[12:45:20] 10Analytics-Kanban: Check data from new API endpoints agains existing sources - https://phabricator.wikimedia.org/T178478#3783717 (10JAllemandou) = Investigation Report = Naming conventions: MWH for the mediawiki-history computed metrics, WKS for the wikistats ones == Comparing MWH and WKS == - new registered...
[12:46:50] elukey: do you mind having a look at --^?
[13:07:14] 10Analytics-Kanban: Enhance mediawiki-history page reconstruction with best historical information possible - https://phabricator.wikimedia.org/T179692#3783763 (10JAllemandou)
[13:07:36] 10Analytics-Kanban: Fix mediawiki-history page reconstruction bug (deletes and restores) - simple patch - https://phabricator.wikimedia.org/T179690#3783764 (10JAllemandou)
[13:36:02] joal: which of the two things above the --^ :D ?
[13:37:15] elukey: https://phabricator.wikimedia.org/T178478#3783717
[13:37:17] Please :)
[13:37:39] Also elukey - compaction finished for cassandra - looks like nothing changed between one and quorum
[13:40:01] that is what we expected, goooood
[13:40:13] indeed elukey
[13:40:30] Let's wait for nuria_ to read my comment before merging
[13:40:39] elukey: we could deploy early next week :)
[13:41:33] +1
[13:41:53] the comments look good! I have very limited exposure to all that work but it looks understandable and clear
[13:42:27] elukey: thanks for reading - I tried to hide all the tedious work and only show problems and solutions :)
[13:42:50] :)
[13:43:01] second run of the el cleaner - INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20170825110001
[13:43:14] Awesome :)
[13:43:41] and 90d ago was exactly 25 Aug 2017
[13:43:51] I think the task is done
[13:44:16] Man - what should we be doing when a never-ending task actually ends?
[13:45:26] elukey: taking a break now, see you at standup :)
[13:45:30] ttl!
[13:46:08] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3096019 (10elukey) Verified that two runs of the eventlogging cleaner ran on db1108 without any issues.
[14:48:30] hello folks
[14:49:20] piwik has been returning 500s starting this UTC morning, it seems
[14:52:15] ema: o/ - yesterday ganeti1005 (on which bohrium runs) showed troubles and I think that they removed the instances from the node :(
[14:52:22] it might be because of that
[14:52:37] oh
[14:53:26] mmmm but from https://tools.wmflabs.org/sal/production?p=0&q=ganeti1005&d= they should have been migrated
[14:53:33] snap something might be wrong, checking, thanks!
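[editor's note] The eventlogging_cleaner run discussed above logged `end_ts 20170825110001` on 2017-11-23, which the chat confirms is "exactly 90d ago". The 90-day retention cutoff can be checked with stdlib datetime; this is a sketch of the arithmetic only, not the cleaner's actual implementation (the function name is hypothetical):

```python
from datetime import datetime, timedelta

def purge_cutoff(now, retention_days=90):
    # Rows older than the retention window are eligible for purging.
    return now - timedelta(days=retention_days)

# The run on 2017-11-23 11:00:01 logged end_ts 20170825110001,
# i.e. exactly 90 days back:
cutoff = purge_cutoff(datetime(2017, 11, 23, 11, 0, 1))
assert cutoff.strftime("%Y%m%d%H%M%S") == "20170825110001"
```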
[14:54:24] http://bit.ly/2jh0Bgp
[14:55:10] Error in Piwik (tracker): Connect failed: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[14:55:13] /o\
[14:55:40] wooops
[14:57:22] (03PS2) 10Fdans: [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661
[14:59:59] (03CR) 10jerkins-bot: [V: 04-1] [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (owner: 10Fdans)
[15:30:41] hey a-team (🇪🇺) I have a fever and I'm actually falling asleep on my desk, so I'm going to miss standup to rest for a bit and I'll be online later
[15:31:09] take care fdans :]
[15:31:50] yep :)
[15:53:00] we have a full outage for piwik, database broken
[15:53:23] I am working with Manuel to attempt a dump of the latest data, otherwise we'll need to pull it from bacula
[15:58:16] elukey: any help I can provide?
[16:00:28] joal: for the moment no, I am pretty useless too now, Manuel is trying to convince mysql to collaborate
[16:00:48] elukey: I don't want to be in between Manuel and a mysql db
[16:13:23] joal, we left you hanging in da cave, please ping us when you're back and we can sync if necessary
[16:17:32] Hey mforns / elukey - Sorry, I had to take that call
[16:17:55] joal: I dropped since Manuel is going afk soon, will try to rejoin later
[16:17:58] buuut
[16:18:02] surprise for you (still wip)
[16:18:02] https://grafana.wikimedia.org/dashboard/db/prometheus-druid
[16:19:05] Nothing special on my end, I documented the findings on data vetting (elukey confirmed it was not complete nonsense), I tested a big cassandra load job with local_quorum insertion successfully, and now will try to think of a correct way to fix our problem with restores
[16:19:28] elukey: That's really fantastic !
[16:19:45] \o/
[16:19:54] I can see Marcel loading segments
[16:19:56] :D
[16:20:39] we seem to be super lucky since the last (weekly) backup for piwik was yesterday
[16:20:42] root@bohrium:/srv/backups# ls -lht
[16:20:44] total 7.6G
[16:20:47] -rw-r----- 1 root root 2.8G Nov 22 02:19 piwik-201711220205.sql.gz
[16:21:54] elukey: not everything going wrong? weird
[16:38:27] joal: the ganeti host that was running bohrium died just now, back to square one
[16:38:32] you were saying?
[16:38:33] :D
[16:38:44] Ahhhh, this sounds correct now :) :(
[16:56:25] a-team - Forgot to mention -- Tomorrow I'll be at the warp10 meetup, and will work late in the evening
[16:56:39] aha
[17:37:17] ok new dashboard almost completed - https://grafana.wikimedia.org/dashboard/db/prometheus-druid
[17:37:33] I added the historical bits
[18:05:49] joal, help!
[18:20:28] all right piwik's database is importing now
[18:20:34] but we are still losing data
[18:20:36] it will take hours
[18:20:49] probably I'll be able to re-enable data only tomorrow morning
[18:35:09] all right email sent to all of you people
[18:35:15] :]
[18:36:21] elukey, can you kill a job for me in stat1004?
[18:37:04] it's a spark-submit, and I don't think I can kill it with my permissions, or sudo -u stats
[18:39:14] mforns: what do you mean kill it?
[18:39:32] is it a hadoop job?
[18:39:47] it's a spark-submit job
[18:40:02] not sure how to kill it
[18:40:20] is it running on the cluster now?
[18:41:11] doesn't seem so from yarn
[18:41:24] the only process that I can see on stat1004 is owned by your user
[18:41:31] you can kill it
[18:41:53] just did ps aux | grep spark, not sure if you meant something else
[18:43:08] mforns: ?
[18:43:41] elukey, can I just kill it without sudo?
[18:44:00] I tried sudo -u stats and it did not work, it asks me for a password
[18:45:47] mforns: you can just kill it with your username
[18:46:01] because it seems to be running as you, not stats
[18:47:13] right, done
[18:47:16] thanks for that
[18:47:32] goood
[18:47:43] elukey, so it seems it lost connection with the cluster no?
[18:48:00] and was still zombieing around
[18:48:08] no idea what it was doing :(
[18:48:14] k np
[18:52:34] going off team!
[18:52:37] * elukey off!
[18:52:46] byeeee
[19:05:48] Hey mforns - sorry I was gone for dinner :(
[19:05:54] mforns: you managed to kill the job?
[19:05:57] joal, np
[19:06:36] elukey helped me kill the job, I think the executors were all dead and my spark-submit process was a zombie
[19:07:03] mforns: also, the prod user on hadoop is hdfs, not stats :)
[19:14:28] joal, yea my bad
[19:14:54] np mforns - I was just pointing out the thing (and maybe you actually have rights with the hdfs user?)
[19:15:29] joal, I can kill processes launched by me in stat1004 without sudo
[19:15:41] correct sir, was just double checking
[19:15:49] I meant elukey gave me the tip, but I was the one to kill it
[19:16:08] I always knew you were a killer mforns ;)
[19:16:11] xD
[19:16:28] ah but this job, for some reason keeps failing...
[19:16:34] anything else you would have liked help for?
[19:17:00] I managed to load 2 weeks (2 jobs of 7 days each)
[19:17:13] but the third week is giving me problems
[19:17:36] how does it fail mforns ?
[19:17:39] still OOM ?
[19:17:43] 17/11/23 19:03:46 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 145300 ms exceeds timeout 120000 ms
[19:17:43] 17/11/23 19:04:31 WARN DFSClient: DataStreamer Exception
[19:17:44] java.io.IOException: Broken pipe
[19:18:05] mforns: do you have the app_id?
[19:20:13] joal, no, spark-submit doesn't give me one, and I can not find my job in yarn
[19:20:45] Ah, mforns - Can you paste me your spark-submit command?
[19:20:50] sure
[19:21:30] joal, I tried many things, but the last one is: https://pastebin.com/LBAMmP5G
[19:22:06] mforns: can you tell me more about the reduce-memory parameter?
[19:22:18] it's the one in the druid ingestion spec
[19:22:44] "jobProperties" : {
[19:22:45] | "mapreduce.reduce.memory.mb" : "{{REDUCE_MEMORY}}",
[19:22:45] Right makes sense
[19:23:14] Also to understand: this job prepares data, then sends a druid request to start indexation, correct?
[19:23:24] yes, exactly
[19:23:25] And in our failing case, it fails at data prep
[19:23:29] yes
[19:23:42] Ok - Add --master yarn to the spark command ;)
[19:23:43] the resulting data is small
[19:23:59] I don't understand why it takes so many resources, because the data is small overall
[19:24:14] the resulting data is 12MB per day!
[19:24:41] haven't checked the source data, but it should not be that big
[19:25:02] mforns: without "--master yarn", you actually don't run the thing in hadoop
[19:25:14] it runs in local mode on whatever stat machine you use
[19:25:17] aaaaaaaaaah!!!!!! I'm sooooooo stupid!
[19:25:39] No you're not, but that's the reason why you actually can't find your job in yarn
[19:25:47] myyyyyy goooooooooood
[19:27:22] mforns: Can you give me the dates of the failing job?
[19:27:38] you mean the date interval?
[19:27:43] correct
[19:27:56] the one I linked there has them
[19:28:04] --start-date 2017-11-01T12 \
[19:28:04] --end-date 2017-11-04T12 \
[19:29:50] executing again IN YARN
[19:30:14] mforns: And the previous imports you did were on previous dates (like, 2017-10-25->31 roughly?)
[19:30:53] yes: 2017-10-18T12 -> 2017-10-25T12
[19:30:55] mforns: I'm positively inclined to think this one will succeed
[19:31:11] and 2017-10-18T12 -> 2017-11-01T12
[19:31:20] why do you ask the dates?
[19:31:21] YES!
[19:31:30] so blind!
[19:31:44] mforns: I'm looking at the result of the command: hdfs dfs -du -s -h /user/tbayer/eventlogging_refine_test/Popups/year=2017/*/*
[19:32:00] joal, ah ok
[19:32:14] This tells me data starts to get bigger from 2017-10-23 onward
[19:32:40] So 2017-10-18T12 -> 2017-10-25T12 successful makes sense, but the other one, I'm surprised :)
[19:33:15] hm, I did a histogram in hive and data starts to grow considerably from the 19th
[19:33:21] Anyway - Doing this in yarn should ensure success ;)
[19:33:30] although on the 23rd there's a bigger jump
[19:33:35] hehehe, sure
[19:33:47] man... I'm a cluster white-belt
[19:34:06] mforns: Out of courtesy to other users, can you please also use "--conf spark.dynamicAllocation.maxExecutors=64", so that you don't eat all the cluster resources
[19:34:13] ok, it finished already \o/
[19:34:22] of course
[19:34:23] For that job, no bother, but for next ones
[19:34:39] And you can tweak the max exec value up to 128, no problem
[19:35:15] I'm so happy we have double-checked that the cluster actually serves some purpose ;)
[19:35:30] joal, current command: https://pastebin.com/vF5tLvaq
[19:35:46] Should work :)
[19:35:58] already did
[19:36:00] mforns: the yarn job succeeded, with druid loading and all?
[19:36:28] yes: https://tinyurl.com/y7svw22y
[19:36:54] yes mforns, was looking at that as well :)
[19:36:58] That's awesome
[19:37:55] ok, loading the remaining data
[19:38:08] mforns: you probably should be able to do it in one row
[19:38:16] in one pass I mean
[19:38:21] yea, I did that
[19:40:10] already loading
[19:40:15] :)
[19:40:33] another bit of info for you mforns - Spark2 is reallllly faster than 1.6 ;)
[19:40:37] :D
[19:41:03] awesome, I can create a task to translate this job to spark2
[19:41:40] mforns: For the moment it's not yet tested with oozie, but works like a charm for manual or cron use
[19:41:48] cool
[19:43:54] Ok - enough for today - Bye a-team - tomorrow, I'll be late because of the warp10 thing - hopefully I'll learn something along the way :)
[19:44:02] joal, finished, popups experiment is fully loaded
[19:44:11] joal, thanks a lot :]
[19:44:23] byeeeeee
[19:44:25] No prob mforns - You did 99% of the job ;)
[19:44:33] :]
[21:21:52] 10Analytics: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3784639 (10faidon) This came up again this week: I was looking into our network traffic in our various PoPs, to plan capacity and procure network links for eqsin (Singapore). There is traffic on o...
[21:22:39] 10Analytics: Create ops dashboard with info like ipv6 traffic split - https://phabricator.wikimedia.org/T138396#3784643 (10faidon) Also see T167907, for a similar request (from the network side of things).
[22:37:54] 10Analytics, 10Operations, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3784670 (10faidon)
[23:09:19] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3784689 (10GoranSMilovanovic) @Ottomata You're awesome! Everything seems to be running smoothly on stat1004. Marked as resolved.
[23:09:27] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3784690 (10GoranSMilovanovic) 05Open>03Resolved a:03GoranSMilovanovic
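[editor's note] The spark-submit fix discussed earlier in the log boils down to two flags: `--master yarn` (without it the job runs in local mode on the stat host and never appears in yarn) and a cap on dynamic executors. A sketch of the corrected invocation, assembled in Python for illustration; the job script name is a hypothetical stand-in for the actual command in the (elided) pastebin:

```python
# Sketch only: the real command lives in the pastebin linked in the log.
# The two flags shown are the ones joal recommended in the conversation.
spark_cmd = [
    "spark-submit",
    "--master", "yarn",  # submit to the Hadoop cluster instead of local mode
    "--conf", "spark.dynamicAllocation.maxExecutors=64",  # don't eat the whole cluster
    "load_popups_to_druid.py",  # hypothetical placeholder for the actual job script
    "--start-date", "2017-11-01T12",
    "--end-date", "2017-11-04T12",
]
print(" ".join(spark_cmd))
```

Running in yarn is also what makes the job visible via an application id, so failures like the earlier heartbeat timeouts can be investigated in the ResourceManager UI.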