[00:13:07] <wikibugs>	 10Quarry, 10Operations, 10cloud-services-team: let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3782981 (10Dzahn)
[00:13:20] <wikibugs>	 10Quarry, 10Operations, 10cloud-services-team: let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3782996 (10Dzahn)
[02:19:53] <wikibugs>	 10Quarry, 10Operations, 10cloud-services-team (Kanban): let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3783068 (10bd808)
[09:33:16] <joal>	 Hi elukey :)
[09:33:20] <elukey>	 morning! 
[09:33:41] <joal>	 elukey: Shall I start an AQS big loading job with LOCAL_QUORUM?
[09:33:42] <elukey>	 I am running puppet on the hadoop cluster for a huge refactoring, for the moment nothing broke :P
[09:33:47] <joal>	 Yay !
[09:34:07] <elukey>	 if you could wait 30m it would be great
[09:34:14] <joal>	 elukey: no problem at all
[09:34:21] <elukey>	 \o/
[09:34:22] <joal>	 elukey: You tell me when you're ready?
[09:34:24] <elukey>	 sure!
[09:53:17] <elukey>	 of course I broke one thing on an1002
[09:53:18] <elukey>	 uff
[10:00:34] <elukey>	 fixed :)
[10:02:28] <elukey>	 no-op on the rest
[10:02:33] <elukey>	 joal: you are free to go :)
[10:07:06] <joal>	 elukey: proceeding !
[10:10:17] <elukey>	 just noticed that  /srv/geowiki/scripts/scripts/check_web_page.sh's alert is not the usual one
[10:10:24] <elukey>	 I think it has stopped for some reason
[10:10:54] <joal>	 :(
[10:11:11] <elukey>	 data in the private repo is up to 2017-11-12
[10:11:14] * elukey sigh
[10:11:53] <joal>	 elukey: this geowiki think causes us a lot of trouble - I have no knowledge about it, but I think we should invest into fixing it
[10:13:35] <elukey>	 ah I think it is the problem that I've fixed yesterday about the analytics-slave domain
[10:13:55] <elukey>	 it runs everyday at 12UTC
[10:14:04] <elukey>	 so in a bit we should be able to see if it works or not
[10:19:15] <joal>	 elukey: Launching now a loading job
[10:19:51] <joal>	 elukey: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0148004-170829140538136-oozie-oozi-C/
[10:19:59] <joal>	 So far it prepares data
[10:20:26] <elukey>	 niceeee
[10:23:22] <elukey>	 joal: n00b question as always - did you just create a new coordinator for a specific time window right? (for already processed data)
[10:24:03] <joal>	 correct elukey :)
[10:25:43] <elukey>	 \o/
[10:29:53] <joal>	 elukey: now cassandra loading starts
[10:35:27] <elukey>	 INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20170824110001
[10:35:30] <elukey>	 yessss
[10:35:31] <elukey>	 it works
[10:35:50] <joal>	 hehe ! another great success for elukey!
[10:36:32] <elukey>	 I'll call it closed on Monday if the weekend pass without any issue
[10:36:47] <elukey>	 so looking forward to close that task :D
[10:36:54] <wikibugs>	 (03CR) 10Joal: Update cassandra load jobs to local quorum write (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/392624 (owner: 10Joal)
[10:37:42] <joal>	 elukey: https://gerrit.wikimedia.org/r/#/c/392624/ <-- can you please triple check my response to Nuria's comment is ok?
[10:38:24] <elukey>	 already did, it is super good (as always)
[10:39:18] <joal>	 \me blushes
[10:54:13] <joal>	 elukey: so far, I can't see any diff in cassandra machines from using local quorum - only thing is, I am not sure it is applied !!!
[10:54:23] <joal>	 elukey: is there a way for us to check?
[10:58:13] <wikibugs>	 10Analytics, 10EventBus, 10Services (next): Support multiple partitions per topic in EventBus - https://phabricator.wikimedia.org/T157822#3783511 (10mobrovac)
[10:59:38] <elukey>	 joal: not sure
[11:03:49] <wikibugs>	 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3783530 (10GoranSMilovanovic) @Ottomata Thanks! Testing today, providing feedback here ASAP.
[11:06:34] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Readers-Web-Backlog (Tracking): Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana - https://phabricator.wikimedia.org/T174815#3783534 (10elukey) 05Open>03Resolved
[11:17:42] <wikibugs>	 10Analytics-Kanban, 10Operations, 10hardware-requests: eqiad: (2) hardware request for jupyter notebook refresh (SWAP) - https://phabricator.wikimedia.org/T175603#3783540 (10faidon)
[12:26:51] * elukey lunch!
[12:42:32] <wikibugs>	 10Analytics-Kanban: Fix mediawiki history page reconstruction bug (similar timestamps) - https://phabricator.wikimedia.org/T179074#3783713 (10JAllemandou) Link not made to CR: https://gerrit.wikimedia.org/r/#/c/388267/
[12:45:20] <wikibugs>	 10Analytics-Kanban: Check data from new API endpoints agains existing sources - https://phabricator.wikimedia.org/T178478#3783717 (10JAllemandou) = Investigation Report =  Naming conventions: MWH for the mediawiki-history computed metrics, WKS for the wikistats ones  == Comparing MWH and WKS ==  - new registered...
[12:46:50] <joal>	 elukey: do you mind having a look at --^?
[13:07:14] <wikibugs>	 10Analytics-Kanban: Enhance mediawiki-history page reconstruction with best historical information possible - https://phabricator.wikimedia.org/T179692#3783763 (10JAllemandou)
[13:07:36] <wikibugs>	 10Analytics-Kanban: Fix mediawiki-history page reconstruction bug (deletes and restores) - simple patch - https://phabricator.wikimedia.org/T179690#3783764 (10JAllemandou)
[13:36:02] <elukey>	 joal: what of the two things above the --^ :D ?
[13:37:15] <joal>	 elukey: https://phabricator.wikimedia.org/T178478#3783717
[13:37:17] <joal>	 Please :)
[13:37:39] <joal>	 Also elukey - compaction finished for cassandra - looks like nothing changed between one and quorum
[13:40:01] <elukey>	 that is what we expected, goooood
[13:40:13] <joal>	 indeed elukey
[13:40:30] <joal>	 Let's wait for nuria_ to read my comment before merging
[13:40:39] <joal>	 elukey: we could deploy early next week :)
[13:41:33] <elukey>	 +1
[13:41:53] <elukey>	 the comments looks good! I have a very limited exposure to all that work but it looks understandable and clear
[13:42:27] <joal>	 elukey: thanks for reading - I tried to hide all the tedious work and only show problems and solutions :)
[13:42:50] <elukey>	 :)
[13:43:01] <elukey>	 second run of the el cleaner - INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20170825110001
[13:43:14] <joal>	 Awesome :)
[13:43:41] <elukey>	 and 90d ago was exactly 25 Aug 2017
[13:43:51] <elukey>	 I think the task is done
[13:44:16] <joal>	 Man - what should we be doing when a never-ending task actually ends?
[13:45:26] <joal>	 elukey: taking a break now, see you at standup :)
[13:45:30] <elukey>	 ttl! 
[13:46:08] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging  for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3096019 (10elukey) Verified that two runs of the eventlogging cleaner ran on db1108 without any issues.
[14:48:30] <ema>	 hello folks
[14:49:20] <ema>	 piwik has been returning 500s starting this utc morning, it seems
[14:52:15] <elukey>	 ema: o/ - yesterday ganeti1005 (on which bohrium runs) showed troubles and I think that they removed the instances from the node :(
[14:52:22] <elukey>	 it might because of that
[14:52:37] <ema>	 oh
[14:53:26] <elukey>	 mmmm but from https://tools.wmflabs.org/sal/production?p=0&q=ganeti1005&d= they should have been migrated
[14:53:33] <elukey>	 snap something might be wrong, checking, thanks!
[14:54:24] <ema>	 http://bit.ly/2jh0Bgp
[14:55:10] <elukey>	 Error in Piwik (tracker): Connect failed: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[14:55:13] <elukey>	  /o\
[14:55:40] <ema>	 wooops
[14:57:22] <wikibugs>	 (03PS2) 10Fdans: [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661
[14:59:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (owner: 10Fdans)
[15:30:41] <fdans>	 hey a-team (🇪🇺) I have a fever and I'm actually falling asleep on my desk, so I'm going to miss standup to rest for a bit and I'll be online later
[15:31:09] <mforns>	 take care fdans :]
[15:31:50] <elukey>	 yep :)
[15:53:00] <elukey>	 we have a full outage for piwik, database broken
[15:53:23] <elukey>	 I am working with Manuel to attempt a dump of the latest data, otherwise we'll need to pull it from bacula
[15:58:16] <joal>	 elukey: any help I can provide?
[16:00:28] <elukey>	 joal: for the moment no, I am pretty useless too now, Manuel is trying to convince mysql to collaborate
[16:00:48] <joal>	 elukey: I don't to be in between Manuel and a mysql db
[16:13:23] <mforns>	 joal, we left you hanging in da cave, please ping us when you're back and we can sync if necessary
[16:17:32] <joal>	 Hey mforns / elukey  - Sorry, I had to take that call
[16:17:55] <elukey>	 joal: I dropped since Manuel is going afk soon, will try to rejoin later
[16:17:58] <elukey>	 buuut
[16:18:02] <elukey>	 surprise for you (still wip)
[16:18:02] <elukey>	 https://grafana.wikimedia.org/dashboard/db/prometheus-druid
[16:19:05] <joal>	 Nothing special on my end, I documented the findings on data vetting (elukey confirmed it was not complete nonsense), I tested a big cassandra load job with local_quorum insertion succesfully, and now will try to think of a correct way to fix our problem with restores
[16:19:28] <joal>	 elukey: That's really fantastic !
[16:19:45] <elukey>	 \o/
[16:19:54] <elukey>	 I can see Marcel loading segments
[16:19:56] <elukey>	 :D
[16:20:39] <elukey>	 we seem to be super lucky since the last (weekly) backup for piwik has been yesterday
[16:20:42] <elukey>	 root@bohrium:/srv/backups# ls -lht
[16:20:44] <elukey>	 total 7.6G
[16:20:47] <elukey>	 -rw-r----- 1 root root 2.8G Nov 22 02:19 piwik-201711220205.sql.gz
[16:21:54] <joal>	 elukey: not everything going wrong? weird
[16:38:27] <elukey>	 joal: the ganeti host that was running bohrium died now, back to square one
[16:38:32] <elukey>	 you were saying?
[16:38:33] <elukey>	 :D
[16:38:44] <joal>	 Ahhhh, this sounds correct now :) :(
[16:56:25] <joal>	 a-team - Forgot to mention -- Tomorrow I'll be at warp10 meetup, and will work late in the evening
[16:56:39] <mforns>	 aha
[17:37:17] <elukey>	 ok new dashboard almost completed - https://grafana.wikimedia.org/dashboard/db/prometheus-druid
[17:37:33] <elukey>	 I added the historical bits
[18:05:49] <mforns>	 joal, help!
[18:20:28] <elukey>	 all right piwik's database is importing now
[18:20:34] <elukey>	 but we are still loosing data
[18:20:36] <elukey>	 it will take hours
[18:20:49] <elukey>	 probably I'll be able to re-enable data only tomorrow morning
[18:35:09] <elukey>	 all right email sent to all of you people
[18:35:15] <mforns>	 :]
[18:36:21] <mforns>	 elukey, can you kill a job for me in stat1004?
[18:37:04] <mforns>	 it's a spark-submit, and I don't think I can kill it with my permits, or sudo -u stats
[18:39:14] <elukey>	 mforns: what do  you mean kill it?
[18:39:32] <elukey>	 is it a hadoop job?
[18:39:47] <mforns>	 it's a spark-submit job
[18:40:02] <mforns>	 not sure how to kill it
[18:40:20] <elukey>	 is it running on the cluster now?
[18:41:11] <elukey>	 doesn't seem so from yarn
[18:41:24] <elukey>	 the only process that I can see on stat1004 is owned by your user
[18:41:31] <elukey>	 you can kill it
[18:41:53] <elukey>	 just did ps aux  | grep spark, not sure if you meant something else
[18:43:08] <elukey>	 mforns: ?
[18:43:41] <mforns>	 elukey, can I just kill without sudo?
[18:44:00] <mforns>	 I tried sudo -u stats and it did not work, it asks me password
[18:45:47] <elukey>	 mforns: you can just kill with your username
[18:46:01] <elukey>	 because it seems running from you not stats
[18:47:13] <mforns>	 right, done
[18:47:16] <mforns>	 thanks for that
[18:47:32] <elukey>	 goood
[18:47:43] <mforns>	 elukey, so it seems it lost connection with the cluster no?
[18:48:00] <mforns>	 and still was zombieing around
[18:48:08] <elukey>	 no idea what it was doing :(
[18:48:14] <mforns>	 k np
[18:52:34] <elukey>	 going off team!
[18:52:37] * elukey off!
[18:52:46] <mforns>	 byeeee
[19:05:48] <joal>	 Hey mforns - sorry I was gone for diner :(
[19:05:54] <joal>	 mforns: you manged to kill the job?
[19:05:57] <mforns>	 joal, np
[19:06:36] <mforns>	 elukey helped me kill the job, I think the executors were all dead and my spark-submit process was zombie
[19:07:03] <joal>	 mforns: also, the prod user on hadoop is hdfs, not stats :)
[19:14:28] <mforns>	 joal, yea my bad
[19:14:54] <joal>	 np mforns - I just just pointing the thing (and maybe you actually have rights with hdfs user?)
[19:15:29] <mforns>	 joal, I can kill processes launched by me in stat1004 without sudo
[19:15:41] <joal>	 correct sir, was just double checkin
[19:15:49] <mforns>	 I meant elukey gave me the tip, but I was the one to kill
[19:16:08] <joal>	 I always knew you were a killer mforns ;)
[19:16:11] <mforns>	 xD
[19:16:28] <mforns>	 ah but this job, for some reason keeps failing...
[19:16:34] <joal>	 anything else you would havbe liked help for?
[19:17:00] <mforns>	 I managed to load 2 weeks (2 jobs of 7 days each)
[19:17:13] <mforns>	 but the third week is giving me problems
[19:17:36] <joal>	 how does it fail mforns ?
[19:17:39] <joal>	 still OOM ?
[19:17:43] <mforns>	 17/11/23 19:03:46 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 145300 ms exceeds timeout 120000 ms
[19:17:43] <mforns>	 17/11/23 19:04:31 WARN DFSClient: DataStreamer Exception
[19:17:44] <mforns>	 java.io.IOException: Broken pipe
[19:18:05] <joal>	 mforns: do you have the app_id?
[19:20:13] <mforns>	 joal, no, spark-submit doesn't give me one, and I can not find my job in yarn
[19:20:45] <joal>	 Ah, mforns - Can you past me your spark-submit command?
[19:20:50] <mforns>	 sure
[19:21:30] <mforns>	 joal, I tried many things, but the last one is: https://pastebin.com/LBAMmP5G
[19:22:06] <joal>	 mforns: can you tell me more about the reduce-memory parameter/
[19:22:07] <joal>	 ?
[19:22:18] <mforns>	 it's the one in the druid ingestion spec
[19:22:44] <mforns>	 "jobProperties" : {
[19:22:45] <mforns>	     |                "mapreduce.reduce.memory.mb" : "{{REDUCE_MEMORY}}",
[19:22:45] <joal>	 Right makes sense
[19:23:14] <joal>	 Also to understand: this job prepares data, then send a druid request to start indexation, correct?
[19:23:24] <mforns>	 yes, exactly
[19:23:25] <joal>	 ASnd in our failing case, it fails at data prerp
[19:23:29] <mforns>	 yes
[19:23:42] <joal>	 Ok - Add --master yarn to spark command ;)
[19:23:43] <mforns>	 the resulting data is small
[19:23:59] <mforns>	 I don't understand why it takes so many resources, because the data is small overall
[19:24:14] <mforns>	 the resulting data is 12MB per day!
[19:24:41] <mforns>	 haven't checked the source data, but it should not be that big
[19:25:02] <joal>	 mforns: without "--master yarn", you actually don't run the thing in hadoop
[19:25:14] <joal>	 it runs in local mode on whatever stat machine you use
[19:25:17] <mforns>	 aaaaaaaaaah!!!!!! I'm sooooooo stupid!
[19:25:39] <joal>	 No you're not, but that's the reason for which you actually can't find your job in yarn
[19:25:47] <mforns>	 myyyyyy goooooooooood
[19:27:22] <joal>	 mforns: Can you give the dates of the failin job?
[19:27:38] <mforns>	 you mean the date interval?
[19:27:43] <joal>	 correct
[19:27:56] <mforns>	 the one I linked there has them
[19:28:04] <mforns>	 --start-date 2017-11-01T12 \
[19:28:04] <mforns>	 --end-date 2017-11-04T12 \
[19:29:50] <mforns>	 executing again IN YARN
[19:30:14] <joal>	 mforns: And the previous imports ou did were on previous dates (like, 2017-10-25->31 roughly?)
[19:30:53] <mforns>	 yes: 2017-10-18T12 -> 2017-10-25T12
[19:30:55] <joal>	 mforns: I'm positively enclined to think this one will succeed
[19:31:11] <mforns>	 and 2017-10-18T12 -> 2017-11-01T12
[19:31:20] <mforns>	 why you ask the dates?
[19:31:21] <mforns>	 YES!
[19:31:30] <mforns>	 so blind!
[19:31:44] <joal>	 mforns: I'm looking at he result of command: hdfs dfs -du -s -h /user/tbayer/eventlogging_refine_test/Popups/year=2017/*/*
[19:32:00] <mforns>	 joal,  ah ok
[19:32:14] <joal>	 This tells me data starts to get bigger from 2017-10-23 onward
[19:32:40] <joal>	 So 2017-10-18T12 -> 2017-10-25T12 succesfull, makes sense, but the other one, I'm surprised :)
[19:33:15] <mforns>	 hm, I did a histogram in hive and data starts to grow considerable from the 19th
[19:33:21] <joal>	 Anyway - Doing this in yarn should ensure success ;)
[19:33:30] <mforns>	 althoug 23rd there's a bigger jump
[19:33:35] <mforns>	 hehehe, sure
[19:33:47] <mforns>	 man... I'm a cluster white-belt
[19:34:06] <joal>	 mforns: By civility for other users, can you please also use: "--conf spark.dynamicAllocation.maxExecutors=64", like that you don't eat all he cluster resources 
[19:34:13] <mforns>	 ok, it finished already \o/
[19:34:22] <mforns>	 of course
[19:34:23] <joal>	 For that job, no bother, but for next ones
[19:34:39] <joal>	 And you can tweak the max exec value up to 128, no problem
[19:35:15] <joal>	 I'm so happy we have double checked cluster actually serves some purpose ;)
[19:35:30] <mforns>	 joal, current command: https://pastebin.com/vF5tLvaq
[19:35:46] <joal>	 Should work :)
[19:35:58] <mforns>	 already did
[19:36:00] <joal>	 mforns: the yarn job succeeded, with druid loading and all?
[19:36:28] <mforns>	 yes: https://tinyurl.com/y7svw22y
[19:36:54] <joal>	 yes mforns, was looking at that as well ')
[19:36:58] <joal>	 That's awesonme
[19:37:55] <mforns>	 ok, loading the remaining data
[19:38:08] <joal>	 mforns: you probably should be able to do it in one row
[19:38:16] <joal>	 in one pass I mean
[19:38:21] <mforns>	 yea, I did that
[19:40:10] <mforns>	 already loading
[19:40:15] <joal>	 :)
[19:40:33] <joal>	 another info for you mforns - Spark2 is reallllly faster than 1.6 ;)
[19:40:37] <joal>	 :D
[19:41:03] <mforns>	 awesome, I can create a task to translate this job to spark2
[19:41:40] <joal>	 mforns: For the moment it's not yet tested with oozie, but works like a charm for manual or cron use
[19:41:48] <mforns>	 cool
[19:43:54] <joal>	 Ok - enough for today - Bye a-team - tomorrow, I'll be late because of the warp10 thing - hopefully I'll learn some along the way :)
[19:44:02] <mforns>	 joal, finished, popups experiment is fully loaded
[19:44:11] <mforns>	 joal, thanks a lot :]
[19:44:23] <mforns>	 byeeeeee
[19:44:25] <joal>	 No prob mforn - You did 99% of the job ;)
[19:44:33] <mforns>	 :]
[21:21:52] <wikibugs>	 10Analytics: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3784639 (10faidon) This came up again this week: I was looking into our network traffic in our various PoPs, to plan capacity and procure network links for eqsin (Singapore). There is traffic on o...
[21:22:39] <wikibugs>	 10Analytics: Create ops dashboard with info like ipv6 traffic split - https://phabricator.wikimedia.org/T138396#3784643 (10faidon) Also see T167907, for a similar request (from the network side of things).
[22:37:54] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3784670 (10faidon)
[23:09:19] <wikibugs>	 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3784689 (10GoranSMilovanovic) @Ottomata You're awesome! Everything seems to be running smoothly on stat1004.   Marked as resolved.
[23:09:27] <wikibugs>	 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: A statbox to update the WDCM system - https://phabricator.wikimedia.org/T181094#3784690 (10GoranSMilovanovic) 05Open>03Resolved a:03GoranSMilovanovic