[01:32:46] Analytics: Sqoop wbc_entity_usage from all wikis into hadoop (HDFS) - https://phabricator.wikimedia.org/T167290#3714116 (GoranSMilovanovic) @fdans @Addshore We now have a [[ https://phabricator.wikimedia.org/diffusion/AWCM/browse/master/WDCM_Sqoop_Clients.R | WDCM_Sqoop_Clients.R ]] R script that sqoops all...
[03:27:09] (PS1) Milimetric: [WIP] Clean up UI and aggregations [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084)
[07:36:25] !log stop & mask hadoop-httpfs.service on analytics1001 after https://gerrit.wikimedia.org/r/#/c/386684/
[07:36:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:40:18] !log re-run wikidata-articleplaceholder_metrics-wf-2017-10-26
[07:40:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:55:41] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3714753 (Pchelolo) So, we have an event that's > 100 Mb of serialized JSON... Wonderful. Do you think it's possible to dump it somewhere on the filesystem to find out what...
[08:23:12] Thanks elukey for rerunning wikidata job :)
[08:23:22] And all the rest obviously ;)
[08:25:00] joal: o/ - I didn't figure out what went wrong but I am going to do it in a bit, too many things in parallel :)
[08:25:57] no worries elukey - We experience such failures every now and then - I think it may be related to OOM for spark executors, but not sure
[09:18:25] * elukey checks dataloss alarms..
[09:41:27] Quarry: pawikisource_p.page table not available - https://phabricator.wikimedia.org/T179153#3714941 (Ankry)
[10:00:02] joal: do you have 10 mins for a brain bounce?
[10:00:09] even in IRC
[10:04:38] elukey: I do !
[10:04:46] elukey: batcave?
[10:04:52] sure
[10:48:55] joal: used the mighty select_missing_sequence_runs.hql and confirmed that the holes are all around the hour
[10:49:00] so false alarm
[12:05:18] * elukey lunch!
[12:53:29] Analytics-Tech-community-metrics, Developer-Relations (Oct-Dec 2017): Make Qgil a fallback for Bitergia access (lock-in) - https://phabricator.wikimedia.org/T178381#3715414 (Aklapper) Requested #1 and #3 in https://gitlab.com/Bitergia/c/Wikimedia/support/issues/8 #2 requires @Qgil to have an account on...
[13:22:01] Hey elukey - Just reading your email
[13:22:28] elukey: I think the check you've done for webrequest alarms is not correct
[13:22:59] why not?
[13:23:04] elukey: Current code for statistics analysis (and therefore alarms using this dataset) removes 'dt=-' from analysis
[13:23:27] are we super sure about it?
[13:24:01] I mean, it seems strange to me that 1) we don't have any network level issue 2) the dt="-" are all around the hour
[13:24:07] it seems too much of a coincidence
[13:24:18] elukey: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql#L73
[13:25:11] elukey: Marcel's patch was to remove dt=- from dataloss checks, and add another layer of checks for dt=- (see lines 77 -> 88 of the same file above)
[13:25:52] joal: sure, but that means that those records (in our case ~300 per cache segment) are not added in the counts
[13:26:00] correct
[13:26:01] but that doesn't mean the holes will not cause troubles, no?
[13:26:19] ah no right
[13:26:20] mmmm
[13:26:35] elukey: fake holes are due to those dt=- events (misaligned sequence number)
[13:26:51] riiiight I always forget
[13:27:29] Now, I suspect that there are other events, having a timestamp, that have the same behavior as the dt=- ones - meaning a very long life before release
[13:28:32] yes sorry, too many things to do today, I didn't understand completely what you meant during the hangouts
[13:28:56] no worries elukey - just triple checking again, as usual :)
[13:29:17] elukey: I'll take the analysis if you wish :)
[13:29:26] nono I'll do it
[13:29:28] thanks
[13:34:05] so joal, from the script's output, something that I didn't notice were records like the following
[13:34:28] cp4026.ulsfo.wmnet 1358251995 135868776043576620 17-10-26T18:53:4020 17-10-26T18:56:59
[13:34:43] uff
[13:34:57] cp4026.ulsfo.wmnet1358251995 1358687760 43576620 17-10-26T18:53:4020 17-10-26T18:56:59
[13:35:44] hostname missing_start missing_end missing_count dt_before_missing dt_after_missing
[13:36:42] the ones caused by dt: "-" most of the time don't have either the dt_before_missing or the dt_after_missing
[13:37:47] but all around the hour more or less
[13:37:52] so you might very well be right
[13:38:50] theoretically in this case 1358251996 (start_sequence+1) should be in the other hour's bucket if you are right
[13:39:04] with a timestamp
[13:39:52] elukey: I have wonders
[13:40:30] elukey: start/end timestamps are not super-close to hour end
[13:40:30] if you want to tell me that I am dumb you definitely can
[13:40:44] I will not take any offence
[13:40:45] :D
[13:41:13] And also, missing start / end don't match with missing number :/
[13:41:34] It has nothing to do with you being dumb (which I think is actually false) :)
[13:41:41] ahahahah
[13:42:34] Like: the number of misses is 435766, not 43576620 :0
[13:42:38] so 1358687760 - 1358251995 = 435765 != 43576620
[13:42:40] yeah
[13:42:41] makes more sense
[13:43:06] Now, even with that - it's weird
[13:43:16] did marcel add an easter egg to make us crazy once in a while?
[13:43:17] :D
[13:43:47] How the heck are we missing sequence-numbers between 18:53:40 and 18:56:59 ????
[13:43:50] huhuhu
[13:43:58] elukey: That could be marcel style ;)
[13:44:27] elukey: I actually think it's a copy/paste thing: there is a wrong 20 as well at the end of start_timestamp
[13:45:31] yes 17-10-26T18:53:4020 is not a great date
[13:45:50] joal!
[13:45:56] cp4026.ulsfo.wmnet 1358251996 2017-10-26T19:04:22
[13:47:22] so I am going to check others but you seem to be right
[13:47:25] elukey: I'd be interested to know what URL that is :)
[13:47:36] elukey: I guess it would be something long
[13:47:40] select * from webrequest where webrequest_source='upload' and year=2017 and month=10 and day=26 and (hour=18 or hour=17 or hour=19) and sequence='1358251996' and hostname='cp4026.ulsfo.wmnet';
[13:47:52] so we'll not breach confidentiality :P
[13:47:56] elukey: need to go to take Lino from the creche - will be back before standup
[13:48:15] :)
[13:49:01] joal: I'll try to get moar verification and I'll update the other folks
[13:49:04] thanks :)
[13:58:44] heya
[13:59:25] mforns: o/
[14:01:06] elukey, when I try to start a spark-shell in stat1005 it says:
[14:01:07] OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000670e00000, 351272960, 0) failed; error='Cannot allocate memory' (errno=12)
[14:01:13] and exits
[14:01:32] I should do it better in stat1004 no?
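(Illustrative note on the dataloss discussion above: the sketch below shows the general shape of a per-host sequence-gap check over webrequest data. It is not the refinery's actual select_missing_sequence_runs.hql or generate_sequence_statistics.hql; it only assumes the standard wmf.webrequest columns (hostname, sequence, dt) and partition fields, and it excludes the dt='-' records in the same spirit as the exclusion described above.)

    -- Hypothetical sketch: find gaps in per-host sequence numbers for one hour,
    -- excluding the dt='-' records that the statistics job also filters out.
    WITH ordered AS (
      SELECT
        hostname,
        sequence,
        LAG(sequence) OVER (PARTITION BY hostname ORDER BY sequence) AS prev_sequence
      FROM wmf.webrequest
      WHERE webrequest_source = 'upload'
        AND year = 2017 AND month = 10 AND day = 26 AND hour = 18
        AND dt != '-'
    )
    SELECT
      hostname,
      prev_sequence + 1            AS missing_start,
      sequence - 1                 AS missing_end,
      sequence - prev_sequence - 1 AS missing_count
    FROM ordered
    WHERE sequence - prev_sequence > 1
    ORDER BY hostname, prev_sequence;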
[14:01:49] whattttt
[14:02:22] so on stat1005 we have java8 that does not work well with the hadoop cluster
[14:02:29] buuut it shouldn't die in that way :D
[14:07:55] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:20] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:29] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:39] (CR) Fdans: "Code and tests look good in general! Just a couple comments :)" (4 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[14:22:45] Back !
[14:25:14] mforns: Error: Could not run command from prerun_command: Cannot allocate memory - fork(2) when I tried to execute puppet on stat1005
[14:25:18] something is wrong :(
[14:25:22] joal: https://phabricator.wikimedia.org/P6203
[14:25:28] also tried with one in text
[14:25:38] hm
[14:25:54] mforns: lemme check what is hammering the box
[14:26:08] elukey@stat1005:~$ free -m total used free shared buff/cache available
[14:26:11] Mem: 64337 62939 515 675 881 267
[14:26:32] bearloga: hello :D --^
[14:27:30] R seems to be consuming a ton of memory, any chance that you could reduce the load?
[14:27:50] stat1005 is basically unusable right now
[14:28:00] someone is hogging most of the memory
[14:28:01] yeah
[14:28:09] hoo: just posted --^
[14:28:41] I'd prefer not to kill his processes
[14:28:49] but I am afraid that the oom will do it for me soon
[14:29:08] Is there another host that can be used for hadoop queries?
[14:29:17] stat1004
[14:30:19] aha… here we go :)
[14:42:29] Hey folks. I'm wondering where the data that was on /srv for stat1003 got moved to. Specifically, I'm looking for /srv/halfak.
[14:42:58] halfak: it's not on stat1006?
[14:43:12] Not in that location
[14:43:25] halfak: /srv/home/halfak?
[14:43:33] milimetric, na. That's my home dir.
[14:43:37] hm...
[14:43:44] I used /srv on the old machine for lots of big stuff.
[14:43:55] yeah, I remember ottomata did something...
[14:43:55] hm
[14:44:45] I'm very happy that the home dirs are now just on /srv though :) No more need for symlinking around to /srv for big archives.
[14:46:29] I donno, halfak, not sure
[14:46:39] couldn't find any docs / emails about it
[14:46:48] Hmm... OK. I imagine that stuff would have been backed up! There was a lot there :|
[14:46:57] everything should be in /srv theoretically, andrew rsynced it all
[14:48:22] hmmm.. There used to be a ton of folders there. They're all gone now. Like, a bunch of users created /srv/ or /srv/ directories.
[14:48:39] * halfak digs around some of the dirs in /srv
[14:48:48] elukey: something's weird about 1006... a bunch of stuff is synced to both /srv and / (root)
[14:48:59] like /published-datasets shouldn't exist, it should just be /srv/published-datasets
[14:49:21] and yeah, I agree with halfak, the stuff is gone, and I feel like I remember Andrew doing something with it
[14:49:46] ah yeah maybe a mistake in copying
[14:51:12] Should I drop an email to analytics@ so we can continue the conversation there?
[14:51:17] Or maybe a task would be better.
[14:52:17] halfak: did you look around your home directory, maybe he merged them? I see that your home has at least 197 GB in it, so maybe it's there?
[14:53:09] * joal thinks 200Gb is not that big in halfak's mind :)
[14:53:18] heh, true
[14:56:30] lol
[14:57:40] halfak: task would be better than e-mail
[14:57:40] Yeah, I'm pretty sure they aren't there. Some symlinks -- e.g. to /srv/halfak/data from /home/halfak/data are still there -- just broken.
[14:57:45] OK.
[14:58:20] bearloga: ping us when you are in today
[15:00:44] https://phabricator.wikimedia.org/T179189
[15:00:48] Analytics: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3716130 (Halfak)
[15:00:54] I sub'd ottomata
[15:01:08] ping milimetric
[15:02:13] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3716143 (Nuria) Resolved>Open
[15:11:00] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3716186 (Nuria) Open>Resolved
[15:15:30] nuria_: I’m off today. What’s up?
[15:16:12] bearloga: your jobs in stat1005 are about to be killed by oom management
[15:16:30] bearloga: because they are consuming too much memory
[15:16:42] bearloga: and the machine is not usable
[15:16:55] bearloga: can we talk about whether we can move those jobs to hadoop?
[15:17:38] nuria_: crap crackers! I’m sorry :( wait, that is a Hadoop job I’m running
[15:18:06] It’s a hive query running in R
[15:18:27] bearloga: looks like you get a lot of data back maybe?
[15:18:43] I shouldn’t be…lemme check
[15:18:51] bearloga: no hive query currently running - maybe postprocessing?
[15:19:18] bearloga: mmm no wonder it is confusing
[15:19:37] bearloga: you would not expect a hive query to eat local cpu?
[15:20:01] bearloga: ah sorry missed joal's post
[15:21:10] Oh the thing I thought it was wasn’t the thing
[15:21:33] bearloga: that thing issue is a well known thing to me :)
[15:22:48] You can go ahead and kill it. Something probably went wrong
[15:22:57] elukey: --^
[15:23:46] We don’t have this problem usually and have been using nice and ionice on all our daily metric calculation scripts, including the nice hive queue
[15:24:05] So I’m guessing it’s just a hiccup
[15:24:20] And I’m sorry
[15:24:30] yeah bearloga - We were saying at standup that you guys are good citizens of our platforms :)
[15:25:01] So no big deal, thanks for answering on a day off bearloga :)
[15:25:49] bearloga: no big deal
[15:30:52] Analytics, Pageviews-API: Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3716266 (MusikAnimal)
[15:35:09] (PS1) Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[15:44:06] so stat1005 is breaking again
[15:45:16] multiple people doing jobs this time
[15:45:44] :(
[15:53:17] elukey@stat1005:~$ free -m total used free shared buff/cache available
[15:53:20] Mem: 64337 40615 22541 675 1180 22506
[15:53:39] now it looks better, not sure for how long though.. we'll need some sort of user limits :(
[16:04:42] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716362 (mforns) Ping?
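(Side note on the nice/ionice point above: a minimal sketch of routing a heavy ad-hoc Hive query to a lower-priority YARN queue so it competes less with production jobs. The queue name "nice" is only taken from the mention above and is otherwise an assumption, as is the example query itself; the property shown applies to the MapReduce execution engine.)

    -- Assumption: a low-priority YARN queue named "nice" exists, as mentioned above.
    SET mapreduce.job.queuename=nice;

    -- Hypothetical example of a heavy aggregation submitted through that queue.
    SELECT hostname, COUNT(*) AS requests
    FROM wmf.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2017 AND month = 10 AND day = 26 AND hour = 18
    GROUP BY hostname
    ORDER BY requests DESC
    LIMIT 20;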
[16:07:26] Thanks a lot elukey for managing this
[16:10:15] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716364 (elukey)
[16:14:15] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716380 (elukey)
[16:15:22] super good that ops deployed these alarms --^
[16:16:09] just had a very interesting chat with geh*el about GCs, and he pointed out http://www.gceasy.io/index.jsp and also JClarity
[16:48:59] going offline people!
[16:49:02] have a good weekend!
[16:49:09] * elukey off!
[16:56:27] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716442 (pmiazga) I see two solutions: a) Can we just drop the `skin` property and `isMobile` instead? we will lose the hist...
[17:24:36] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (Halfak) Please do not wipe these disks until {T179189} is resolved.
[17:25:50] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716518 (Ottomata) @Cmjohnson it is looking like there were some files that @halfak needs that I did not fully sync over from stat1003 to stat1006. I'm going to try to power stat1003...
[17:42:16] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716559 (Cmjohnson) @ottomata okay please update task when it's okay to wipe. Thanks
[17:45:45] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3716575 (Ottomata) Is it possible to get this from MediaWiki instead? If EventBus returns a 400, MediaWiki logs the error somewhere, right? Does it possible send the even...
[17:58:14] mforns: i see you pushed to gerrit for druid spark stuff
[17:58:18] ready for review?
[17:58:39] ottomata, yes, only DataFrameToDruid though, I'm working on the EventLoggingToDruid job now
[17:59:41] ok cool
[17:59:43] then I'll write some tests
[18:25:01] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716648 (pmiazga) I spoke with @ovasileva and we will go with the second approach, limit skins only to `vector|minerva|other`...
[18:43:40] (CR) Ottomata: "Some nits, but looks really nice." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)
[18:43:46] thx ottomata
[18:47:39] (CR) Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)
[19:44:00] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716857 (pmiazga) @mforns - we changed the values stored in skin property. now it will be only `vector`, `minerva` or `other`...
[20:15:37] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716885 (mforns) Thanks @pmiazga for pushing this forward. In T175395#3665372 I suggested that we could bucketize skin and h...
[20:16:00] making friends
[20:25:50] ottomata, I'm writing a comment on your review, but it's getting long, if you're still around, maybe we can batcave?
[20:26:05] sure
[20:26:07] k!
[20:44:12] bye! have a nice weekend all!
[21:46:45] (CR) Milimetric: [WIP] Clean up UI and aggregations (3 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[21:46:59] (PS2) Milimetric: [WIP] Clean up UI and aggregations [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084)
[23:10:34] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3717090 (Jdlrobson) So as I've pointed out the webHost is flawed as the skin can be changed on desktop. I'm still not sure i...