[01:32:46] Analytics: Sqoop wbc_entity_usage from all wikis into hadoop (HDFS) - https://phabricator.wikimedia.org/T167290#3714116 (GoranSMilovanovic) @fdans @Addshore We now have a [[ https://phabricator.wikimedia.org/diffusion/AWCM/browse/master/WDCM_Sqoop_Clients.R | WDCM_Sqoop_Clients.R ]] R script that sqoops all...
[03:27:09] (PS1) Milimetric: [WIP] Clean up UI and aggregations [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084)
[07:36:25] !log stop & mask hadoop-httpfs.service on analytics1001 after https://gerrit.wikimedia.org/r/#/c/386684/
[07:36:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:40:18] !log re-run wikidata-articleplaceholder_metrics-wf-2017-10-26
[07:40:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:55:41] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3714753 (Pchelolo) So, we have an event that's > 100 Mb of serialized JSON... Wonderful. Do you think it's possible to dump it somewhere on the filesystem to find out what...
[08:23:12] Thanks elukey for rerunning wikidata job :)
[08:23:22] And all the rest obviously ;)
[08:25:00] joal: o/ - I didn't figure out what went wrong but I am going to do it in a bit, too many things in parallel :)
[08:25:57] no worries elukey - We experience such failures every now and then - I think it may be related to OOM for spark executors, but not sure
[09:18:25] * elukey checks dataloss alarms..
[09:41:27] Quarry: pawikisource_p.page table not available - https://phabricator.wikimedia.org/T179153#3714941 (Ankry)
[10:00:02] joal: do you have 10 mins for a brain bounce?
[10:00:09] even in IRC
[10:04:38] elukey: I do !
[10:04:46] elukey: batcave?
[10:04:52] sure
[10:48:55] joal: used the mighty select_missing_sequence_runs.hql and confirmed that the holes are all around the hour
[10:49:00] so false alarm
[12:05:18] * elukey lunch!
[12:53:29] Analytics-Tech-community-metrics, Developer-Relations (Oct-Dec 2017): Make Qgil a fallback for Bitergia access (lock-in) - https://phabricator.wikimedia.org/T178381#3715414 (Aklapper) Requested #1 and #3 in https://gitlab.com/Bitergia/c/Wikimedia/support/issues/8 #2 requires @Qgil to have an account on...
[13:22:01] Hey elukey - Just reading your email
[13:22:28] elukey: I think the check you've done for webrequest alarms is not correct
[13:22:59] why not?
[13:23:04] elukey: Current code for statistics analysis (and therefore alarms using this dataset) removes 'dt=-' from analysis
[13:23:27] are we super sure about it?
[13:24:01] I mean, it seems strange to me that 1) we don't have any network level issue 2) the dt="-" are all around the hour
[13:24:07] it seems too much of a coincidence
[13:24:18] elukey: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql#L73
[13:25:11] elukey: Marcel's patch was to remove dt=- from dataloss checks, and add another layer of checks for dt=- (see lines 77 -> 88 of the same file above)
[13:25:52] joal: sure, but that means that those records (in our case ~300 per cache segment) are not added in the counts
[13:26:00] correct
[13:26:01] but that doesn't mean the holes will not cause troubles, no?
[13:26:19] ah no right
[13:26:20] mmmm
[13:26:35] elukey: fake holes are due to those dt=- events (misaligned sequence number)
[13:26:51] riiiight I always forget
[13:27:29] Now, I suspect that there are other events, having a timestamp, that have the same behavior as the dt=- ones - meaning a very long life before release
[13:28:32] yes sorry, too many things to do today, I didn't understand completely what you meant during the hangouts
[13:28:56] no worries elukey - just triple checking again, as usual :)
[13:29:17] elukey: I'll take the analysis if you wish :)
[13:29:26] nono I'll do it
[13:29:28] thanks
[13:34:05] so joal, from the script's output, something that I didn't notice were records like the following
[13:34:28] cp4026.ulsfo.wmnet 1358251995 135868776043576620 17-10-26T18:53:4020 17-10-26T18:56:59
[13:34:43] uff
[13:34:57] cp4026.ulsfo.wmnet1358251995 1358687760 43576620 17-10-26T18:53:4020 17-10-26T18:56:59
[13:35:44] hostname missing_start missing_end missing_count dt_before_missing dt_after_missing
[13:36:42] the ones caused by dt: "-" most of the time don't have either the dt_before_missing or the dt_after_missing
[13:37:47] but all around the hour more or less
[13:37:52] so you might very well be right
[13:38:50] theoretically in this case 1358251996 (start_sequence+1) should be in the other hour's bucket if you are right
[13:39:04] with a timestamp
[13:39:52] elukey: I have wonders
[13:40:30] elukey: start/end timestamps are not super-close to hour end
[13:40:30] if you want to tell me that I am dumb you definitely can
[13:40:44] I will not take any offence
[13:40:45] :D
[13:41:13] And also, missing start / end don't match with missing number :/
[13:41:34] It has nothing to do with you being dumb (which I think is actually false) :)
[13:41:41] ahahahah
[13:42:34] Like: the number of misses is 435766, not 43576620 :0
[13:42:38] so 1358687760 - 1358251995 = 435765 != 43576620
[13:42:40] yeah
[13:42:41] makes more sense
[13:43:06] Now, even with that - it's weird
[13:43:16] did marcel add an easter egg to make us crazy once in a while?
[13:43:17] :D
[13:43:47] How the heck are we missing sequence-numbers between 18:53:40 and 18:56:59 ????
[13:43:50] huhuhu
[13:43:58] elukey: That could be marcel style ;)
[13:44:27] elukey: I actually think it's a copy/paste thing: there is a wrong 20 as well at the end of start_timestamp
[13:45:31] yes 17-10-26T18:53:4020 is not a great date
[13:45:50] joal!
[13:45:56] cp4026.ulsfo.wmnet 1358251996 2017-10-26T19:04:22
[13:47:22] so I am going to check others but you seem to be right
[13:47:25] elukey: I'd be interested to know what URL that is :)
[13:47:36] elukey: I guess it would be something long
[13:47:40] select * from webrequest where webrequest_source='upload' and year=2017 and month=10 and day=26 and (hour=18 or hour=17 or hour=19) and sequence='1358251996' and hostname='cp4026.ulsfo.wmnet';
[13:47:52] so we'll not breach confidentiality :P
[13:47:56] elukey: need to go to take Lino from the creche - will be back before standup
[13:48:15] :)
[13:49:01] joal: I'll try to get moar verification and I'll update the other folks
[13:49:04] thanks :)
[13:58:44] heya
[13:59:25] mforns: o/
[14:01:06] elukey, when I try to start a spark-shell in stat1005 it says:
[14:01:07] OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000670e00000, 351272960, 0) failed; error='Cannot allocate memory' (errno=12)
[14:01:13] and exits
[14:01:32] I should do it better in stat1004 no?
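(Illustrative note on the dataloss discussion above: the sketch below shows the general shape of a per-host sequence-gap check over webrequest data. It is not the refinery's actual select_missing_sequence_runs.hql or generate_sequence_statistics.hql; it only assumes the standard wmf.webrequest columns (hostname, sequence, dt) and partition fields, and it excludes the dt='-' records in the same spirit as the exclusion described above.)

    -- Hypothetical sketch: find gaps in per-host sequence numbers for one hour,
    -- excluding the dt='-' records that the statistics job also filters out.
    WITH ordered AS (
      SELECT
        hostname,
        sequence,
        LAG(sequence) OVER (PARTITION BY hostname ORDER BY sequence) AS prev_sequence
      FROM wmf.webrequest
      WHERE webrequest_source = 'upload'
        AND year = 2017 AND month = 10 AND day = 26 AND hour = 18
        AND dt != '-'
    )
    SELECT
      hostname,
      prev_sequence + 1            AS missing_start,
      sequence - 1                 AS missing_end,
      sequence - prev_sequence - 1 AS missing_count
    FROM ordered
    WHERE sequence - prev_sequence > 1
    ORDER BY hostname, prev_sequence;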
[14:01:49] whattttt
[14:02:22] so on stat1005 we have java8 that does not work well with the hadoop cluster
[14:02:29] buuut it shouldn't die in that way :D
[14:07:55] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:20] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:29] Analytics, Data-Services, Research: Create a database on the wikireplica servers called "datasets_p" - https://phabricator.wikimedia.org/T173513#3530910 (Halfak)
[14:08:39] (CR) Fdans: "Code and tests look good in general! Just a couple comments :)" (4 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[14:22:45] Back !
[14:25:14] mforns: Error: Could not run command from prerun_command: Cannot allocate memory - fork(2) when I tried to execute puppet on stat1005
[14:25:18] something is wrong :(
[14:25:22] joal: https://phabricator.wikimedia.org/P6203
[14:25:28] also tried with one in text
[14:25:38] hm
[14:25:54] mforns: lemme check what is hammering the box
[14:26:08] elukey@stat1005:~$ free -m total used free shared buff/cache available
[14:26:11] Mem: 64337 62939 515 675 881 267
[14:26:32] bearloga: hello :D --^
[14:27:30] R seems to be consuming a ton of memory, any chance that you could reduce the load?
[14:27:50] stat1005 is basically unusable right now
[14:28:00] someone is hogging most of the memory
[14:28:01] yeah
[14:28:09] hoo: just posted --^
[14:28:41] I'd prefer not to kill his processes
[14:28:49] but I am afraid that the oom will do it for me soon
[14:29:08] Is there another host that can be used for hadoop queries?
[14:29:17] stat1004
[14:30:19] aha… here we go :)
[14:42:29] Hey folks. I'm wondering where the data that was on /srv for stat1003 got moved to. Specifically, I'm looking for /srv/halfak.
[14:42:58] halfak: it's not on stat1006?
[14:43:12] Not in that location
[14:43:25] halfak: /srv/home/halfak?
[14:43:33] milimetric, na. That's my home dir.
[14:43:37] hm...
[14:43:44] I used /srv on the old machine for lots of big stuff.
[14:43:55] yeah, I remember ottomata did something...
[14:43:55] hm
[14:44:45] I'm very happy that the home dirs are now just on /srv though :) No more need for symlinking around to /srv for big archives.
[14:46:29] I donno, halfak, not sure
[14:46:39] couldn't find any docs / emails about it
[14:46:48] Hmm... OK. I imagine that stuff would have been backed up! There was a lot there :|
[14:46:57] everything should be in /srv theoretically, andrew rsynced it all
[14:48:22] hmmm.. There used to be a ton of folders there. They're all gone now. Like, a bunch of users created /srv/ or /srv/ directories.
[14:48:39] * halfak digs around some of the dirs in /srv
[14:48:48] elukey: something's weird about 1006... a bunch of stuff is synced to both /srv and / (root)
[14:48:59] like /published-datasets shouldn't exist, it should just be /srv/published-datasets
[14:49:21] and yeah, I agree with halfak, the stuff is gone, and I feel like I remember Andrew doing something with it
[14:49:46] ah yeah maybe a mistake in copying
[14:51:12] Should I drop an email to analytics@ so we can continue the conversation there?
[14:51:17] Or maybe a task would be better.
[14:52:17] halfak: did you look around your home directory, maybe he merged them? I see that your home has at least 197 GB in it, so maybe it's there?
[14:53:09] * joal thinks 200Gb is not that big in halfak's mind :)
[14:53:18] heh, true
[14:56:30] lol
[14:57:40] halfak: task would be better than e-mail
[14:57:40] Yeah, I'm pretty sure they aren't there. Some symlinks -- e.g. to /srv/halfak/data from /home/halfak/data are still there -- just broken.
[14:57:45] OK.
[14:58:20] bearloga: ping us when you are in today
[15:00:44] https://phabricator.wikimedia.org/T179189
[15:00:48] Analytics: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3716130 (Halfak)
[15:00:54] I sub'd ottomata
[15:01:08] ping milimetric
[15:02:13] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3716143 (Nuria) Resolved>Open
[15:11:00] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3716186 (Nuria) Open>Resolved
[15:15:30] nuria_: I’m off today. What’s up?
[15:16:12] bearloga: your jobs in stat1005 are about to be killed by oom management
[15:16:30] bearloga: because they are consuming too much memory
[15:16:42] bearloga: and the machine is not usable
[15:16:55] bearloga: can we talk about whether we can move those jobs to hadoop?
[15:17:38] nuria_: crap crackers! I’m sorry :( wait, that is a Hadoop job I’m running
[15:18:06] It’s a hive query running in R
[15:18:27] bearloga: looks like you get a lot of data back maybe?
[15:18:43] I shouldn’t be…lemme check
[15:18:51] bearloga: no hive query currently running - maybe postprocessing?
[15:19:18] bearloga: mmm no wonder it is confusing
[15:19:37] bearloga: you would not expect a hive query to eat local cpu?
[15:20:01] bearloga: ah sorry missed joal's post
[15:21:10] Oh the thing I thought it was wasn’t the thing
[15:21:33] bearloga: that thing issue is a well known thing to me :)
[15:22:48] You can go ahead and kill it. Something probably went wrong
[15:22:57] elukey: --^
[15:23:46] We don’t have this problem usually and have been using nice and ionice on all our daily metric calculation scripts, including the nice hive queue
[15:24:05] So I’m guessing it’s just a hiccup
[15:24:20] And I’m sorry
[15:24:30] yeah bearloga - We were saying at standup that you guys are good citizens of our platforms :)
[15:25:01] So no big deal, thanks for answering on a day off bearloga :)
[15:25:49] bearloga: no big deal
[15:30:52] Analytics, Pageviews-API: Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3716266 (MusikAnimal)
[15:35:09] (PS1) Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[15:44:06] so stat1005 is breaking again
[15:45:16] multiple people doing jobs this time
[15:45:44] :(
[15:53:17] elukey@stat1005:~$ free -m total used free shared buff/cache available
[15:53:20] Mem: 64337 40615 22541 675 1180 22506
[15:53:39] now it looks better, not sure for how long though.. we'll need some sort of user limits :(
[16:04:42] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716362 (mforns) Ping?
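(Side note on the nice/ionice point above: a minimal sketch of routing a heavy ad-hoc Hive query to a lower-priority YARN queue so it competes less with production jobs. The queue name "nice" is only taken from the mention above and is otherwise an assumption, as is the example query itself; the property shown applies to the MapReduce execution engine.)

    -- Assumption: a low-priority YARN queue named "nice" exists, as mentioned above.
    SET mapreduce.job.queuename=nice;

    -- Hypothetical example of a heavy aggregation submitted through that queue.
    SELECT hostname, COUNT(*) AS requests
    FROM wmf.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2017 AND month = 10 AND day = 26 AND hour = 18
    GROUP BY hostname
    ORDER BY requests DESC
    LIMIT 20;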
[16:07:26] Thanks a lot elukey for managing this
[16:10:15] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716364 (elukey)
[16:14:15] Analytics, Operations, ops-eqiad, User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192#3716380 (elukey)
[16:15:22] super good that ops deployed these alarms --^
[16:16:09] just had a very interesting chat with geh*el about GCs, and he pointed out http://www.gceasy.io/index.jsp and also JClarity
[16:48:59] going offline people!
[16:49:02] have a good weekend!
[16:49:09] * elukey off!
[16:56:27] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716442 (pmiazga) I see two solutions: a) Can we just drop the `skin` property and `isMobile` instead? we will lose the hist...
[17:24:36] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (Halfak) Please do not wipe these disks until {T179189} is resolved.
[17:25:50] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716518 (Ottomata) @Cmjohnson it is looking like there were some files that @halfak needs that I did not fully sync over from stat1003 to stat1006. I'm going to try to power stat1003...
[17:42:16] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3716559 (Cmjohnson) @ottomata okay please update task when it's okay to wipe. Thanks
[17:45:45] Analytics-Kanban, EventBus, Services (next): Malformed HTTP message in EventBus logs - https://phabricator.wikimedia.org/T178983#3716575 (Ottomata) Is it possible to get this from MediaWiki instead? If EventBus returns a 400, MediaWiki logs the error somewhere, right? Does it possible send the even...
[17:58:14] mforns: i see you pushed to gerrit for druid spark stuff
[17:58:18] ready for review?
[17:58:39] ottomata, yes, only DataFrameToDruid though, I'm working on the EventLoggingToDruid job now
[17:59:41] ok cool
[17:59:43] then I'll write some tests
[18:25:01] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716648 (pmiazga) I spoke with @ovasileva and we will go with the second approach, limit skins only to `vector|minerva|other`...
[18:43:40] (CR) Ottomata: "Some nits, but looks really nice." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)
[18:43:46] thx ottomata
[18:47:39] (CR) Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)
[19:44:00] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716857 (pmiazga) @mforns - we changed the values stored in skin property. now it will be only `vector`, `minerva` or `other`...
[20:15:37] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3716885 (mforns) Thanks @pmiazga for pushing this forward. In T175395#3665372 I suggested that we could bucketize skin and h...
[20:16:00] making friends
[20:25:50] ottomata, I'm writing a comment on your review, but it's getting long, if you're still around, maybe we can batcave?
[20:26:05] sure
[20:26:07] k!
[20:44:12] bye! have a nice weekend all!
[21:46:45] (CR) Milimetric: [WIP] Clean up UI and aggregations (3 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[21:46:59] (PS2) Milimetric: [WIP] Clean up UI and aggregations [analytics/wikistats2] - https://gerrit.wikimedia.org/r/386774 (https://phabricator.wikimedia.org/T178084)
[23:10:34] Analytics, Proton, Readers-Web-Backlog, Patch-For-Review, Readers-Web-Kanban-Board: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3717090 (Jdlrobson) So as I've pointed out the webHost is flawed as the skin can be changed on desktop. I'm still not sure i...