[00:15:23] (PS1) Nuria: Correcting possible null pointer exception [analytics/refinery/source] - https://gerrit.wikimedia.org/r/367832 (https://phabricator.wikimedia.org/T164021)
[00:42:17] Analytics-Kanban: Oliver Keyes analytics cluster access to check on some old data - https://phabricator.wikimedia.org/T171696#3473360 (Ottomata)
[10:18:20] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (2/4) - Wiki selector - https://phabricator.wikimedia.org/T170936#3474152 (fdans)
[10:21:08] mforns: we have our first CR in differential yay!
[10:21:11] https://phabricator.wikimedia.org/D726
[11:34:59] * fdans lunch!
[11:38:07] * elukey lunch!
[12:28:49] elukey: I win
[12:57:47] elukey: yesterday, after copying most of the data I needed from /a on stat1002 -> stat1005
[12:57:59] /a started throwing io errors
[12:58:07] i tried to umount it
[12:58:11] (it is umounted?)
[12:58:17] but that hung
[12:58:22] and i can't kill that process
[12:58:32] nobody is currently on stat1002, and most crons have been disabled
[12:58:36] i'm going to reboot it
[12:58:39] or at least try :)
[12:58:54] ottomata: o/
[12:59:17] I saw the disk alert ping but then it auto-recovered, I thought it was a temporary weirdness
[12:59:39] ya its very strange, i think it is umounted, but syslog/dmesg are piling up with
[12:59:44] XFS (dm-0): xfs_log_force: error 5 returned.
[13:00:09] ls: cannot access /a: Input/output error
[13:00:25] first error is
[13:00:25] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1999 of file /build/linux-SWvZwa/linux-3.13.0/fs/xfs/xfs_log.c.
Return address = 0xffffffffa012aae8
[13:00:25] 17:55:32
[13:00:46] what's strange is, other partitions on the same device are fine
[13:00:49] so probably not a hw issue
[13:02:20] mmm there has been an issue with XFS on dataset1001 a while ago
[13:02:25] with a similar problem
[13:02:33] let me pull up the task
[13:04:35] oo k
[13:05:14] https://phabricator.wikimedia.org/T169680
[13:05:24] not sure if the same but we can grab some info
[13:07:01] hm yeah
[13:07:07] seems maybe different
[13:07:20] fusermount doesn't help at least
[13:07:41] checking stat
[13:08:04] ok, lemme know, i'm about to silence icinga and reboot
[13:08:32] gimme 10 mins :)
[13:09:24] k
[13:16:15] this is slabtop
[13:16:17] Active / Total Objects (% used) : 4482859 / 4702133 (95.3%)
[13:16:17] Active / Total Slabs (% used) : 115968 / 115968 (100.0%)
[13:16:17] Active / Total Caches (% used) : 78 / 111 (70.3%)
[13:16:17] Active / Total Size (% used) : 1781576.71K / 1822397.33K (97.8%)
[13:16:20] Minimum / Average / Maximum Object : 0.01K / 0.39K / 8.00K
[13:21:50] ottomata: rebooting
[13:22:12] ok elukey
[13:28:55] (CR) Mforns: [C: 2] Correcting possible null pointer exception [analytics/refinery/source] - https://gerrit.wikimedia.org/r/367832 (https://phabricator.wikimedia.org/T164021) (owner: Nuria)
[13:29:17] (CR) Mforns: [V: 2 C: 2] Correcting possible null pointer exception [analytics/refinery/source] - https://gerrit.wikimedia.org/r/367832 (https://phabricator.wikimedia.org/T164021) (owner: Nuria)
[13:35:06] stat1002 is up
[13:35:30] /sys/kernel/debug/extfrag/unusable_index is way better now
[13:41:20] before rebooting I saved slabtop's output
[13:41:22] 1710618 1688764 98% 0.19K 40732 42 325856K dentry
[13:41:23] 1169264 1169264 100% 0.96K 35463 33 1134816K ext4_inode_cache
[13:41:41] these were the first offenders, the column before the last one is the size
[13:42:22] now those are few MBs
[13:42:22] elukey: that makes sense then
[13:42:27] lots of inodes?
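Editor's note: the offending caches above (dentry and ext4_inode_cache) were found by eyeballing saved `slabtop` output. A minimal, hypothetical sketch of doing the same with a one-liner, using the two sample lines quoted in the log (column 7, e.g. `325856K`, is the cache size):

```shell
# Save a captured sample of `slabtop -o` object lines (the two offenders above).
cat > /tmp/slabtop.sample <<'EOF'
1710618 1688764  98%    0.19K  40732       42    325856K dentry
1169264 1169264 100%    0.96K  35463       33   1134816K ext4_inode_cache
EOF
# Print "<size-in-K> <cache-name>", largest cache first.
awk '{size = $7; sub(/K$/, "", size); print size, $8}' /tmp/slabtop.sample | sort -rn
```

On a live host the same pipeline could read `slabtop -o -s c` (one-shot output sorted by cache size) directly; the captured sample just keeps the sketch self-contained.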
[13:42:47] ext4 though?
[13:42:47] hm
[13:43:11] elukey: the weird thing i noticed is that akrausetud had a LOT of files
[13:43:22] /a looks better
[13:43:40] elukey: i'm going to attempt to copy akrausetud files to stat1005 again...
[13:43:47] kok
[14:05:57] ottomata: I have officially managed to make a consumer/producer work with acls and ssl
[14:06:06] but the * wildcards are weird
[14:06:11] at least in this version
[14:07:39] also http://ranger.apache.org/ might be great to test
[14:08:44] elukey: sounds like https://sentry.apache.org/
[14:09:10] oo but ranger does more
[14:09:13] including kafka
[14:09:13] hm
[14:09:14] !
[14:09:26] yeah!
[14:09:42] but for sentry it should be a matter of creating the plugin no?
[14:09:47] seems very similar to ranger
[14:33:45] mornin' ottomata
[14:34:33] ottomata: Nithum can help with testing AMD GPU. I added you to a thread with him. please take it from here. :)
[14:59:13] Analytics-Kanban: Oliver Keyes analytics cluster access to check on some old data - https://phabricator.wikimedia.org/T171696#3474969 (Nuria) Aeryn verified his nda is current, @DarTar checking other paperwork
[15:00:34] milimetric: standdupp
[15:08:48] Analytics-Dashiki, Analytics-Kanban, MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikimedia-log-errors: Warning: JsonConfig: Invalid $wgJsonConfigModels['JsonConfig.Dashiki'] array value, 'class' not found - https://phabricator.wikimedia.org/T166335#3474989 (1...
[15:18:04] Analytics-Kanban, Analytics-Wikistats, Continuous-Integration-Config, Release-Engineering-Team (Kanban): Set up continuous integration for wikistats 2.0 UI - https://phabricator.wikimedia.org/T170458#3475013 (Milimetric) Just to keep the archives happy, yes, we're using Differential for code review.
[15:33:44] cool leila!
[15:55:06] nuria_, do we want to merge this before cluster deploy, or not yet?
https://gerrit.wikimedia.org/r/#/c/362159/
[15:56:12] mforns: noo
[15:56:17] ok ok
[15:56:25] mforns: as in it will, ahem, break it all
[15:56:34] k :]
[15:56:55] mforns: haha, see notes of how to deploy that change: https://phabricator.wikimedia.org/T168874
[15:57:06] mforns: i think it can wait for joseph
[15:57:17] mmmmmmmm ok :]
[15:57:28] mforns: i'd rather move the tagging/splitting changes he has in wip before deploying that one
[15:57:53] k
[16:35:41] o/
[16:35:48] I'm looking at stat1005
[16:36:01] And trying to figure out what, exactly, is in my home directory
[16:36:14] It looks like maybe this was old stuff from stat1002
[16:37:34] Is it a network mount?
[16:37:53] "/dev/mapper/stat1005--vg-data"
[16:38:36] halfak: yep it should be your home from stat1002
[16:38:46] nope it is not
[16:38:52] (a network mount)
[16:39:01] we just rsynced from stat1002 to stat1005
[16:39:15] stat1006 will be the same for stat1003
[16:39:34] (meanwhile stat1004 will not be replaced since it is fresh enough)
[16:40:11] Oh!
[16:40:18] So stat1005 will have the GPU
[16:40:26] I see. Thank you!
[16:40:31] When was the rsync?
[16:40:37] And could I have it re-run?
[16:40:45] I'd like to make a clean break :)
[16:40:54] But I didn't know and did some other stuff
[16:40:58] sure, I think that we just need to inform ottomata, IIRC he was planning to re-run it
[16:41:08] (if necessary)
[16:41:18] Gotcha. Maybe I'll just see if I can work with what's on stat1006 in the meantime
[16:41:22] Or is that not advised?
[16:41:35] * halfak wants to run a big processing job with the 40 cores :DDD
[16:41:43] hahahaha
[16:42:51] so stat1006 should contain your home dir from stat1003, so as long as you are comfortable you can definitely run your things.. please be gentle and be aware of the fact that we installed it with stretch and there might still be something not working properly :)
[16:43:19] elukey, OK will consider that. :)
[16:44:17] halfak: we can rerun!
[16:44:34] i just reran the /home rsync a few hours ago
[16:44:42] \o/
[16:44:58] ottomata, perfect!
[16:44:58] halfak: i'm still syncing stuff from /a/{use}
[16:44:59] user
[16:45:02] e.g. /a/halfak
[16:45:11] ottomata: I was about to run rsync -av stat1002.eqiad.wmnet::home/halfak/ ./ on stat1005 (for his home's dir)
[16:45:16] i'll put that into your home dir on stat1002 under /home/$USER/stat1002-a or something
[16:45:16] /a is teh lame
[16:45:18] and send email when it is time
[16:45:24] I'm all about /srv on stat1003 ;)
[16:45:30] elukey: feel free!
[16:45:34] doesn't hurt
[16:45:45] halfak: is it ok if I sync your home now from stat1002 to stat1005?
[16:46:32] elukey, sure, but the old sync should be good. I don't touch stat1002 much
[16:46:40] ah super then
[16:50:18] (PS1) Mforns: Bump changelog version to 0.0.49 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/367916
[16:51:36] (CR) Mforns: [V: 2 C: 2] "Self-merging for deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/367916 (owner: Mforns)
[16:54:29] ottomata: after we deploy .. what is the best way to restart the druid indexing jobs for mediawiki history, do i restart them as hdfs user?
[16:55:23] nuria_: yes, as hdfs user. I think the sub jobs that are launched run as druid
[16:55:35] ottomata: ok
[17:01:27] Analytics-Kanban: Add tagging to webrequest refine process - https://phabricator.wikimedia.org/T171760#3475367 (Nuria)
[17:01:49] ottomata: once we deploy the new stuff to the cluster i am going to add a new column to webrequest
[17:01:58] k!
[17:02:14] cc mforns
[17:02:48] ottomata: i tested selects on my db
[17:04:56] ottomata: let me know if this looks good
[17:05:03] https://www.irccloud.com/pastebin/81wXMur0/
[17:06:40] nuria_, k, I guess, after deploy of source and refinery, then we add the new column, and then we restart the webrequest bundle no?
[17:06:48] nuria_: +1
[17:07:30] mforns: mmmm, do we need to restart the bundle ottomata?
[17:07:51] nuria_: if you want it to pick up your newly deployed code, ya
[17:08:05] ottomata: there are no code changes on refinery yet
[17:08:06] we start jobs with specific refinery version numbers
[17:08:19] ottomata: just a column addition to table
[17:08:26] does something populate the column?
[17:08:31] ottomata: not yet
[17:08:37] ottomata: that is my next set of changes
[17:08:41] ah ok
[17:08:51] nuria_, ok, so the alter table is still not going to change anything, because the tagger UDF is not used yet
[17:09:13] mforns: right, refine code lists all columns though
[17:09:13] ah ok, so it adds column but column default should be null
[17:09:16] i think that is fine
[17:09:30] ok, so we could actually postpone the alter table until we deploy the use of the tagging UDF no?
[17:11:30] mforns: we can, not sure what is easier, maybe that is easier
[17:11:53] nuria_, not suggesting it, just checking
[17:11:55] mforns: refine lists all columns, so my next set of changes will be here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql
[17:12:29] aha
[17:15:50] mforns: but i am not sure when it is the best time to add a column, do we stop refine, add column, start new refine jobs that include new column?
[17:15:55] mforns: ya, that seems safest
[17:16:25] sounds good :]
[17:16:48] * elukey off!
[17:16:55] cc ottomata
[17:18:21] nuria_: that is probably a good idea, but I think it doesn't matter if you add the column before you add jobs that populate the column
[17:18:29] as long as there aren't various select * type things in the code
[17:18:32] which we generally don't do
[17:18:38] ottomata: ok
[17:18:43] select * across partition boundaries with different schemas will fail
[17:18:59] ottomata: but holding on adding as we do not need it yet
[17:19:09] but as long as fields are explicitly specified, and those fields exist in the targeted partitions, its fine
[17:19:11] aye
[17:22:51] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (leila) (There is a bit of IRC, email, meeting discussions as background missing here. but basically, Aaron, Andrew, and I chatted a couple of weeks ago...
[17:23:41] Analytics-Kanban: Add tagging to webrequest refine process - https://phabricator.wikimedia.org/T171760#3475455 (Nuria) Alter we need to run: alter table webrequest add columns (`tags` array<string> COMMENT 'List containing tags qualifying the request, ex: [portal, wikidata]. Will be used to split webreque...
[17:52:44] anybody got a sec for a brain bounce?
[17:52:48] about refine stuff?
[17:52:57] milimetric: ?
[17:53:11] sure
[17:53:12] omw
[18:13:47] milimetric: DUH
[18:13:51] DateTimeFormat
[18:13:54] i can just use a date time format string :)
[18:14:10] oh yeah :)
[18:14:30] some type of ... string ... that ... formats dates ...
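Editor's note: an illustration (not from the log) of the "just use a date time format string" realization above. A single format string maps a timestamp to the kind of year=/month=/... partition path the refine jobs deal with; GNU date is used here for a self-contained example, and the Java-side analogue would be a DateTimeFormat/DateTimeFormatter pattern.

```shell
# Arbitrary example epoch chosen for this sketch: 2017-07-28T00:00:00Z.
ts=1501200000
# One format string does all the zero-padding and path assembly.
date -u -d "@$ts" +'year=%Y/month=%m/day=%d/hour=%H'
# -> year=2017/month=07/day=28/hour=00
```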
:)
[18:14:35] haha
[18:53:22] Analytics-Kanban, Analytics-Wikistats: Productionise line graph - https://phabricator.wikimedia.org/T171766#3475715 (fdans)
[18:54:23] Analytics-Kanban, Patch-For-Review: Use hive dynamic partitioning to split webrequest on tags - https://phabricator.wikimedia.org/T164020#3475732 (Smalyshev) Is there a place where tags used for splitting are recorded (beyond the actual `webrequest_split_tag` table)?
[18:57:12] !log Deployed refinery-source using jenkins
[18:57:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:00:06] (PS1) Nuria: Add tagging as part of webrequest refine process [analytics/refinery] - https://gerrit.wikimedia.org/r/367940 (https://phabricator.wikimedia.org/T171760)
[20:09:17] Analytics-Kanban, Patch-For-Review: Use hive dynamic partitioning to split webrequest on tags - https://phabricator.wikimedia.org/T164020#3476067 (Nuria) >Is there a place where tags used for splitting are recorded (beyond the actual webrequest_split_tag table)? No, the splitting process is not yet in pl...
[20:09:35] Analytics-Kanban, Patch-For-Review: Add tagging to webrequest refine process - https://phabricator.wikimedia.org/T171760#3476068 (Nuria)
[20:10:51] Analytics-Kanban, Patch-For-Review: Add tagging to webrequest refine process - https://phabricator.wikimedia.org/T171760#3475367 (Nuria) Tested this code with some fake inserts on 1002, will test a bit more data, i just used 1 hour.
[20:17:38] Analytics-Kanban, Patch-For-Review: Add tagging to webrequest refine process - https://phabricator.wikimedia.org/T171760#3476089 (Nuria)
[20:47:34] Analytics-Kanban, Patch-For-Review: Create purging script for mediawiki-history data - https://phabricator.wikimedia.org/T162034#3476151 (Nuria) After looking into this a bit more i think we should keep 6 months of snapshots if we can afford the space, a simple error fixing job could lead to deleting a
[20:53:23] (PS2) Nuria: [WIP] Add tagging as part of webrequest refine process [analytics/refinery] - https://gerrit.wikimedia.org/r/367940 (https://phabricator.wikimedia.org/T171760)
[21:00:54] (PS31) Ottomata: JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[21:01:55] !log Deployed refinery using scap, then deployed onto hdfs
[21:01:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:07:07] mforns: i betcha you are busy and it is v late for you! but i'd love to do a json refine walkthrough with you
[21:07:16] either today or tomorrow?
[21:07:30] ottomata, I'd love that too
[21:07:46] I'm still working for like 20 mins or so
[21:07:52] its getting super close, need to write some tests now, but i want to walk through it with someone first
[21:07:57] in case it needs more refactoring
[21:08:05] mforns: you want to now? or tomorrow?
[21:08:20] mmmmm, JFDI
[21:08:24] :]
[21:08:30] haha
[21:08:31] ok
[21:08:36] batcave?
[21:08:39] yaa
[21:08:42] omw
[21:09:06] https://gerrit.wikimedia.org/r/#/c/346291/
[21:14:16] Analytics, Analytics-Cluster, Operations, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3476259 (Halfak) Looks to me like this task is ready to be resolved. Also, I have no idea why it is assigned to me as I've only consulted on it. @dr0ptp4kt w...
[21:56:22] Analytics: chooseCRANmirror() and install.packages problems in R on production - https://phabricator.wikimedia.org/T171790#3476286 (Reedy)
[22:06:05] Analytics: chooseCRANmirror() and install.packages problems in R on production - https://phabricator.wikimedia.org/T171790#3476328 (GoranSMilovanovic) @Reedy Thanks.