[00:14:04] I haven't gotten around to sending a labs-l email yet, but there were some changes to the main tool labs page over the weekend that I think turned out rather nicely. See https://tools.wmflabs.org/?tool=hay for new hotness of toolinfo.json support [00:15:12] For my next trick I'll stop the top loading of javascript just to sort tables [00:15:47] bd808: <3 nice [00:16:10] I almost don't hate the php code there now [00:16:47] mysqli needs to be replaced with PDO still and think there are some wacky regex things that can go still too [00:17:09] nice! [00:17:13] yay for wacky things going [00:17:19] I wish it used a framework of some sort [00:17:26] the single page router using the query string was an "interesting" design choice [00:17:46] YuviPanda: so I was actually thinking about that as a possbility [00:18:16] I could make it an example program for using the slimapp library [00:18:21] * YuviPanda nods [00:18:31] which would also make it possible for us to have l10n support :) [00:19:16] If we did that it probably deserves a new repo entirely to break off from the deb stuff [00:19:48] a bigger project for another time [00:23:19] bd808: +1 [00:23:21] and :( [00:23:23] and :) [00:23:46] new repo is :( ? [00:24:46] would it be complete chaos for us to build a system that allows tool project members to spin up git repos at will? [00:25:15] hm [00:26:19] I suppose the ugly bit is providing repo browsing and possibly code review too [00:26:30] right and making it like sane [00:26:48] the first thing on my brain is namespacing repo browsing however that woudl be done [00:36:10] bd808: no 'another tie' is ;( [00:36:22] bd808: chasemp there's a bug with lotsa discussion about it [00:36:57] https://phabricator.wikimedia.org/T117071 [00:36:59] bd808: ^ [00:37:07] so it's possible with diffusion [00:37:27] awesome [00:40:38] bd808: I would like to think of the homepage to morph into something that people can use to manage members, create repos, etc [00:40:44] basically to presume wikitech doesn't exist [00:41:02] dumping people into horizon to manage tools makes no sense, and neither does calling it 'service groups' when literally the only project using it is tools [00:41:12] *nod* [00:41:40] service groups were supposed to be part of this quarter's horizon goal but I explicitly took them out for this reason [00:41:48] I'd actually like the tools homepage to be about finding tools but make account/tool management prominent [00:41:58] * YuviPanda nods [00:42:04] it just needs to be part of one coherent whole [00:42:08] yeah [00:42:16] rather than held together by balls of string as it is now [00:42:27] it's all regexs [00:43:06] The number of regexes I replaced with substring checks was fun [00:43:16] yeah [00:43:45] there was a regex ini parser I killed somwehere [00:43:45] I could tell that Coren wrote the PHP and was thinking in perl and translating to PHP in his head the whole time. :) [00:43:51] :) [00:44:09] jsub needs to be rewritten to python as well at some point [00:44:25] which totally happens. I wrote PHP like Java code for years and then wrote Python like PHP for quite a while [00:44:45] * YuviPanda wrote JS-like Python once [00:44:59] that might actually almost work [00:48:53] I also spent a while writing C#-ish python [00:48:56] mmmm LINQ [00:50:29] 6Labs, 10Tool-Labs, 5Patch-For-Review: Install libbytes-random-secure-perl on tool labs - https://phabricator.wikimedia.org/T123824#1964446 (10Anomie) 5Open>3Resolved a:3Anomie [03:19:00] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Varnent was created, changed by Varnent link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Varnent edit summary: Created page with "{{Tools Access Request |Justification=Testing new tools - and adding some python tools currently not available - but available (such as ODT converter cleanup) |Completed=false..." [03:39:15] 10Labs-Vagrant, 10Fundraising Tech Backlog, 10MediaWiki-Vagrant: Write MW-Vagrant puppet to allow us to spin up dev, staging, and testing instances, and deploy sandbox servers on WMF-labs - https://phabricator.wikimedia.org/T99957#1964851 (10bd808) Assuming this should have been tagged with #fr-tech since @a... [04:08:41] (03PS1) 10BryanDavis: Stop toploading javascript [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 [04:10:20] (03CR) 10jenkins-bot: [V: 04-1] Stop toploading javascript [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 (owner: 10BryanDavis) [04:19:22] (03CR) 10BryanDavis: "recheck" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 (owner: 10BryanDavis) [04:19:28] 10Labs-Vagrant, 10Fundraising Tech Backlog, 10MediaWiki-Vagrant: Write MW-Vagrant puppet to allow us to spin up dev, staging, and testing instances, and deploy sandbox servers on WMF-labs - https://phabricator.wikimedia.org/T99957#1964907 (10demon) Interesting, we were just talking about this today in #relen... [04:42:58] (03PS2) 10BryanDavis: Stop toploading javascript [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 [09:08:58] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965106 (10cmarqu) This seems to be down again, e.g. https://tools.wmflabs.org/geohack/geohack.php?pagename=Berlin&language=de¶ms=52.518611111111_N_13... [09:25:18] 6Labs, 10Labs-Infrastructure, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Wikidata: [Bug] Wikidata JSON dumps gets deleted after every new Wikidata dump - https://phabricator.wikimedia.org/T107226#1490125 (10Addshore) So the solution here is to have them in /public/dumps/wikibase ? /public/dum... [09:51:43] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965175 (10Yurik) @cmarqu, Italian and Russian wikis have already switched GeoHack to the much more stable maps.wikimedia.org, and English should be switc... [10:04:37] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965200 (10Kghbln) @Yurik WikiVoyage is using the MapSources extension which in turn depends on this service via the slippymap tag which is currently down... [10:12:09] YuviPanda: <3 [10:16:44] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965237 (10Yurik) @kghbln, en & ru wikivoyage has already switched to the new maps ([[ https://en.wikivoyage.org/wiki/Salzburg#Get_around | example ]]).... [10:29:28] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965254 (10akosiaris) >>! In T103272#1965106, @cmarqu wrote: > This seems to be down again, e.g. https://tools.wmflabs.org/geohack/geohack.php?pagename=Be... [11:01:31] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965328 (10Yurik) @strainu, the maps.wikimedia.org only offers `tile` service, just like the (a|b|c).toolserver.org, so swapping one for another should no... [11:40:59] (03CR) 10Hashar: "recheck" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/265544 (owner: 10Ricordisamoa) [11:41:56] (03CR) 10Hashar: [C: 032] Configure tox job for Jenkins [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/265544 (owner: 10Ricordisamoa) [11:42:21] (03Merged) 10jenkins-bot: Configure tox job for Jenkins [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/265544 (owner: 10Ricordisamoa) [11:42:44] (03CR) 10Hashar: "recheck" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/263846 (owner: 10Ricordisamoa) [11:42:49] (03CR) 10Hashar: "recheck" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [12:34:39] 6Labs, 10Tool-Labs: https://tools-static.wmflabs.org/video2commons/video2commons.js 403 Forbidden despite normal file permissions - https://phabricator.wikimedia.org/T124773#1965451 (10zhuyifei1999) 3NEW [13:13:47] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1965523 (10intracer) Why can't you use https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa ? I just installed OpenJDK 8 with sudo add-apt-repository ppa:openjdk-r/ppa sudo apt-get update sudo apt-ge... [13:17:23] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1965525 (10valhallasw) We don't use PPA's, because they effectively give the PPA owner root access to our systems. See {T114645}. [13:28:51] 6Labs, 10Tool-Labs: https://tools-static.wmflabs.org/video2commons/video2commons.js 403 Forbidden despite normal file permissions - https://phabricator.wikimedia.org/T124773#1965541 (10zhuyifei1999) 5Open>3Resolved a:3zhuyifei1999 Seems like a temporary issue, now everything is good. [13:45:43] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Install wikitech private settings directly onto wikitech hosts - https://phabricator.wikimedia.org/T124732#1965555 (10Aklapper) (@Andrew: Please associate projects to tasks. Thanks!) [14:13:08] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1965591 (10Kghbln) @Yurik Thanks for your insight. > If MapSources/slippymap uses leaflet or openlayers to get tiles Yeah the slippymap tag uses openla... [14:16:56] 10Labs-Other-Projects, 7Tracking: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1965595 (10Aklapper) [14:17:18] 10Labs-Other-Projects: Configure Single Sign On at discourse.wmflabs.org - https://phabricator.wikimedia.org/T124691#1962910 (10Aklapper) [14:17:28] 10Labs-Other-Projects: Succesful pilot of Discourse on discourse.wmflabs.org as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1965607 (10Aklapper) [15:57:55] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Varnent was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=272077 edit summary: [16:41:06] there's some performance issue on tools-bastion-01 [16:42:34] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mbch331 was created, changed by Mbch331 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Mbch331 edit summary: Created page with "{{Tools Access Request |Justification=Running a script to modify data on Wikidata |Completed=false |User Name=Mbch331 }}" [16:48:31] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:10] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 771971 bytes in 2.798 second response time [17:01:28] it seems that's not just tools-bastion-01 [17:02:18] liangent: nothing on tools-bastion-01 looks particularly horrible as far as cpu usage goes [17:02:51] slow nfs is slow nfs as always. someday we will find a better way to ship files around [17:04:00] well I didnt look into it [17:04:15] I just feel it's too slow [17:16:57] silly question : do i need to request my tool labs project be added in phabricator if i want to tag a question i'm intending to ask there on how to better run a request with my tool, or do i just use the "tool labs" tag ? [17:17:47] (the request i intend to run being a db replica request) [17:18:05] Alphos: #tool-labs and #dba should be good [17:18:20] ok, thanks :) [17:18:34] as tags, not irc channels, right ? [17:21:06] Alphos: yeag [17:23:26] thanks again ^^ [17:24:48] does the grid engine stumble right now? [17:31:00] doctaxon: ? [17:32:08] here?! [17:32:37] cronjobs with parameter -once was started five times, maybe a cron issue [17:33:20] but it seems to be one time only [17:33:34] it's running well again [17:35:56] it looks like toollabs disintegrates more and more to zero, first crontab, then grid master, ... [17:38:58] ... [17:39:24] the amount of people using tool labs grows, and with that, there are growing pains [17:40:21] counteract it [17:40:53] and we are. [17:40:59] nice [17:42:46] valhallasw`cloud: saw your reply to teh maint thread for the sge master, hopefully that works for you? having you around would make me feel better for sure [17:43:17] chasemp: yeah, 18 UTC is OK for me, although I may be eating dinner somewhere around that time [17:43:36] I won't be around at 2 AM UTC, though ;-) [17:45:14] right :) was something proposed at that time? I missed the reference [17:45:21] but if a general heads up makes sense ;) [17:45:40] chasemp: yuvi announced 18-02 UTC as maintenance window :-) [17:45:46] ohh [17:45:48] right [17:47:26] i hope it will be over before 0:00 UTC, because starting important cronjobs [17:48:13] i wrote it to the labs mailing list [17:49:05] doctaxon: yes, and I sent a reply. [17:49:16] I've read [17:50:48] let's simply hope, all will be well [18:21:38] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mbch331 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=272540 edit summary: [18:32:56] 6Labs, 10Tool-Labs, 10DBA: What would be the preferred way to run a series of read queries on DB replica, each on two *_p databases ? - https://phabricator.wikimedia.org/T124805#1966479 (10Alphos) 3NEW [18:40:25] (03Abandoned) 10BryanDavis: Replace array_key_exists with isset [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266178 (owner: 10BryanDavis) [19:11:27] YuviPanda: andrewbogott fyi the *.wikimedia.org outage [19:11:36] seems to be affecting a lot of tools negatively [19:11:44] as in load on NFS is now close to 200 [19:11:59] I think tools are basically stuck / retrying constantly [19:12:08] logging a rate of errors taht is crazy and things [19:12:14] not much I can do, whackamole here is long tailed [19:12:55] valhallasw`cloud: ^ fyi [19:13:03] * valhallasw`cloud nods [19:13:16] 10Tool-Labs-tools-Other, 6Community-Tech, 7Community-Wishlist-Survey, 7Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1966721 (10Egedda) Hello everyone. I'm the project manager for the earlier discussed group of 10 (not 8) students at KTH, Royal Institute of Technology, and we... [19:37:05] chasemp: o,h, ok [19:37:16] YuviPanda: are you going to be about for a bit? [19:37:24] I'm not sure nfs will recover sanely here [19:37:45] @s [19:38:22] I just got woken up so getting my bearings [19:38:23] yeah sure [19:38:23] mornin' [19:40:59] chasemp: hmm, so I got a recovery now [19:41:11] it's going to flake for a bit [19:41:18] at least [19:41:26] * YuviPanda is looking at iftop [19:42:23] yeah I can't ssh [19:42:44] can now [19:43:05] but only as root [19:43:08] the other disavantage to hard mount I guess is clients keep trying to at their i/o persistently and it makes something like this hard to come back from [19:43:35] hmm [19:43:43] this is a fucked up cascacde, heh. [19:43:45] chasemp: I would expect there to be a kernel parameter for that [19:43:57] only for soft mount afaik [19:44:14] it's entirely client specific behavior etc [19:44:26] timeo=n; The time in deciseconds (tenths of a second) the NFS client waits for a response before it retries an NFS request.; For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds. [19:44:40] right, for sure [19:44:48] and I think that backoff period does apply in both cases [19:44:51] The kernel doesn't do backoff in the case of a soft mount, right? It just returns an error [19:45:05] but in this case it's just not enough of a grace period [19:45:14] it can do either I think, or at least timeout and retry etc [19:45:44] yes, you're right. For a soft mount, it retries three times, then errors out [19:46:01] (configurable) [19:46:10] *nod* [19:46:16] it's back now [19:46:24] what's back? [19:46:39] NFS access? [19:46:39] ah [19:46:40] top - 19:46:32 up 41 days, 11:26, 5 users, load average: 244.06, 243.26, 215.72 [19:46:42] no, back to being stuck again [19:46:45] oh dear [19:46:49] it's just going to flake until we do someting or it works it's way down [19:46:54] and I'm not sure if it will sanely [19:46:55] * YuviPanda times an ls [19:47:10] do you think restarting the nfs server processes will have an effect? [19:47:14] 8 cores -- 271 load atm [19:47:17] I'll send an email to labs-l in the meanwhile [19:47:25] well, I'm not sure how clietns will react to that at this moment [19:47:31] I'm contemplating it [19:48:48] I was hoping you guys were about to help w/ deal with fallout :) [19:48:49] chasemp: let's just do it :D clients aren't in great shape now anyway [19:48:49] because we may need to [19:48:49] > real 0m21.406s [19:48:49] right [19:48:50] yeah I'll be here [19:48:50] for a ls [19:48:50] wait, 271 load for the NFS *server*? [19:48:50] it's so far behind it's like spinning it's wheels trying to sort out the clients and scheduling [19:48:50] ha yes [19:48:55] seriously [19:49:15] I was hopeful for a minute as we were in teh 170's but it's gotten worse not better [19:49:24] it's hard to gauage at this point [19:49:50] doesn't that mean there are IO issues to the backend drives? [19:50:08] well, yes and no I guess [19:50:12] yes iowait [19:50:23] but no as in, ith as to have cpu / mem to schedule and keep track of it all [19:50:29] and id was literally 0 for a long while [19:50:38] chicken /egg I guess [19:50:51] let's restart nfs-kernel-server? :) [19:51:01] I want to give it another minute here if that's ok [19:51:17] ok [19:51:27] so there's 271 processes waiting to run, and I don't see how that can purely be from NFS usage. But I may be misunderstanding load. [19:51:43] in any case, I'm going to figure out if there's any tool hosts hammering the network [19:51:49] (as a proxy for hammering nfs) [19:51:57] sweet :) [19:56:19] I don't pretend to fully grok load but my mental model is more like the queue lenghth for things that need cpu attention [19:56:19] iftop is almost empty now [19:56:19] which can be influenced by other blockers [19:56:19] * YuviPanda restarts iftop [19:56:19] ok it shot back up to 280 load [19:56:20] let's do a restart [19:56:20] ready? [19:56:20] ok [19:56:20] yeah, iftop is completely blank [19:56:21] yeah do it chasemp [19:57:15] yeah, I'm mostly wondering whether the issue might be the disks dying or something like that [19:57:15] load is right back...I assume as clients keep aggressively trying [19:57:16] though it is less now I think [19:57:17] giving it another minute [19:57:18] yeah, since there's no network traffic [19:57:18] well also iftop can be behind [19:57:19] or the buffer will drop things if it needs, unusual to see nothing but [19:57:19] it's not like a real hard and fast mirror of net things under extreme duress I think [19:57:19] 23 5m load wait [19:57:19] yeah [19:57:19] ok [19:57:20] * valhallasw`cloud softly kills graphite [19:57:20] can you help me run through tools to make sure things function sanely? [19:57:20] :) [19:57:20] don't see this every day [19:57:20] load average: 10.86, 112.07, 174.03 [19:57:23] apparently graphite doesn't like requesting the data of a hundred measures [19:57:59] :) [19:58:08] ls on tools-login works, but somewhat slower than normal [19:58:08] yeah am doing a bunch chasemp [19:58:12] thanks man [19:58:28] widar stuff looks ok [19:58:36] I'm looking at the huge sleeping procs thing (639) [19:58:43] load is climbing again and I'm a bit worried [19:58:48] yeah it's going back up [19:59:14] load average: 158.45, 118.50, 166.96 [19:59:19] ok so [19:59:20] what to do here [19:59:44] yeah, tools got fucked up again [19:59:54] hmm [19:59:56] only some [20:00:07] fucked up? [20:00:20] mostly they seem to load fine [20:00:28] load is right back up there [20:00:46] chasemp: I see watchdog taking up a lot of CPU [20:00:53] last time this happened was hardware related [20:00:56] well [20:00:59] kernel related [20:01:06] the cpu soft lockup bug caused similar load patterns [20:01:06] switch over to labstore1002? [20:01:21] I'm not sure how well-documented that process is these days [20:01:25] that unfortunately involves someone driving to the dc and is not so easy [20:01:46] it's more feasible to start rebooting clients idk [20:01:49] chasemp: and CPU stalls in dmesg [20:01:55] yeah [20:02:09] > [Tue Jan 26 19:59:28 2016] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kworker/7:5:25260] [20:02:12] am pretty sure it's the same thing [20:02:16] and the server needs at least a reboot [20:02:31] mark: paravoid ^ [20:02:38] (since you guys helped deal with this last time) [20:02:57] I guess this just happened to co-incide with the wikimedia.org issue [20:03:23] one triggered the other possibly but it would be hugely coincidental if unrelated [20:03:33] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1453838595.134&target=secondYAxis(servers.labstore1001.cpu.total.iowait)&target=servers.labstore1001.network.eth0.rx_byte&target=servers.labstore1001.network.eth0.tx_byte [20:03:57] iowait was ok until midnight UTC, went up to ~20 at that point [20:04:23] chasemp: so my suggestion at this point is to restart the server [20:04:28] if you go back further it's not super unusual https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1453838595.134&target=secondYAxis%28servers.labstore1001.cpu.total.iowait%29&target=servers.labstore1001.network.eth0.rx_byte&target=servers.labstore1001.network.eth0.tx_byte&from=-7d [20:04:54] YuviPanda: I thought the recovery time on that was long? [20:05:09] chasemp: yes, but there isn't really another option there... [20:05:16] to a stalling CPU from an unidentified kernel bug [20:06:38] chasemp: ? [20:06:43] thinking [20:06:45] ok [20:08:15] I think it's servicing nfs atm it's just far beyond deluged with requests, a stop and a restart clears it out but it quickly builds back to it's previous high load [20:08:39] I'm not sure about the dmesg stuff, I have a thought that it's not something a reboot will change [20:08:51] well [20:08:52] this is the 4th time [20:08:52] but I'm not sure and I'm consdering blocking nfs clients [20:08:52] we're having this exact issue [20:08:53] kernel:[3587732.792205] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:1567] [20:08:56] yeah [20:09:01] and a reboot has cleared it all the time [20:09:12] was it always high load? [20:09:14] yes [20:09:20] you always notice it when load skyrockets [20:09:49] Message from syslogd@labstore1001 at Jan 26 20:08:39 ... [20:09:51] kernel:[3587732.792205] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:1567] [20:09:53] heh [20:09:55] now it's sending messeges to shell [20:11:17] let me know what you decide to do [20:11:32] I'm not against what you are saying, maybe no choice, I just want a minute to look here [20:11:57] yeah man, no pressure :) [20:12:29] just saying it so you know I'm not doing anything atm on labstore (just checkin out different tools) [20:12:34] kk [20:12:41] valhallasw`cloud: interestingly, a lot of the webservices seem up [20:12:50] valhallasw`cloud: and I can actually ssh as myself [20:12:58] and a time ls [20:13:00] is [20:13:02] real 0m0.154s [20:14:59] YuviPanda: can you get on console cli? [20:15:16] chasemp: on labs instances? [20:15:19] yeah [20:15:24] on labstore :) [20:15:26] that didn't happen the previous times btw [20:15:34] chasemp: I still have an ssh on labstore [20:15:42] or are you talkinga bout mgmt? [20:15:44] yeah [20:15:57] yeah, let me get on mgmt then [20:17:52] chasemp: yeah I'm on mgmt now [20:18:19] k thanks I swear I'm looking at something, not that I have any great ideas [20:18:30] YuviPanda: have you checked the raid status? Just to be sure that it's not a hardware issue. [20:18:59] valhallasw`cloud: so in the past when we've had raid issues it manifests itself as madm reassembly taking up lots of io/cpu [20:19:01] that isn't the case now [20:19:08] instead this is the cpu lockup [20:19:09] ah, that makese sense [20:19:15] valhallasw`cloud: but am going to verify it now [20:19:31] chasemp: yeah, np man, since the webservice stuff is actually up take your time [20:40:32] is phab search down? [20:40:33] viewing tickets seems to work fine [20:43:37] niedzielski: {{worksforme}} [20:43:38] valhallasw`cloud: hm, seems to be working now :) was hung for about 4-5 minutes [20:43:39] hosts with highest network usage https://www.irccloud.com/pastebin/0cRwwrQb/ [20:43:40] but can't login to exec-1211 as normal user [20:43:40] valhallasw`cloud: so we stopped NFS kernel [20:43:40] err [20:43:40] nfs-kernel-server [20:43:41] and debugging [20:43:41] ok [20:43:42] possibly lvm issues [20:43:42] we might be able to bring it back up [20:43:42] ok [20:43:42] without having to restart server [20:46:04] valhallasw`cloud: should be back now [20:46:11] YuviPanda: yep. [20:46:51] YuviPanda: ok, so tools-exec-1211 is doing ~20MB/s tx [20:47:14] but nethogs can't tell me what the underlying process is, of course, because it's probably NFS [20:48:00] * valhallasw`cloud eyes sort /data/project/templatetiger(...) [20:48:31] valhallasw`cloud: yeah, I usually use iftop followed by iotop [20:48:34] iotop shows who it is [20:48:35] `sort` on a 5GB file? [20:48:49] yah [20:48:50] me thinks [20:49:00] tools.te... [20:50:07] someone killed it or it's gone? [20:50:08] I suspended the job [20:51:43] it was started 01/26/2016 19:03:39 UTC, so around the same time frame [20:51:55] but did that actually decrease load on the nfs host? [20:52:03] valhallasw`cloud: what does suspending the job actually do? [20:52:11] valhallasw`cloud: we suspect the nfs issue was actually an lvs bug [20:52:14] err [20:52:16] lvm [20:52:18] bug [20:52:22] suspending = ctrl-Z [20:52:47] ctrl-z just sends it to background but keeps executing, no? [20:52:52] no, stops execution [20:53:17] oh [20:53:20] ok [20:53:37] * YuviPanda has had an incompetent morning so far :D [20:53:51] valhallasw`cloud: ok, so do we send the maintainers a message now? [20:54:45] it looks like something that should be in a database to begin with [20:55:02] yeah [20:55:21] but, getting back to the point -- is it an issue, or just something that should be fixed eventually? [20:56:45] I think regularly running sort on 5Gish files on NFS is definitely an issue [20:58:34] 10Tool-Labs-tools-Other, 6Community-Tech, 7Community-Wishlist-Survey, 7Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#1967255 (10DannyH) Hi @Egedda, it's nice to meet you! I'm on the WMF's Community Tech team, and we were just talking to Jan @Ainali earlier today about offering... [20:58:34] my thinking on what happened now is that the redirect on *.wikimedia.org made tools go a bit crazy and churn NFS (maybe lots of error logs?!) and this caused an increase in load that somehow co-incided with an lvremove, causing the actual outage [20:58:34] yes I caught one yesterday as well [20:58:34] a large merge and sort on some files [20:58:36] YuviPanda: yes basically was what I had put together as well, although I'm not sure if lvremove is related of if the high use trigged the same bug [20:58:51] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1967259 (10valhallasw) 3NEW [20:58:59] so what do we do with this one? kill it? [20:59:26] Jan 26 01:00:01 labstore1001 systemd[1]: Starting Clean replication snapshots from labstore... [20:59:29] Jan 26 01:00:05 labstore1001 cleanup-snapshots[6160]: Logical volume "others20160120030010" successfully removed [20:59:31] Jan 26 01:00:06 labstore1001 cleanup-snapshots[6160]: Logical volume "maps20160120040013" successfully removed [20:59:33] Jan 26 01:00:06 labstore1001 systemd[1]: Started Clean replication snapshots from labstore. [20:59:48] chasemp: so lvremove wasn't during the outage [21:00:05] yes agreed [21:00:06] valhallasw`cloud: yeah [21:01:51] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1967271 (10valhallasw) I have killed this job. The start time coincides with the NFS outage today (job was started at 01/26/2016 19:03:39 UTC), and -- although we're not sure i... [21:04:07] ok [21:04:12] why is tools-webgrid-generic-1405 doing so much i/o [21:04:28] * YuviPanda checks iotop [21:04:42] hmm [21:04:44] I don't see anything [21:05:05] load is pretty high [21:05:07] nethogs on labstore says it's the top or second to top [21:05:17] and load is high [21:05:22] but where is the proc :) [21:05:36] yeah, nethogs on the host notes ~7MB/s [21:06:06] I don't see much w/ fatrace either [21:07:09] I think this vm needs a reboot [21:07:16] I'm flummoxed [21:07:24] 6Labs, 10wikitech.wikimedia.org, 5Patch-For-Review: Install wikitech private settings directly onto wikitech hosts - https://phabricator.wikimedia.org/T124732#1967300 (10Andrew) 5Open>3Resolved [21:07:32] all is I see if you guys looking :) [21:08:08] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: mail from testlabs to ops list - https://phabricator.wikimedia.org/T124516#1967301 (10Andrew) 5Open>3Resolved [21:08:37] heh [21:08:59] chasemp: valhallasw`cloud I bet it's ifttt [21:09:15] it's one of our most hit applications [21:09:25] I had that thought [21:09:31] but I don't actually see it [21:09:52] hmm [21:09:57] ls on /data/project/ifttt just hung [21:10:06] load on labstore is ifne [21:10:12] although [21:10:17] nethogs it taking up 60% of one CPU :D [21:10:30] everything is now hanging on tools-exec-1217, which was also having lots of io [21:10:34] net io, that is [21:12:02]