[00:28:32] Coren: It's not critical, but I left a few comments at https://gerrit.wikimedia.org/r/#/c/204528/12/manifests/role/ci.pp - it seems the override file is cleared by something right after it starts.
[00:28:35] Is that expected?
[00:29:00] every 30 minutes the file is re-created and enable changed 'false' to 'true'
[00:29:10] From what I can tell it does not restart, but it seems odd
[00:43:31] (03PS1) 10Jforrester: Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040
[00:49:31] (03CR) 10Alex Monk: [C: 032] Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040 (owner: 10Jforrester)
[00:52:46] (03Merged) 10jenkins-bot: Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040 (owner: 10Jforrester)
[00:53:33] (03PS1) 10Jforrester: Move Math repo from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206041
[01:00:05] !log tools good bye tools-login.eqiad.wmflabs
[01:00:11] Logged the message, Master
[01:01:35] PROBLEM - Host tools-login is DOWN: CRITICAL - Host Unreachable (10.68.16.7)
[01:01:43] * YuviPanda pats shinken-wm
[01:01:44] we know
[02:05:23] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1230004 (10yuvipanda)
[02:05:24] 6Labs, 10Tool-Labs: Bare instance name does not resolve on tools - https://phabricator.wikimedia.org/T96642#1230001 (10yuvipanda) 5Open>3declined a:3yuvipanda It's ok. dig doesn't go through resolv.conf And dnsmasq was responding because it was also the dhcp server. nbd. Let's worry about this if it bre...
[03:06:09] * Negative24 plays taps
[03:31:38] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230075 (10Shilad) 3NEW
[04:23:27] 6Labs: lighttpd redirects URLs of directories without a trailing slash from https to http - https://phabricator.wikimedia.org/T66627#1230155 (10Krinkle) See also {T95164}.
[04:56:01] 6Labs, 10Continuous-Integration: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1230309 (10Krinkle) 5Resolved>3Open Our goal for 30G space was based on the following estimate: > 10G for system, 10G for git replication and 10G for workspace. How...
[05:04:54] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1230315 (10yuvipanda) 5Open>3Resolved a:3yuvipanda It is gone now.
[06:17:53] I have problemets with php on command line on tool-dev. It's display sourcecode insted for run php-script.
[06:23:07] hey steenth
[06:23:15] what command are you executing? what tool is this?
[06:26:01] it is in my own account.. command is php
[06:26:30] can you tell me what account it is and the exact commandline you were using?
[06:28:30] php ~steenth/test.php
[06:30:54] steenth: it was missing a full opening tag
[06:30:57] it is running a newer version of PHP now
[06:31:07] that does not support the short tag
[06:31:14] I added and it works now
[06:31:20] okay
[06:31:50] yw
[06:32:35] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230376 (10yuvipanda) Hi! Have you considered hosting it on tools.wmflabs.org instead?
[06:32:53] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230377 (10yuvipanda) Also @halfak might already have a project for this?
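For reference, a minimal sketch of the short-tag problem described above, assuming the script is the ~steenth/test.php from the exchange (the exact file is only an example):

```bash
# Hedged sketch of the diagnosis above; the script path is the one from the log,
# everything else is illustrative.
head -n 1 ~steenth/test.php                            # shows the opening tag; "<?" is the short form
php -r 'var_dump((bool) ini_get("short_open_tag"));'   # false on the upgraded PHP, so "<?" is treated as plain text and echoed
# Fix: change the first line of the script from "<?" to the full "<?php" tag.
```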
[06:54:17] YuviPanda, my problem is change in config-setting for short_open_tag - It's change from on to off (I use same version at home)
[06:54:35] steenth: yup - this is a side effect of php changing from version 5.3 to 5.5
[06:54:45] this was announced on labs-l and the labs-announce mailing lists
[06:54:58] the short_open_tag has been deprecated by PHP for many years now
[06:55:02] and scripts should use <?php
whatever happened to the sweet ASCII art on tools-login and tools-dev? :(
[07:53:05] (03CR) 10Nemo bis: [C: 031] "Thanks for clarifying the intention." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205222 (owner: 10Werdna)
[08:03:01] a few tools fails on http but work the 2nd time on https :/
[08:05:41] afeder: hope they do come back at some point, yeah. there’s a bug with our PAM setup...
[08:09:27] ah
[08:27:36] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-static redundant - https://phabricator.wikimedia.org/T96966#1230510 (10yuvipanda) 3NEW
[08:28:05] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-mail redundant - https://phabricator.wikimedia.org/T96967#1230517 (10yuvipanda) 3NEW
[08:29:07] afeder: https://phabricator.wikimedia.org/T85910 for reference
[08:29:18] and https://phabricator.wikimedia.org/T85307
[09:53:58] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/FNDE was created, changed by FNDE link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fFNDE edit summary: Created page with "{{Tools Access Request |Justification=Using the API for recent changes |Completed=false |User Name=FNDE }}"
[11:27:29] hello
[11:27:41] I cannot get my PHP script to work on cron
[11:27:54] cron.err says "libgcc_s.so.1 must be installed for pthread_cancel to work"
[11:30:20] hello?
[11:45:52] Superyetkin: you need to give it more memory, see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Why_am_I_getting_errors_about_libgcc_s.so.1_must_be_installed_for_pthread_cancel_to_work.3F
[11:46:15] e.g. jstart -mem 2G ./somescript
[11:48:03] sitic: thanks
[11:48:35] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Cipher was created, changed by Cipher link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fCipher edit summary: Created page with "{{Tools Access Request |Justification=Making things better. Personal experience. |Completed=false |User Name=Cipher }}"
[12:40:36] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1231012 (10Halfak) @yuvipanda, I don't have a specific project in mind for WikiBrain. The services will need to hold onto a lot of of ram (~16GB) long term, so it seemed to me that a shared environment that a shared environment wouldn't w...
[13:17:50] !log tools beginning migration of tools instances to labvirt100x hosts
[13:17:53] Logged the message, dummy
[13:18:39] andrewbogott: This is done one at a time, right?
[13:18:50] Coren: yep, once every 20 mins.
[13:19:04] alphabetically by name
[13:20:29] Speaking of, I'm going to start the reboot-if-idmap rounds in ~40m. tools and deployment-prep are already done, so no impact there and I know many others have been done by their admins.
[13:20:54] ok, shouldn’t interferere.
[13:20:58] 10Tool-Labs-tools-Global-user-contributions: https://tools.wmflabs.org/guc - https://phabricator.wikimedia.org/T94351#1231078 (10Oleg3280) please, fix this bug. links do not work. sorry for bad english. thanks.
[13:21:00] um… interferererererere
[13:22:18] * Coren ponders.
[13:23:02] Actually, I should tweak that script to actually attempt unmounting and unloading the modules before reboot - that may save a couple of needless reboots of instances that aren't actually using NFS at this time.
[13:49:05] andrewbogott: should anything have changed with ldap recently, and/or who should I ask about that. deployment-prep (and staging to a lesser extent) are having a bad day looking for gid 50062 deployment-bastion, keep seeing "nslcd[1293]: [13d149] error writing to client: Broken pipe"
[13:50:16] thcipriani|afk: nothing should have happened with ldap. It's possible that your instance is just starved for network access. Is that mostly happening on one particular vm?
[13:51:33] andrewbogott: no, seems to be fairly systemic, see the shinken alerts in -releng, happening for the past 9 hours or so across various instances
[13:52:12] is it failing every time, or intermittently?
[13:52:24] I mean, the shinken alerts + the ldap problems could be systemic of a larger network issue, the only thing I see in syslog is what I mentioned
[13:52:28] intermittently
[13:52:51] also instances are running super slowly
[13:54:42] thcipriani|afk: ok — let me clear the thing I'm working on now, then I'll look around
[13:54:57] andrewbogott: thanks!
[14:15:20] YuviPanda: https://gerrit.wikimedia.org/r/206118 but needs a lookover from someone who gets puppet syntax -_-'
[14:15:39] and I fail pep8
[14:15:41] trololol
[14:16:16] 'line too long, 81 > 79 characters'
[14:16:31] http://troll.me/images/atomic-rage/fuuuuuuuuu.jpg
[15:03:21] I'm getting random 404 from tools
[15:05:01] Nemo_bis: by 'random' do you mean intermittent?
[15:05:11] 10Tool-Labs: s7.labsdb long lag - https://phabricator.wikimedia.org/T96646#1231278 (10eranroz) 5Open>3Resolved
[15:05:43] Nemo_bis: 404s? That's normally from the tool itself, I don't think the proxy can generate a 404 itself.
[15:06:53] andrewbogott: yes, intermittent
[15:07:17] the "not serviced" error, that's a 404 right?
[15:08:18] Is any one tool going down and back up and back down again, or is that that each given tool shows only a short period down?
[15:08:26] Ah, no, that's a 503. If you're getting those intermittently it means that the tool is crashing repeatedly and being restarted.
[15:10:34] Coren: could also be from the proxy being unable to contact the exec host?
[15:10:41] andrewbogott: Daily lulz: if you run a script that reboots instances from an instance, make sure the instance you're running the script /from/ isn't in the lst. :-)
[15:11:06] hah! Like sawing off the branch you're sitting on
[15:11:08] andrewbogott: No, that'd just give a normal 502 - 503 means there is no mapping for the tool in redis
[15:11:39] Coren: great, then I will presume that this is not my fault :)
[15:12:08] andrewbogott: Well unless the tool is really sensitive to network bandwidth and that's why it crashes. :-) But yeah, probably not you.
[15:12:16] Nemo_bis: Are you seeing this on a particular tool?
[15:12:49] (Yuvi's new manifest thing is more obstinate about restarting webservices, but it doesn't help webservices that crash not crashing in the first place)
[15:12:55] 6Labs, 10Beta-Cluster: Migrate deployment-prep to new labvirt hosts - https://phabricator.wikimedia.org/T96678#1231284 (10Andrew) 5Open>3Resolved This is done!
[15:15:31] andrewbogott: Also now I totally share your pain re "instances that don't properly run puppet or merge from it regularily"
[15:15:57] I think it was https://tools.wmflabs.org/xtools/pages/?user=Yiyi&lang=it&wiki=wikipedia&namespace=all&redirects=none
[15:16:12] Might be a timeout, rather than a progressing restart
[15:16:51] Nemo_bis: xtools is about as stable as jenga on rolleskates. I know T13 is helping rewrite most of it in a separate project, but it's a wip.
[15:17:21] Of course, but errors should be truthful :)
[15:17:37] Nemo_bis: ?
[15:17:47] The proxy really can't tell /why/ a tool just went down, all it can do is tell you that it's not currently up.
[15:19:21] The only time I see 404 on any xTools is when something is going on with labs (tool specific or otherwise). When the tool locks up it returns a 503.
[15:19:42] Restarting doesn't fix 404s.
[15:19:48] T13|mobile: They weren't 404 - they were 503s. :_)
[15:20:09] I'll restart in a minute then.
[15:20:20] * Coren loves it when people name instances after themselves. Makes some error messages and status reports more interesting.
[15:20:32] "legoktm is out of date; rebooting"
[15:21:00] legoktm is out of date though...
[15:21:04] :p
[15:23:36] well, he *is* on vacation, isn't he? :P
[15:24:05] Rename and flee
[15:24:46] T13|mobile: are you restarting right now? just got a blank page :)
[15:25:02] Yep, done.
[15:25:39] What drives me crazy are the 5xx errors upon URLs which just redirect elsewhere. Sigh.
[15:26:07] Where? Huh?
[15:27:52] -ec is working now. -articleinfo too.
[15:28:47] No topedits though. Jrm.
[15:29:46] So xTools core still isn't fixed... restarting again.
[15:30:16] * Coren is gratified by the number of 'not needed' the run is getting. Looks like some people actually read announcements. :-)
[15:31:34] Done.
[15:31:46] Announcements?
[15:31:50] ;p
[15:42:16] !log migrating cvn-app4 to labvirt1001 in an attempt to redistribute load a bit
[15:42:17] migrating is not a valid project.
[15:42:31] !log cvn migrating cvn-app4 to labvirt1001 in an attempt to redistribute load a bit
[15:42:35] Logged the message, dummy
[16:10:02] andrewbogott: Looks like all but the broken puppets are done.
[16:53:49] andrewbogott: Do you have any particular preference on how I handle instances that haven't been puppetized properly in (apparently) some time?
[16:54:33] Coren: not really. If they're self-hosted sometimes I make a local branch to preserve local changes and then just unilaterally reset to origin
[16:54:48] But it's polite to notify people when you do that :)
[17:01:20] andrewbogott: Alternately, I can hotfix I suppose - it's a one-liner file to add.
[17:01:37] andrewbogott: And if puppet kicks in it's a noop since all it does is add that file.
[17:01:50] that seems easy enough
[17:04:16] I'm going to let the instances that aren't running at all just lay - they'll have to fix the broken if/when they are restarted.
[17:05:45] It's not integration hosts that you're working on, is it?
[17:05:56] * Coren checks.
[17:06:30] andrewbogott: Nope, they all popped up 'not needed'
[17:06:57] ok, unrelated shinken warnings then
[17:07:00] Ah, not integration-publisher - that one ended up being rebooted. Quite some time ago though.
[17:07:11] (~100m ago)
[17:10:37] andrewbogott: afaict, integration is having puppet issues because hiera. Not my work.
[17:10:49] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item contint::nodepool_host in any Hiera data file and no default supplied at /etc/puppet/modules/contint/manifests/firewall.pp:28 on node i-00000932.eqiad.wmflabs
[17:10:52] great, I will continue to ignore then :)
[17:10:54] must be wip
[17:13:25] !log cvn rebooted cvn-app5 and cvn-app4 (not at the same time) because I suspect them of running amok
[17:13:28] Logged the message, dummy
[17:19:47] !log deployment-prep rebooting deployment-parsoidcache02 because it seems troubled
[17:19:50] Logged the message, dummy
[17:48:42] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/FNDE was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155354 edit summary:
[17:48:49] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Cipher was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155355 edit summary:
[17:59:32] Coren: any network issues right now?
[18:00:04] Betacommand: There may be ocassional spikes of network lagginess as instances are being migrated to the new hardware.
[18:00:35] Tends to be short bursts though. What are you seeing?
[18:00:35] OK, ive just seen my IRC bot timeout again
[18:00:53] Hm. Do you know what your timeout is?
[18:01:28] its an IRC ping timeout factor, not sure what freenode's setting is
[18:02:13] I think they're fairly conservative; maybe a minute.
[18:02:29] That's still fairly long - I'll keep a closer eye on it.
[18:03:16] its my stalkboten job thats timing out
[18:03:34] Just the one?
[18:04:29] Its the only copy im seeing die
[18:05:17] Hm. Might not be network related then - I would expect others would suffer equally. Perhaps something else is making the bot stall and not respond to pings?
[18:05:28] DB dependency?
[18:05:33] Nope
[18:05:57] filesystem is reasonably snappy given the network load, so probably not that.
[18:05:57] its freenode/wmf crossover irc bot
[18:06:01] * Coren ponders.
[18:06:17] Do you know what job number it has? I'll check the node it runs on.
[18:06:23] the codes fairly stable, havent really changed it in 10 years
[18:08:08] Coren: http://pastebin.com/Nj27Pr0w
[18:08:59] Hm.
[18:09:22] exec-03 doesn't seem unusually loaded or laggy.
[18:09:35] Mind if I strace your job while it runs so I can catch it in the act if it times out again?
[18:09:44] feel free
[18:10:21] Just dont publish the IRC creds if you run across them :P
[18:13:18] oh. mah. gawd. Php is such a ridiculously bad I/O design - it actuall does a poll() with a timout of 0 with a loop around it that does a nanosleep(500000)
[18:14:31] And does recvfrom() with MSG_DONTWAIT inbetween.
[18:15:20] This isn't going to help - I can't log this to disk it'll add about 30 lines of strace info 2000 times per second.
[18:18:21] Coren: Ive been meaning to re-write it for years, I inherited this code
[18:18:45] Betacommand: I don't think it's your fault - php's io layer has always been crap. :-)
[18:19:01] Coren: Its PHP
[18:19:09] which is crap
[18:19:16] I prefer python
[18:19:30] At any rate, I've got as much instrumentation as I can on the process; it may help figure out what happens exactly if it dies again.
[18:19:55] I'm a perl hacker myself, but python will do in a pinch. :-)
[18:20:21] Coren: its only died twice today, nothing earth shattering. And the actual process doesnt die, it just kinda disconnects
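A minimal sketch of attaching strace to the running bot as discussed above; the PID, output path and syscall filter are assumptions, not what Coren actually ran:

```bash
# Hedged sketch: attach strace to the long-running job to catch a stall in the act.
BOT_PID=12345   # placeholder; the real PID comes from qstat/ps on the exec node
sudo strace -f -tt -e trace=network,poll,nanosleep -o /tmp/stalkboten.strace -p "$BOT_PID"
# -f follows threads/forks, -tt adds microsecond timestamps so a multi-second gap is obvious;
# note Coren's caveat above: PHP's busy-poll loop makes unfiltered output enormous.
```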
[18:28:53] hey, Everytime I want to connect to labs I get this warning:
[18:28:53] Warning: the ECDSA host key for 'tools-dev.wmflabs.org' differs from the key for the IP address '208.80.155.132'
[18:29:05] Offending key for IP in /home/amir/.ssh/known_hosts:15
[18:29:05] Matching host key in /home/amir/.ssh/known_hosts:22
[18:29:18] It makes me worried if something is wrong about my pc
[18:29:26] Amir1: tools-dev has been upgraded; the new key is mentionned in the announcement email.
[18:29:42] so I should fix it by hand
[18:29:49] thanks Coren :)
[19:14:57] andrewbogott_afk: We need to do more to discourage people away from self-hosted puppet.
[19:22:29] Coren: out of curiosity how many virt nodes we have now and what is their hardware? do we have this somewhere public?
[19:28:36] petan: I don't know if it's public anywhere, but it's not secret. Lemme try to find that out for you - I'm not sure anymore because we got new hardware coming in and we're decomissioning some. :-)
[19:30:18] Coren, any idea what's going on with beta?
[19:30:43] Krenair: I know of nothing going on with the beta cluster, specifically, but I'm not the best one to ask. :-)
[19:30:51] ok
[19:31:07] I noticed shinken in -releng saying a bunch of hosts were going down and up etc.
[19:31:20] and there are issues connecting to DBs apparently
[19:36:34] Coren, could you add me to https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta ?
[19:37:25] Coren: I can help you decomission some of the hardware XD I could find a room for it in my appartment
[19:38:08] like some cute server with 16 cpu's and 512gb of ram :3
[19:38:22] valhallasw`cloud: Actually, petan's the one to ask for that. :-P
[19:38:43] valhallasw`cloud: sec
[19:38:47] wait, I see I can just add myself to toolsbeta.admin :P
[19:38:49] petan: I think that the WMF generally donates hardware that's still useful to nonprofits who need it. :-)
[19:38:58] (but I'm not an admin in the project? wth)
[19:39:25] Coren: I would make a cat-nest out of that server for my cats and that counts :P
[19:39:50] valhallasw`cloud: so, you want to break toolsbeta right?
[19:39:56] petan: basically.
[19:40:02] ok. approved
[19:40:05] petan: I'm wondering what rm -rf / does.
[19:40:19] thanks :)
[19:40:25] I heard it display some easter egg linux hardcoded in kernel if you do that
[19:40:35] * Linus
[19:40:47] actually, it really does
[19:40:56] feel free to do that :P
[19:41:51] petan: hm, https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta doesn't show me and sudo doesn't work, so something is still not working...
[19:41:59] I tried logging out/in and purging the page
[19:42:18] valhallasw`cloud: tell me what is your labs name
[19:42:27] merlijn van deen
[19:42:58] case sensitive
[19:43:19] Merlijn van Deen, then. Autocomplete, yo.
[19:43:52] mhm... can't find you
[19:44:22] Merlijn_van_Deen? ;D https://wikitech.wikimedia.org/w/index.php?title=User:Merlijn_van_Deen
[19:44:33] Failed to add Merlijn van Deen to toolsbeta.
[19:44:38] I tried all combinations
[19:44:42] even with underscores
[19:44:56] I'm already a member, just not an admin, I think
[19:45:55] mhm
[19:46:09] you actually are
[19:46:28] which explains why you could add yourself to that group
[19:46:38] but sudo doesn't work... *confused*
[19:46:52] because you didn't logoff and back?
[19:47:17] I did a few times
[19:47:23] on linux or wikitech?
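A minimal sketch of clearing the stale known_hosts entries behind the warning above, assuming a stock OpenSSH client; the hostname and IP are taken from the warning itself:

```bash
# Hedged sketch: remove the outdated entries that trigger the ECDSA mismatch warning.
ssh-keygen -R tools-dev.wmflabs.org    # drops the hostname entry from ~/.ssh/known_hosts
ssh-keygen -R 208.80.155.132           # drops the IP entry as well
# The next connection will prompt to accept the new key; compare its fingerprint
# against the one in the upgrade announcement before answering yes.
```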
[19:47:25] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 36%, RTA = 4572.32 ms
[19:47:26] hm :S
[19:47:38] Krenair: you broke it! you fix it!
[19:47:53] petan: both :P
[19:48:02] let me try both again
[19:48:07] valhallasw`cloud: ok... maybe you don't deserve it :P
[19:48:10] :D
[19:48:12] https://wikitech.wikimedia.org/wiki/Special:NovaProxy is not showing me proxies for any project. I logged out and back in with no change.
[19:48:27] petan, no... I didn't break anything
[19:48:50] you aren't in "roots" sudo policy group
[19:48:54] pinging deployment-db2 from deployment-mediawiki01 is showing some really weird responses
[19:48:56] which means you don't have root
[19:49:25] Krenair: ip conflict? :P
[19:50:26] * valhallasw`cloud clicks the 'sudo policies' button
[19:50:33] valhallasw`cloud: I added you there because you clearly are incompetent to fix yourself :P
[19:50:42] petan: love you too <3
[19:50:59] everybody loves me. almost :o
[19:51:31] petan, why would an ip conflict cause such a huge variation in responses to ping?
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=1 ttl=64 time=0.229 ms
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=2 ttl=64 time=4321 ms
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=3 ttl=64 time=2.06 ms
[19:51:40] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=4 ttl=64 time=50.7 ms
[19:51:40] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=5 ttl=64 time=0.396 ms
[19:52:45] well, I can't thnk of anything better right now. it could many things though, even if kernel is loaded too much that it hangs up, the network can be overloaded or failing etc... there could be so many reasons, best thing is to ssh on destination and figure out OR reboot it
[19:53:05] then read logs to figure out why it was
[19:53:35] I personally think that target machine is just fucked up
[19:54:16] deployment-db1 has the same behaviour
[19:54:16] btw there is console window in wikitech that can display errors to you
[19:54:23] if there are some errors
[19:54:34] they would show up on tty1
[19:54:37] or active tty
[19:54:51] which is visible in wikitech
[19:55:02] it works even if you can't use ssh
[19:55:22] I can SSH to them
[19:57:17] ok, what is system load? is it normal? can you ping other vm's fine?
[19:57:45] if everything is normal on system, then the problem is likely on network
[19:59:43] petan: https://wikitech.wikimedia.org/wiki/Labs_infrastructure
[20:00:23] What's up with labs?
[20:00:52] T13|mobile, what are you seeing wrong specifically/
[20:00:54] I'm getting 404s from all tools
[20:01:24] Also, getting timeout trying to restart job
[20:01:26] T13|mobile: https://tools.wmflabs.org/magnustools/multistatus.html works for me?
[20:01:36] *shrug* works for me
[20:02:28] andrewbogott, is ping between labs machines supposed to frequently jump up to a few seconds?
[20:02:44] valhallasw`cloud: curl returns 404 on that.
[20:02:54] curl /where/
[20:03:08] Krenair: it shouldn't, although I see what you mean.
[20:03:19] you're seeing the same thing andrewbogott?
[20:03:20] tools-login
[20:03:45] Krenair: yes — I assume you're talking about how shinken keeps thinking that deployment instances are down, and up, and down?
[20:03:46] ah, okay. That I can reproduce.
[20:03:55] andrewbogott, yes
[20:03:59] I am also looking at ping
[20:04:25] And I know that at least one of the deployment-mediawiki* instances are having trouble connecting to deployment-db* instances
[20:04:39] probably all of them
[20:04:52] valhallasw`cloud: Do you guys have packetloss/error graphs somewhere?
[20:05:19] valhallasw`cloud: I get 404 across the board curling wikiviewstats xtools xtools-ec and xtools-articleinfo
[20:05:20] Hm. Those symptoms are consistent with connection timeouts Betacommand has had with his irc bot
[20:05:29] multichill: http://graphite.wmflabs.org/ doesn't really have packet loss info, I think
[20:05:29] T13|mobile: 404s?
[20:05:36] T13|mobile: *nod*. The proxy is doing something weird
[20:05:41] Yep
[20:06:05] andrewbogott: It does look as though something funky is going on with the network.
[20:06:11] valhallasw@tools-bastion-01:~$ ping tools.wmflabs.org
[20:06:11] PING tools-webproxy (10.68.16.4) 56(84) bytes of data.
[20:06:21] but that's not the internal IP for tools-webproxy-01...
[20:06:22] Run $ sh /data/project/xtools/checkall.sh
[20:06:29] wikitech isn't seeing proxy config for me either. Possibly related? https://wikitech.wikimedia.org/wiki/Special:NovaProxy is not showing me proxies for any project. I logged out and back in with no change.
[20:06:50] andrewbogott: beyond the simple spikes of migration.
[20:07:00] Coren: I've stopped migration so we can get a clean signal.
[20:07:03] You'll see my script that reports the status of all 4 tools
[20:07:29] andrewbogott: did the migration change IP addresses for the hosts?
[20:07:32] T13|mobile: I know YuviPanda has been experimenting with the new DNS server, but I think only on tools-dev
[20:07:41] YuviPanda: can you check out https://wikitech.wikimedia.org/wiki/Special:NovaProxy ? It's possible that it's related to our network problems, but you also just refactore that, right?
[20:07:52] Coren: tools.wmflabs.org resolves to the wrong IP internally. Externally everything is fine.
[20:08:02] valhallasw`cloud: no, it shouldn't. If you see any examples of that I'd like to know though.
[20:08:02] the beta problems have been happening since about 7pm PDT yesterday, weird variations and shinken exploding
[20:08:06] i get same on tools-dev and tools-login atm
[20:08:35] Having traffic and error graphs on underlying infra makes it a lot easier to debug seemingly weird problems....
[20:09:12] multichill: We /do/ have those. They're showing nothing beyond the ~20m spike of instance migration atm
[20:09:30] Also interface error graphs?
[20:10:24] wrt beta stuff andrewbogott tried shuffling around virt hosts, I've been monitoring the virtualmachines themselves nothing seems to be wrong there (some high load excepted), ganglia isn't showing any weird issues with the virt hosts themselves (near as I can tell)
[20:10:44] Coren, I'm comparing https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring to ping times and I don't see any visible spikes that correspond when ping times go up.
[20:10:59] Yeah, neither do I.
[20:11:20] There are big spikes, but they are your instance moves and don't match the actual lag.
[20:11:45] * Coren ponders.
[20:11:55] Coren: where is /etc/hosts defined?
[20:12:02] So labnet1001's interface being saturated isn't the cause.
[20:12:13] Hi
[20:12:17] valhallasw`cloud: Not all projects do it the same way.
[20:12:23] My body doesn't like daylight
[20:12:25] someone is reporting prod network performance issues as well
[20:12:27] For tools, specifically. Sorry :-)
[20:12:30] valhallasw`cloud: etc hosts isn't puppetized
[20:12:42] Coren, Krenair, if you ping something on an older host (e.g. bastion1) do you see the same irregular ping times?
[20:12:52] So far I do not.
[20:12:55] Coren: designate isn't being used on any tools hosts atm
[20:13:08] YuviPanda: Didn't you switch tools-dev to it?
[20:13:19] YuviPanda: tools.wmflabs.org resolved to tools-webproxy-01 originally, right?
[20:13:20] andrewbogott: looking at the novaproxy as soon as my computer decides to start up
[20:13:31] YuviPanda: thanks
[20:13:36] Coren: no, I deleted and recreated that instance since...
[20:13:42] andrewbogott, I logged into bastion1 and pinged deployment-mediawiki01
[20:13:42] valhallasw`cloud: yes
[20:13:45] It should
[20:13:50] andrewbogott, seeing some irregular ping times still
[20:14:09] Krenair: yes, all deployment hosts are on new labvirtxxx hardware though
[20:14:19] I see no interface errors.
[20:14:23] Sorry, it's not obvious to anyone but me where instances are hosted right now, do to an issue with the pages updating.
[20:15:04] Wouldn't horizon have up to date info atm?
[20:15:17] True, it would.
[20:16:38] ping times to the old virt hosts (including HP virt hosts like virt1010) are stable.
[20:16:48] This is specific to the new labvirt hosts, I'm 90% convinced.
[20:17:08] andrewbogott: That would explain why tools users only today started reporting issues.
[20:17:29] yeah. Although only a very small slice of tools is on new hosts
[20:17:54] The IP/hosts mixup for tools-webproxy is unrelated to the current move; the .139 IP has been there for a week at least
[20:17:58] Now I need to verify that it's happening on all new hosts…
[20:18:13] YuviPanda: time to puppetize /etc/hosts?
[20:18:26] or is it different between our hosts?
[20:18:37] (...all the more reason to puppetize it)
[20:18:42] valhallasw`cloud: Time to get /rid/ of /etc/hosts.
[20:18:49] There's a task for that, depending on designate
[20:19:20] valhallasw`cloud: tools.wmflabs.org is resolved via dnsmasq
[20:19:22] not /etc/hosts
[20:19:28] valhallasw`cloud: nova/network.pp
[20:19:34] * YuviPanda is on laptop now
[20:19:55] YuviPanda: so I should kick it out of /etc/hosts?
[20:20:02] valhallasw`cloud: oh it shouldn't be on /etc/hosts at all :|
[20:20:03] yes
[20:20:05] you should
[20:20:15] ok; will do
[20:21:04] valhallasw`cloud: I am not finding it on tools-login
[20:21:11] of course
[20:21:12] YuviPanda: I just removed it :P
[20:21:15] bah
[20:21:17] thanks
[20:21:25] boo unpuppetized /etc/hosts
[20:21:33] hah.
[20:21:41] but tools.wmflabs.org still resolves incorrectly. wth
[20:21:50] for ping at least
[20:21:55] andrewbogott: definitely network issue with deployment-salt
[20:22:04] rtt min/avg/max/mdev = 0.138/45.223/4296.494/349.471 ms, pipe 359, ipg/ewma 0.650/0.349 ms
[20:22:05] valhallasw`cloud: dig gives me correect values
[20:22:09] !log tools removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
[20:22:12] Logged the message, Master
[20:22:28] YuviPanda: yeah, nslookup as well.
[20:22:52] curls doesn't work, neitehr does ping
[20:23:01] !log project-proxy restarted dynamicproxy-api
[20:23:03] Logged the message, Master
[20:23:09] andrewbogott: ^ novaproxy is back up
[20:23:21] valhallasw`cloud: curl works for me (tools-login)
[20:23:43] YuviPanda: thanks
[20:24:17] YuviPanda: I get an empty page from integration-something (which is what 10.68.16.4 is now)
[20:24:33] valhallasw`cloud: tools-dev?
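A minimal sketch of the two resolution paths being compared above: getent follows the nsswitch chain (/etc/hosts, then nscd/DNS), while dig asks the configured DNS server directly, which is why the two can disagree while a stale /etc/hosts entry or cached result is in play:

```bash
# Hedged sketch of the name-resolution checks used in the discussion above.
getent hosts tools.wmflabs.org                     # nsswitch path: /etc/hosts, then nscd/DNS
dig +short tools.wmflabs.org                       # DNS directly, bypassing /etc/hosts and nscd
grep tools.wmflabs.org /etc/hosts || echo "no static entry"   # confirm the stale entry is really gone
```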
[20:24:51] andrewbogott: Can you tell me which virt hosts currently has deployment-salt?
[20:24:53] that one works
[20:25:30] valhallasw`cloud: yeah I see it
[20:25:33] (just on tools-dev)
[20:26:10] Coren: labvirt1001
[20:26:24] valhallasw`cloud: that's because you didn't remove the etchosts entry from tools-dev :D
[20:26:28] i just did. works fine now
[20:26:37] YuviPanda: wut?
[20:26:41] also, fuck this. let me puppetize it
[20:26:47] [Thu Apr 23 16:29:28 2015] br1102: port 25(vnet23) entered disabled state
[20:27:04] YuviPanda: valhallasw@tools-bastion-01:~$ ping tools.wmflabs.org --> PING tools-webproxy (10.68.16.4) 56(84) bytes of data.
[20:27:21] wtf
[20:27:24] it worked 5 minutes ago
[20:27:33] Then 5s later: [Thu Apr 23 18:20:34 2015] br1102: port 25(vnet23) entered forwarding state
[20:27:43] Well, everything is stable now that I'm watching :(
[20:28:00] YuviPanda: and it's not in /etc/hosts, and host tools.wmflabs.org works as expected.
[20:28:04] yeah
[20:28:11] valhallasw`cloud: works now
[20:28:19] odd
[20:28:22] oh well
[20:28:23] !log tools restarted nscd on tools-login and tools-dev
[20:28:26] valhallasw`cloud: ^ that's why
[20:28:26] Logged the message, Master
[20:28:27] I presume
[20:28:47] YuviPanda: Don't restart when you can just invalidate cache
[20:28:53] andrewbogott: It's still very unstable.
[20:28:59] T13|mobile: should be fixed now
[20:29:05] rtt min/avg/max/mdev = 0.135/2.359/1423.753/39.751 ms, pipe 119, ipg/ewma 0.389/0.262 ms
[20:29:24] Not as bad - 1.4s vs 5 earlier, but still ridiculously variable.
[20:29:36] Yeah. I'm pinging instances on all six hosts and watching the times...
[20:29:46] They very by 2 or 3x, not by 5000x like before
[20:29:52] But I'll keep watching
[20:29:58] andrewbogott: The bursts are short enough that you need a flood ping to see them.
[20:30:03] rtt min/avg/max/mdev = 0.135/4.567/2786.208/88.220 ms, pipe 232, ipg/ewma 0.405/0.419 ms
[20:30:07] valhallasw`cloud: +1
[20:30:13] YuviPanda: I puppetized the exim queue report, but I'm unsure how to test it on tools-dev
[20:30:18] YuviPanda: er, toolsbeta
[20:30:52] andrewbogott: zero packet loss though - things are getting stalled not lost. I have suspicion something odd is going on at the networking layer.
[20:31:03] valhallasw`cloud: me neither. scfc is basically the only person actively doing things there
[20:31:33] Coren: At first these boxes didn't have eth2 or eth3 wired up. For a little while I was thinking we just had 1/3 as much bandwidth for the new instances…
[20:31:43] But that doesn't seem right anymore
[20:32:12] Coren: any idea how else we can investigate?
[20:32:49] andrewbogott: I'm trying to locate where the clog is. Knowing this will help the network peeps.
[20:32:58] ok
[20:33:02] vm to vm: rtt min/avg/max/mdev = 0.141/8.612/2758.261/118.247 ms, pipe 230, ipg/ewma 0.453/0.330 ms
[20:34:01] lavbirt1001 <-> labnet1001: rtt min/avg/max/mdev = 0.058/0.076/1.332/0.038 ms, ipg/ewma 0.111/0.068 ms
[20:34:06] So it's not physical.
[20:35:39] host -> instance on host: rtt min/avg/max/mdev = 0.181/1.259/1123.077/25.814 ms, pipe 94, ipg/ewma 0.311/0.225 ms
[20:35:58] Something's wrong with the internal bridging - or the actual vm is stalling.
[20:36:26] So it could be cpu on the host
[20:36:34] Could be.
[20:36:37] * Coren digs further.
[20:36:57] Coren: called it :P
[20:37:14] same (host <-> instance on host): rtt min/avg/max/mdev = 0.175/6.549/4064.094/123.653 ms, pipe 338, ipg/ewma 0.335/0.255 ms
[20:37:20] Really, really bursty.
[20:37:31] * Coren compares to old hardware.
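A minimal sketch of the diagnostics mentioned above; the target host is only an example from the discussion, and flood ping needs root:

```bash
# Hedged sketch: flood ping makes sub-second stalls visible, and the
# rtt min/avg/max/mdev summary line is what is being quoted in the log.
sudo ping -f -c 2000 deployment-salt.eqiad.wmflabs
# And, per Coren's comment above, invalidate nscd's hosts cache instead of restarting the daemon:
sudo nscd -i hosts
```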
[20:37:55] Everything looks so happy on ganglia :(
[20:38:26] andrewbogott: That's normal, ganglia can't see the instances stalling; it only sees the things that /do/ go out.
[20:38:50] sure, I just thought there might be obvious CPU spikes
[20:39:03] andrewbogott: and it's a pipeline stall; the packets aren't lost.
[20:40:06] Bah. The interfaces have absolutely no errors.
[20:42:20] aha!
[20:42:48] aha?
[20:43:07] Looks like complete CPU stalls on the instances themselves. I'm seeing non-monotonal wallclock/cpu time increases.
[20:43:19] In bursts.
[20:43:49] It's not just clock drift from the migration?
[20:44:12] Something is wrong with the qemu running on the new hardware afaict.
[20:44:25] andrewbogott: No, it comes in bursts matching the apparent network stalls.
[20:44:37] oh! cool.
[20:44:48] Turns out the latter are symptoms of the vms simply not running at all for sometimes seconds.
[20:44:59] But the box isn't CPU starved at all.
[20:45:49] Wait. Why do we have both kvms and qemus running?
[20:46:52] That is probably this: https://gerrit.wikimedia.org/r/#/c/205093/
[20:47:00] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232397 (10valhallasw) 3NEW
[20:47:06] andrewbogott: Can you look up the instance ID of deployment-salt for me?
[20:47:23] abb50762-93d0-4c9d-8853-adbbb6b56e00
[20:47:31] Ah, okay, so they are really all qemu
[20:47:36] I meant the i-* name
[20:47:57] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232412 (10demon) There's already puppet compiler which lets you run it on a list of hosts. It's triggered via a Jenkins job.
[20:48:06] But that's pointless - I wanted to know if the issue was only visible on one of the two; they're really the same.
[20:49:21] ec2 id is i-0000015c
[20:49:22] * Coren tries to find a way to diagnose the core issue.
[20:49:23] 6Labs, 7Puppet: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1232425 (10scfc) 3NEW a:3scfc
[20:49:41] Coren: thanks for digging, I'm a bit burned out on this
[20:49:44] I can't really strace a qemu - that'd be... verbose. :-)
[20:49:49] yeah
[20:50:46] andrewbogott: At this time the only thing I can tell you for sure is that the vms themselves stall entirely for seconds at intermitent intervals, but not because they're actually CPU starved on the host. :-(
[20:51:43] dang
[20:52:10] virt1012 should be running the same software on the same hardware. The only difference is that it was an in-place upgrade from precise to trusty rather than a fresh build.
[20:52:19] It does not seem to have the problem
[20:52:33] That... makes no sense to me.
[20:52:43] quick google turns up this on pausing http://porkrind.org/missives/libvirt-based-qemu-vm-pausing-by-itself/
[20:52:52] YuviPanda: Hm.. can you look at bigbrother/service manifest? Something weird is going on for tools.list. I removed all bigbrother files and stopped the service, but it keeps coming back up from bigbrother and then failing because it is already running. How can I stop bigbrother?
In all other tools I just removed bigbrotherrc and restarted the webservice and it was fine.
[20:53:04] Krinkle: yeah, you can't stop bigbrother
[20:53:09] let me restart it and hope that lets it pick it up
[20:53:15] Coren: could be bios settings or something like that
[20:53:46] !log tools restart bigbrother
[20:53:48] YuviPanda: The problem is that because of bigbrother, it keeps overwriting service.manifest with -precise.
[20:53:49] Logged the message, Master
[20:53:50] Eventhough I want trusty
[20:53:55] Krinkle: this is why I haven't announced yet
[20:54:05] Krinkle: I restarted bigbrother, I wonder if that makes it 'forget'
[20:54:09] It was fine for my other tools though...
[20:54:09] andrewbogott: I'd be at a loss to guess which.
[20:54:12] assuming you have no .bigbrotherrc
[20:54:21] YuviPanda: It does.
[20:54:24] I just delted bbrc and restarted webservice and that created a fresh service-manifest
[20:54:34] but this one got stuck someho
[20:54:40] should be gone now
[20:54:46] and thanks for moving things to trusty!
[20:54:47] Coren: yeah, I documented the changes I made, it was just turning on hyperthreading and intel virt.
[20:54:51] It's possible that hyperthreading is off on virt1012...
[20:54:54] thcipriani: That's unrelated - that's instances entering 'paused' state, not just stalling.
[20:55:01] Of course it's hard to check the settings without rebooting :(
[20:55:16] OK. I stopped it and restarted it. it is now without precise. Cool :)
[20:55:32] andrewbogott: HT shouldn't be harmful nowadays. And if it had an effect it would be poor cache behaviour not complete vm stalls.
[20:56:21] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232473 (10valhallasw) That's https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/ If I understand correctly, that's triggered manually, and it runs on prod and not labs? O...
[20:56:43] The only thing I can think of is if rbsu> SET CONFIG INTEL(R) VIRTUALIZATION TECHNOLOGY has alternatives, like there's SOME OTHER(R) VIRTUALIZATION TECHNOLOGY that's enabled on virt1010-1012
[20:56:49] unlikely
[20:58:22] 10Tool-Labs: toolsbeta: set up puppet-compiler - https://phabricator.wikimedia.org/T97081#1232490 (10valhallasw)
[21:01:15] ^d: wait, puppet-compiler doesn't actually apply the change, right, it just shows the diff?
[21:02:20] <^d> It compiles it for the destination node and shows you what the net change would be, yeah
[21:02:43] andrewbogott: This looks suspiciously like https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473
[21:02:53] 10Tool-Labs: toolsbeta: set up puppet-compiler / temporary-apply - https://phabricator.wikimedia.org/T97081#1232507 (10valhallasw)
[21:03:00] ^d: ok, that's super useful.
[21:04:49] andrewbogott: quoth the report: "213 packets transmitted, 213 received, 0% packet loss, time 211998ms
[21:04:49] rtt min/avg/max/mdev = 0.136/106.283/2651.359/428.403 ms, pipe 3
[21:04:51]
[21:04:51] "
[21:04:59] That looks suspiciously like
[21:05:03] what we are seeing.
[21:05:25] virt1012 is running a slightly newer kernel
[21:05:31] virt1012 3.13.0-46-generic #76-Ubuntu
[21:05:40] labvirt1001 3.13.0-24-generic #47-Ubuntu
[21:06:51] Coren: what do you think, shall I upgrade and reboot one of the labvirts? (Wouldn't be the first time today :( )
[21:07:26] Coren: I could, in fact, start a migration back off of one of the hosts so we can test w/out causing downtime...
[21:07:33] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 suggests "This bug was fixed in the package linux - 3.13.0-33.58"
[21:07:53] fits
[21:07:56] dammit
[21:07:56] So 1012 would have the fix and 1001 not. fits.
[21:08:15] I guess I will evacuate labvirt1001
[21:08:32] At least live migration works reliably. :-)
[21:08:44] with a lot of hand-holding it does
[21:08:58] well...
[21:09:07] you know, I'm not going to evacuate entirely.
[21:09:10] I'm going to move the tools instances off
[21:09:16] and cvn
[21:09:25] and let beta suffer downtime, since it is /already/ suffering downtime anyway
[21:09:34] I have to go eat. Anything you urgently need me to do before?
[21:09:37] thcipriani: ^ any objection?
[21:09:41] Coren: no, you've done lots! Thanks
[21:10:02] hi andrewbogott
[21:10:02] anything I can do to help?
[21:10:33] YuviPanda: I don't think so — it'll just be slow, not that much work.
[21:10:48] andrewbogott: alright
[21:11:04] andrewbogott: certain tools instances can be failed over trivially if it'll save you time from moving them
[21:11:18] but if it's not much time / effort I guess you can just move them
[21:11:39] hm...
[21:11:39] andrewbogott: no objections, beta is already pretty unstable
[21:11:50] moving them back might be hard, actually. Let's see
[21:14:21] andrewbogott: ouch, most of the exec nodes are on labsvirt
[21:14:36] and both the bastions are on the same host. ugh.
[21:14:48] YuviPanda: keep in mind that I don't have to reboot all of them at once
[21:15:27] andrewbogott: yeah, fair enough. if you want to reboot one at a time, we can drain the exec nodes.
[21:15:39] but if just moving them around is going to be simpler, that's definitely going to be simpler
[21:16:46] andrewbogott: let me know if moving back is hard. me / Coren can co-ordinate draining tools
[21:17:25] andrewbogott: Yeah, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 fits our symptoms perfectly.
[21:17:59] andrewbogott: the good news is that linux guests' symptom is stalls in the 2s range, whereas windows guests just bsod. :-)
[21:18:11] YuviPanda: migrating to a mostly-full host is hard. Migrating around within the labvirt nodes should be ok.
[21:18:11] andrewbogott: Hm. so the ping failures continue. Here's a summary https://gist.github.com/Krinkle/8a1ddead6b2f7c7b089b
[21:18:15] Just tedious :)
[21:18:23] andrewbogott: alright :)
[21:18:42] Krinkle: I think we just diagnosed the issue, it'll take several hours to resolve fully though.
[21:18:52] Krinkle: and there will be some reboots
[21:19:20] * YuviPanda debates going to the office
[21:20:02] andrewbogott: I just checked the lkml and the patch is known to be in 3.14 also; what are your upgrade options?
[21:20:54] Coren: I don't know yet, I was just going to accept whatever apt-get dist-upgrade provides.
[21:21:10] I welcome more specific suggestions...
[21:21:49] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64a9a34e22896dad430e21a28ad8cb00a756fefc is the specific page. This has been backported to some 3.13 kernels by Ubuntu and/or Debian but is part of mainline in 3.14 since -rc1
[21:22:00] So if you want to be extra safe, a 3.14 kernel seems best.
[21:22:08] s/page/patch/
[21:23:08] apt wants to install 3.13.0-49
[21:23:19] What's the advantave of .14?
[21:23:58] We know the patch is in the head and thus shpepherded/cared for by the kernel dev team.
[21:24:07] vs "just" a backport.
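A minimal sketch of the kernel comparison and upgrade being discussed; the package names are the ones quoted in the log and would need checking against what the archive actually offers:

```bash
# Hedged sketch: compare running and installed kernels on the virt hosts,
# then pull in the 3.16 image Coren suggests above.
uname -r                                              # running kernel (3.13.0-24 on labvirt1001, 3.13.0-46 on virt1012)
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # installed kernel packages and versions
sudo apt-get install linux-image-3.16.0-34-generic    # package name taken from the log, not verified here
sudo reboot                                           # new kernel only takes effect after a reboot
```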
[21:24:26] * YuviPanda goes afk for food
[21:24:29] I'll be back shortly
[21:24:49] ok, so, to force that I'd just apt-get install linux-image-3.14 ? Or something more specific?
[21:25:07] Also I hear of regressions in 3.13.0-46-generic.
[21:26:04] Ah, Trusty doesn't have a 3.14; it hops from 3.13 to 3.16
[21:28:43] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44)
[21:29:09] we run 3.18 on Jessie, so I see no good reason to eschew the more modern kernels.
[21:29:19] linux-image-3.16.0-34-generic would be my pick.
[21:30:20] ok, great.
[21:30:25] I will try, when I get there...
[21:30:41] andrewbogott: is this ^ shinken-wm alert from you moving it or?
[21:30:51] YuviPanda: it's from…
[21:30:55] something that happened...
[21:31:00] ah :)
[21:31:00] ok
[21:31:14] all the migrations suddenly declared that they received a 'hangup' and then nova shut down two of the migrating instances and resumed the other three.
[21:31:19] ouch
[21:31:21] How did it decide? We will never know
[21:31:24] but I will do one at a time now.
[21:31:39] andrewbogott: Ah, okay. Good to know. I see it's affecting other instances as well, so I guess it started around the virt migration somehow.
[21:31:42] Anyway, that bastion is back up, shinken should notice shortly
[21:31:53] Krenair: kernel bug on the new hardware
[21:32:53] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[21:33:07] don't get too comfortable, shinken!
[21:46:44] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[21:47:51] ^
[21:48:39] Is there something in scrollback?
[21:50:02] a930913: we're working on it
[21:50:48] YuviPanda: Seem to be in now.
[21:50:54] set topic
[21:51:00] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[21:52:50] (03CR) 10Yuvipanda: Return a unique list of channels (remove dupes). (032 comments) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[21:54:57] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:00:14] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:00:20] (03CR) 10Yuvipanda: Return a unique list of channels (remove dupes). (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[22:11:01] (03PS3) 10Subramanya Sastry: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881
[22:11:52] (03CR) 10GWicke: Return a unique list of channels (remove dupes). (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[22:17:30] YuviPanda: can you drain tools-exec-09? That one is turning out to be troublesome.
[22:18:30] Coren, same question ^
[22:19:58] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0]
[22:29:46] andrewbogott: in about 10mins sure...
[22:30:09] YuviPanda: let me know when I'm clear to reboot.
[22:37:10] andrewbogott: attempting to do so now
[22:38:38] !log tools take tools-exec-09 from @general group
[22:38:41] Logged the message, Master
[22:39:03] * Coren catches up
[22:39:39] YuviPanda: There's a MUCH easier way, dude.
[22:39:44] ah :)
[22:39:45] do tell me
[22:40:06] (and I'll add it to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin)
[22:40:07] qmod -d '*@tools-exec-09'
[22:40:25] YuviPanda: That's just duplicating 'man qmod' :-)
[22:40:37] whelp, I was looking at qmod -help
[22:40:47] !log add tools-exec-09 back to @general
[22:40:47] add is not a valid project.
[22:40:52] !log tools add tools-exec-09 back to @general
[22:40:55] Logged the message, Master
[22:41:14] !log tools disabled *@tools-exec-09
[22:41:17] Logged the message, Master
[22:41:44] YuviPanda: I usually follow this up with a 'qmod -r
all of them?
[22:42:12] so that's qmod -rj continuous
[22:42:24] err
[22:42:25] -rq
[22:42:25] No, no, you just want those on the host!
[22:42:26] :-)
[22:42:43] Sorry, shouldn't have been implicit. :-)
[22:43:10] * YuviPanda is still a fairly bumbling idiot about GridEngine
[22:44:00] You can qhost to find them, but I generally just do
[22:44:01] qmod -r $(qhost -j -h tools-exec-cyberbot|sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])
[22:44:08] replacing for the right host
[22:44:13] so… ready for tools-exec-09 to go down?
[22:44:17] That will give you an error for non-restartable jobs.
[22:44:24] But it's harmless.
[22:44:34] yeah but too late because I'm an idiot and I rescheduled them all..
[22:44:43] Ow.
[22:44:50] Well, that'll cause churn but won't harm.
[22:44:57] yeah, just continuous jobs
[22:44:59] andrewbogott: nope
[22:45:12] ok
[22:45:18] YuviPanda: You can remove the tasks after a bit with the same command line, just qdel rather than qmod -r
[22:45:25] yeah
[22:45:35] I didn't know of qhost at all
[22:45:35] Make sure the continuous jobs are all gone with qhost -j -h first
[22:45:46] I should probably sit and read all their man pages today
[22:45:56] 'qhost -j -h' -> your best friend. :-)
[22:46:04] yeah
[22:46:07] I can see that now :D
[22:47:06] Coren: so there are currently three non-restartable, task@ jobs there
[22:47:13] do I just delete them?
[22:47:30] Coren: they've been running for a while
[22:47:31] hmm
[22:47:34] just deleting them seems wrong
[22:47:48] Coren: I guess we can just let andrewbogott reboot now, and gridengine will schedul ethem appropriately?
[22:47:57] Yeah - I usually give 'em a few minutes, but those seem to have been there for a long time so it's not predictable how they'll end. I only do this when it's important to evacuate the node.
[22:48:10] YuviPanda: tasks are not restartable anyways.
[22:48:20] Coren: yeah
[22:48:24] so good to reboot?
[22:48:29] * Coren nods.
[22:48:32] andrewbogott: ^
[22:48:40] Coren: can you document 'draining a node' at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin
[22:48:54] * Coren does so now then heads out.
[22:49:08] Coren: cool :) I created a stub
[22:50:48] YuviPanda: Remember to reenable the queues once the node is back up. (qmod -e)
[22:50:55] yup!
[22:53:29] andrewbogott: let me know when it's back up?
[22:53:33] PROBLEM - Host tools-exec-09 is DOWN: CRITICAL - Host Unreachable (10.68.17.64)
[22:53:42] yep
[22:53:50] as will shinken :)
[23:04:35] Coren_away: are you really away?
[23:04:48] andrewbogott: :D yay shinken
[23:04:57] You caught me, literally, as I was getting up from my chair. :-)
[23:04:59] shinken itself needs redundancy tho
[23:07:25] andrewbogott: ?
[23:07:44] The new kernel can't mount the filesystem. It drops into busybox
[23:07:54] I'm rebooting now, going to see what options grub offers.
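A consolidated, hedged sketch of the drain procedure Coren walks through above, using tools-exec-09 (the node being drained in the log) in place of the example host from his command line:

```bash
# Hedged sketch of draining a grid engine node before a reboot.
qmod -d '*@tools-exec-09'            # disable all queues on the node so nothing new lands there
qhost -j -h tools-exec-09            # list the jobs still running on it
# reschedule the restartable (continuous) jobs that are still on the node:
qmod -r $(qhost -j -h tools-exec-09 | sed -e 's/^\s*//' | cut -d ' ' -f 1 | egrep '^[0-9]')
# non-restartable task jobs either finish on their own or get qdel'd with the same pipeline;
# once qhost -j -h shows the node is empty it can be rebooted, then re-enable the queues:
qmod -e '*@tools-exec-09'
```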
[23:08:18] But if you want to grab the console and have a look, I won't object :)
[23:08:58] That's... how is that even possible? What error message are you getting?
[23:09:20] (You can boot the older kernel with grub in a pinch, mind you)
[23:09:45] https://dpaste.de/Aeer
[23:11:36] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[23:12:30] andrewbogott: You might have to explicitly do an update-initramfs? Did the disk controller need a special driver when you did the original install?
[23:12:57] The original install was a no-hands pxe boot
[23:13:06] Lemme go poke at the box for a minute.
[23:13:26] the .13-49 booted just fine
[23:13:35] so you can ssh now probably
[23:14:04] Did the drac password change?
[23:14:25] not that I know of
[23:14:36] Ah. HP box. admin@ not root@
[23:15:30] no, that's for ciscos
[23:15:30] RECOVERY - Host tools-exec-05 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[23:15:32] its root for hps
[23:15:46] Yeah, misspoke.
[23:16:06] Oh, you rebooted. I wanted to poke at the busybox.
[23:16:36] I rebooted into 3.13.0-49-generic which is working fine.
[23:16:36] But you can boot again if you want to let it fail
[23:17:13] Or we can just agree that 3.13.0-49-generic is the way to go :)
[23:17:38] I think we need this back up sooner rather than later. We'll experiment with 3.16 some other time.
[23:17:53] But do keep an eye on the behaviour before you migrate too many things. :-)
[23:18:03] yep
[23:18:19] * Coren_away goes away for real now unless you need me.
[23:18:23] So, to ensure that it sticks with 3.13.0-49-generic on the next reboot...
[23:18:52] How do I ^ ?
[23:19:21] Coren_away ^
[23:19:28] Oh.
[23:19:40] In our case, just remove the newer kernel entirely.
[23:19:48] No point in keeping it around.
[23:19:57] dpkg --purge?
[23:20:24] Now that I've swapped kernel on reboot, I'm sure I know how to predict what grub will do next time.
[23:20:25] PROBLEM - Host tools-exec-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.31)
[23:20:28] Otherwise, you can edit /etc/default/grub
[23:20:40] Yeah, a purge will redo the grub-setup
[23:20:51] Err, update-grub
[23:21:30] there's nothing about a kernel version in /etc/default/grub
[23:21:51] You have to add an entry. Like I said, in our case, just purge the package and you're golden.
[23:22:10] ok, great.
[23:22:10] Done.
[23:22:11] Thanks — you can safely dine now
[23:22:15] Probably :(
[23:24:15] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[23:24:21] RECOVERY - Host tools-exec-02 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[23:26:44] whee!
[23:28:22] andrewbogott: let me know if I should re-enable it. It's ok to just let it be as well, tho
[23:28:26] we have enough capacity atm
[23:28:47] YuviPanda: The next step is to decide if instances on labvirt1001 are no longer stuttering.
[23:28:54] So, I'm not sure… maybe best to wait a bit.
[23:28:55] ah, hmm
[23:28:58] alright
[23:45:52] Coren_away: when you have a chance (tomorrow AM is fine), could you verify that instances on labvirt1001 look right to you now? (They do, to me and Tyler). Here's a sampling: https://dpaste.de/A2pb
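A minimal sketch of the cleanup Coren describes near the end: purge the kernel that would not boot so grub falls back to the working one; the package name assumes the 3.16 image mentioned earlier is the only newer kernel installed:

```bash
# Hedged sketch of pinning the host to the known-good kernel by removing the newer one.
uname -r                                          # confirm the host is currently on 3.13.0-49-generic
sudo apt-get purge linux-image-3.16.0-34-generic  # drop the kernel that could not mount the filesystem
sudo update-grub                                  # regenerate grub.cfg so 3.13.0-49 stays the default
```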