[00:28:32] Coren: It's not critical, but I left a few comments at https://gerrit.wikimedia.org/r/#/c/204528/12/manifests/role/ci.pp - it seems the override file is cleared by something right after it starts.
[00:28:35] Is that expected?
[00:29:00] every 30 minutes the file is re-created and enable changed 'false' to 'true'
[00:29:10] From what I can tell it does not restart, but it seems odd
[00:43:31] (03PS1) 10Jforrester: Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040
[00:49:31] (03CR) 10Alex Monk: [C: 032] Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040 (owner: 10Jforrester)
[00:52:46] (03Merged) 10jenkins-bot: Move some from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206040 (owner: 10Jforrester)
[00:53:33] (03PS1) 10Jforrester: Move Math repo from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206041
[01:00:05] !log tools good bye tools-login.eqiad.wmflabs
[01:00:11] Logged the message, Master
[01:01:35] PROBLEM - Host tools-login is DOWN: CRITICAL - Host Unreachable (10.68.16.7)
[01:01:43] * YuviPanda pats shinken-wm
[01:01:44] we know
[02:05:23] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1230004 (10yuvipanda)
[02:05:24] 6Labs, 10Tool-Labs: Bare instance name does not resolve on tools - https://phabricator.wikimedia.org/T96642#1230001 (10yuvipanda) 5Open>3declined a:3yuvipanda It's ok. dig doesn't go through resolv.conf And dnsmasq was responding because it was also the dhcp server. nbd. Let's worry about this if it bre...
[03:06:09] * Negative24 plays taps
[03:31:38] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230075 (10Shilad) 3NEW
[04:23:27] 6Labs: lighttpd redirects URLs of directories without a trailing slash from https to http - https://phabricator.wikimedia.org/T66627#1230155 (10Krinkle) See also {T95164}.
[04:56:01] 6Labs, 10Continuous-Integration: Create an instance image like m1.small with 2 CPUs and 30GB space - https://phabricator.wikimedia.org/T96706#1230309 (10Krinkle) 5Resolved>3Open Our goal for 30G space was based on the following estimate: > 10G for system, 10G for git replication and 10G for workspace. How...
[05:04:54] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1230315 (10yuvipanda) 5Open>3Resolved a:3yuvipanda It is gone now.
[06:17:53] I have problemets with php on command line on tool-dev. It's display sourcecode insted for run php-script.
[06:23:07] hey steenth
[06:23:15] what command are you executing? what tool is this?
[06:26:01] it is in my own account.. command is php
[06:26:30] can you tell me what account it is and the exact commandline you were using?
[06:28:30] php ~steenth/test.php
[06:30:54] steenth: it was missing a full opening tag
[06:30:57] it is running a newer version of PHP now
[06:31:07] that does not support the short tag
[06:31:14] I added and it works now
[06:31:20] okay
[06:31:50] yw
[06:32:35] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230376 (10yuvipanda) Hi! Have you considered hosting it on tools.wmflabs.org instead?
[06:32:53] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1230377 (10yuvipanda) Also @halfak might already have a project for this?
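For reference, a minimal sketch of the short-tag problem described above, assuming the script is the ~steenth/test.php from the exchange (the exact file is only an example):

```bash
# Hedged sketch of the diagnosis above; the script path is the one from the log,
# everything else is illustrative.
head -n 1 ~steenth/test.php                            # shows the opening tag; "<?" is the short form
php -r 'var_dump((bool) ini_get("short_open_tag"));'   # false on the upgraded PHP, so "<?" is treated as plain text and echoed
# Fix: change the first line of the script from "<?" to the full "<?php" tag.
```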
[06:54:17] YuviPanda, my problem is change in config-setting for short_open_tag - It's change from on to off (I use same version at home)
[06:54:35] steenth: yup - this is a side effect of php changing from version 5.3 to 5.5
[06:54:45] this was announced on labs-l and the labs-announce mailing lists
[06:54:58] the short_open_tag has been deprecated by PHP for many years now
[06:55:02] and scripts should use <?php
whatever happened to the sweet ASCII art on tools-login and tools-dev? :(
[07:53:05] (03CR) 10Nemo bis: [C: 031] "Thanks for clarifying the intention." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205222 (owner: 10Werdna)
[08:03:01] a few tools fails on http but work the 2nd time on https :/
[08:05:41] afeder: hope they do come back at some point, yeah. there’s a bug with our PAM setup...
[08:09:27] ah
[08:27:36] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-static redundant - https://phabricator.wikimedia.org/T96966#1230510 (10yuvipanda) 3NEW
[08:28:05] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-mail redundant - https://phabricator.wikimedia.org/T96967#1230517 (10yuvipanda) 3NEW
[08:29:07] afeder: https://phabricator.wikimedia.org/T85910 for reference
[08:29:18] and https://phabricator.wikimedia.org/T85307
[09:53:58] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/FNDE was created, changed by FNDE link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fFNDE edit summary: Created page with "{{Tools Access Request |Justification=Using the API for recent changes |Completed=false |User Name=FNDE }}"
[11:27:29] hello
[11:27:41] I cannot get my PHP script to work on cron
[11:27:54] cron.err says "libgcc_s.so.1 must be installed for pthread_cancel to work"
[11:30:20] hello?
[11:45:52] Superyetkin: you need to give it more memory, see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Why_am_I_getting_errors_about_libgcc_s.so.1_must_be_installed_for_pthread_cancel_to_work.3F
[11:46:15] e.g. jstart -mem 2G ./somescript
[11:48:03] sitic: thanks
[11:48:35] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Cipher was created, changed by Cipher link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fCipher edit summary: Created page with "{{Tools Access Request |Justification=Making things better. Personal experience. |Completed=false |User Name=Cipher }}"
[12:40:36] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1231012 (10Halfak) @yuvipanda, I don't have a specific project in mind for WikiBrain. The services will need to hold onto a lot of of ram (~16GB) long term, so it seemed to me that a shared environment that a shared environment wouldn't w...
[13:17:50] !log tools beginning migration of tools instances to labvirt100x hosts
[13:17:53] Logged the message, dummy
[13:18:39] andrewbogott: This is done one at a time, right?
[13:18:50] Coren: yep, once every 20 mins.
[13:19:04] alphabetically by name
[13:20:29] Speaking of, I'm going to start the reboot-if-idmap rounds in ~40m. tools and deployment-prep are already done, so no impact there and I know many others have been done by their admins.
[13:20:54] ok, shouldn’t interferere.
[13:20:58] 10Tool-Labs-tools-Global-user-contributions: https://tools.wmflabs.org/guc - https://phabricator.wikimedia.org/T94351#1231078 (10Oleg3280) please, fix this bug. links do not work. sorry for bad english. thanks.
[13:21:00] um… interferererererere
[13:22:18] * Coren ponders.
[13:23:02] Actually, I should tweak that script to actually attempt unmounting and unloading the modules before reboot - that may save a couple of needless reboots of instances that aren't actually using NFS at this time.
[13:49:05] andrewbogott: should anything have changed with ldap recently, and/or who should I ask about that. deployment-prep (and staging to a lesser extent) are having a bad day looking for gid 50062 deployment-bastion, keep seeing "nslcd[1293]: [13d149] error writing to client: Broken pipe"
[13:50:16] thcipriani|afk: nothing should have happened with ldap. It's possible that your instance is just starved for network access. Is that mostly happening on one particular vm?
[13:51:33] andrewbogott: no, seems to be fairly systemic, see the shinken alerts in -releng, happening for the past 9 hours or so across various instances
[13:52:12] is it failing every time, or intermittently?
[13:52:24] I mean, the shinken alerts + the ldap problems could be systemic of a larger network issue, the only thing I see in syslog is what I mentioned
[13:52:28] intermittently
[13:52:51] also instances are running super slowly
[13:54:42] thcipriani|afk: ok — let me clear the thing I'm working on now, then I'll look around
[13:54:57] andrewbogott: thanks!
[14:15:20] YuviPanda: https://gerrit.wikimedia.org/r/206118 but needs a lookover from someone who gets puppet syntax -_-'
[14:15:39] and I fail pep8
[14:15:41] trololol
[14:16:16] 'line too long, 81 > 79 characters'
[14:16:31] http://troll.me/images/atomic-rage/fuuuuuuuuu.jpg
[15:03:21] I'm getting random 404 from tools
[15:05:01] Nemo_bis: by 'random' do you mean intermittent?
[15:05:11] 10Tool-Labs: s7.labsdb long lag - https://phabricator.wikimedia.org/T96646#1231278 (10eranroz) 5Open>3Resolved
[15:05:43] Nemo_bis: 404s? That's normally from the tool itself, I don't think the proxy can generate a 404 itself.
[15:06:53] andrewbogott: yes, intermittent
[15:07:17] the "not serviced" error, that's a 404 right?
[15:08:18] Is any one tool going down and back up and back down again, or is that that each given tool shows only a short period down?
[15:08:26] Ah, no, that's a 503. If you're getting those intermittently it means that the tool is crashing repeatedly and being restarted.
[15:10:34] Coren: could also be from the proxy being unable to contact the exec host?
[15:10:41] andrewbogott: Daily lulz: if you run a script that reboots instances from an instance, make sure the instance you're running the script /from/ isn't in the lst. :-)
[15:11:06] hah! Like sawing off the branch you're sitting on
[15:11:08] andrewbogott: No, that'd just give a normal 502 - 503 means there is no mapping for the tool in redis
[15:11:39] Coren: great, then I will presume that this is not my fault :)
[15:12:08] andrewbogott: Well unless the tool is really sensitive to network bandwidth and that's why it crashes. :-) But yeah, probably not you.
[15:12:16] Nemo_bis: Are you seeing this on a particular tool?
[15:12:49] (Yuvi's new manifest thing is more obstinate about restarting webservices, but it doesn't help webservices that crash not crashing in the first place)
[15:12:55] 6Labs, 10Beta-Cluster: Migrate deployment-prep to new labvirt hosts - https://phabricator.wikimedia.org/T96678#1231284 (10Andrew) 5Open>3Resolved This is done!
[15:15:31] andrewbogott: Also now I totally share your pain re "instances that don't properly run puppet or merge from it regularily"
[15:15:57] I think it was https://tools.wmflabs.org/xtools/pages/?user=Yiyi&lang=it&wiki=wikipedia&namespace=all&redirects=none
[15:16:12] Might be a timeout, rather than a progressing restart
[15:16:51] Nemo_bis: xtools is about as stable as jenga on rolleskates. I know T13 is helping rewrite most of it in a separate project, but it's a wip.
[15:17:21] Of course, but errors should be truthful :)
[15:17:37] Nemo_bis: ?
[15:17:47] The proxy really can't tell /why/ a tool just went down, all it can do is tell you that it's not currently up.
[15:19:21] The only time I see 404 on any xTools is when something is going on with labs (tool specific or otherwise). When the tool locks up it returns a 503.
[15:19:42] Restarting doesn't fix 404s.
[15:19:48] T13|mobile: They weren't 404 - they were 503s. :_)
[15:20:09] I'll restart in a minute then.
[15:20:20] * Coren loves it when people name instances after themselves. Makes some error messages and status reports more interesting.
[15:20:32] "legoktm is out of date; rebooting"
[15:21:00] legoktm is out of date though...
[15:21:04] :p
[15:23:36] well, he *is* on vacation, isn't he? :P
[15:24:05] Rename and flee
[15:24:46] T13|mobile: are you restarting right now? just got a blank page :)
[15:25:02] Yep, done.
[15:25:39] What drives me crazy are the 5xx errors upon URLs which just redirect elsewhere. Sigh.
[15:26:07] Where? Huh?
[15:27:52] -ec is working now. -articleinfo too.
[15:28:47] No topedits though. Jrm.
[15:29:46] So xTools core still isn't fixed... restarting again.
[15:30:16] * Coren is gratified by the number of 'not needed' the run is getting. Looks like some people actually read announcements. :-)
[15:31:34] Done.
[15:31:46] Announcements?
[15:31:50] ;p
[15:42:16] !log migrating cvn-app4 to labvirt1001 in an attempt to redistribute load a bit
[15:42:17] migrating is not a valid project.
[15:42:31] !log cvn migrating cvn-app4 to labvirt1001 in an attempt to redistribute load a bit
[15:42:35] Logged the message, dummy
[16:10:02] andrewbogott: Looks like all but the broken puppets are done.
[16:53:49] andrewbogott: Do you have any particular preference on how I handle instances that haven't been puppetized properly in (apparently) some time?
[16:54:33] Coren: not really. If they're self-hosted sometimes I make a local branch to preserve local changes and then just unilaterally reset to origin
[16:54:48] But it's polite to notify people when you do that :)
[17:01:20] andrewbogott: Alternately, I can hotfix I suppose - it's a one-liner file to add.
[17:01:37] andrewbogott: And if puppet kicks in it's a noop since all it does is add that file.
[17:01:50] that seems easy enough
[17:04:16] I'm going to let the instances that aren't running at all just lay - they'll have to fix the broken if/when they are restarted.
[17:05:45] It's not integration hosts that you're working on, is it?
[17:05:56] * Coren checks.
[17:06:30] andrewbogott: Nope, they all popped up 'not needed'
[17:06:57] ok, unrelated shinken warnings then
[17:07:00] Ah, not integration-publisher - that one ended up being rebooted. Quite some time ago though.
[17:07:11] (~100m ago)
[17:10:37] andrewbogott: afaict, integration is having puppet issues because hiera. Not my work.
[17:10:49] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item contint::nodepool_host in any Hiera data file and no default supplied at /etc/puppet/modules/contint/manifests/firewall.pp:28 on node i-00000932.eqiad.wmflabs
[17:10:52] great, I will continue to ignore then :)
[17:10:54] must be wip
[17:13:25] !log cvn rebooted cvn-app5 and cvn-app4 (not at the same time) because I suspect them of running amok
[17:13:28] Logged the message, dummy
[17:19:47] !log deployment-prep rebooting deployment-parsoidcache02 because it seems troubled
[17:19:50] Logged the message, dummy
[17:48:42] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/FNDE was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155354 edit summary:
[17:48:49] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Cipher was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155355 edit summary:
[17:59:32] Coren: any network issues right now?
[18:00:04] Betacommand: There may be ocassional spikes of network lagginess as instances are being migrated to the new hardware.
[18:00:35] Tends to be short bursts though. What are you seeing?
[18:00:35] OK, ive just seen my IRC bot timeout again
[18:00:53] Hm. Do you know what your timeout is?
[18:01:28] its an IRC ping timeout factor, not sure what freenode's setting is
[18:02:13] I think they're fairly conservative; maybe a minute.
[18:02:29] That's still fairly long - I'll keep a closer eye on it.
[18:03:16] its my stalkboten job thats timing out
[18:03:34] Just the one?
[18:04:29] Its the only copy im seeing die
[18:05:17] Hm. Might not be network related then - I would expect others would suffer equally. Perhaps something else is making the bot stall and not respond to pings?
[18:05:28] DB dependency?
[18:05:33] Nope
[18:05:57] filesystem is reasonably snappy given the network load, so probably not that.
[18:05:57] its freenode/wmf crossover irc bot
[18:06:01] * Coren ponders.
[18:06:17] Do you know what job number it has? I'll check the node it runs on.
[18:06:23] the codes fairly stable, havent really changed it in 10 years
[18:08:08] Coren: http://pastebin.com/Nj27Pr0w
[18:08:59] Hm.
[18:09:22] exec-03 doesn't seem unusually loaded or laggy.
[18:09:35] Mind if I strace your job while it runs so I can catch it in the act if it times out again?
[18:09:44] feel free
[18:10:21] Just dont publish the IRC creds if you run across them :P
[18:13:18] oh. mah. gawd. Php is such a ridiculously bad I/O design - it actuall does a poll() with a timout of 0 with a loop around it that does a nanosleep(500000)
[18:14:31] And does recvfrom() with MSG_DONTWAIT inbetween.
[18:15:20] This isn't going to help - I can't log this to disk it'll add about 30 lines of strace info 2000 times per second.
[18:18:21] Coren: Ive been meaning to re-write it for years, I inherited this code
[18:18:45] Betacommand: I don't think it's your fault - php's io layer has always been crap. :-)
[18:19:01] Coren: Its PHP
[18:19:09] which is crap
[18:19:16] I prefer python
[18:19:30] At any rate, I've got as much instrumentation as I can on the process; it may help figure out what happens exactly if it dies again.
[18:19:55] I'm a perl hacker myself, but python will do in a pinch. :-)
[18:20:21] Coren: its only died twice today, nothing earth shattering. And the actual process doesnt die, it just kinda disconnects
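A minimal sketch of attaching strace to the running bot as discussed above; the PID, output path and syscall filter are assumptions, not what Coren actually ran:

```bash
# Hedged sketch: attach strace to the long-running job to catch a stall in the act.
BOT_PID=12345   # placeholder; the real PID comes from qstat/ps on the exec node
sudo strace -f -tt -e trace=network,poll,nanosleep -o /tmp/stalkboten.strace -p "$BOT_PID"
# -f follows threads/forks, -tt adds microsecond timestamps so a multi-second gap is obvious;
# note Coren's caveat above: PHP's busy-poll loop makes unfiltered output enormous.
```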
[18:28:53] hey, Everytime I want to connect to labs I get this warning:
[18:28:53] Warning: the ECDSA host key for 'tools-dev.wmflabs.org' differs from the key for the IP address '208.80.155.132'
[18:29:05] Offending key for IP in /home/amir/.ssh/known_hosts:15
[18:29:05] Matching host key in /home/amir/.ssh/known_hosts:22
[18:29:18] It makes me worried if something is wrong about my pc
[18:29:26] Amir1: tools-dev has been upgraded; the new key is mentionned in the announcement email.
[18:29:42] so I should fix it by hand
[18:29:49] thanks Coren :)
[19:14:57] andrewbogott_afk: We need to do more to discourage people away from self-hosted puppet.
[19:22:29] Coren: out of curiosity how many virt nodes we have now and what is their hardware? do we have this somewhere public?
[19:28:36] petan: I don't know if it's public anywhere, but it's not secret. Lemme try to find that out for you - I'm not sure anymore because we got new hardware coming in and we're decomissioning some. :-)
[19:30:18] Coren, any idea what's going on with beta?
[19:30:43] Krenair: I know of nothing going on with the beta cluster, specifically, but I'm not the best one to ask. :-)
[19:30:51] ok
[19:31:07] I noticed shinken in -releng saying a bunch of hosts were going down and up etc.
[19:31:20] and there are issues connecting to DBs apparently
[19:36:34] Coren, could you add me to https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta ?
[19:37:25] Coren: I can help you decomission some of the hardware XD I could find a room for it in my appartment
[19:38:08] like some cute server with 16 cpu's and 512gb of ram :3
[19:38:22] valhallasw`cloud: Actually, petan's the one to ask for that. :-P
[19:38:43] valhallasw`cloud: sec
[19:38:47] wait, I see I can just add myself to toolsbeta.admin :P
[19:38:49] petan: I think that the WMF generally donates hardware that's still useful to nonprofits who need it. :-)
[19:38:58] (but I'm not an admin in the project? wth)
[19:39:25] Coren: I would make a cat-nest out of that server for my cats and that counts :P
[19:39:50] valhallasw`cloud: so, you want to break toolsbeta right?
[19:39:56] petan: basically.
[19:40:02] ok. approved
[19:40:05] petan: I'm wondering what rm -rf / does.
[19:40:19] thanks :)
[19:40:25] I heard it display some easter egg linux hardcoded in kernel if you do that
[19:40:35] * Linus
[19:40:47] actually, it really does
[19:40:56] feel free to do that :P
[19:41:51] petan: hm, https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta doesn't show me and sudo doesn't work, so something is still not working...
[19:41:59] I tried logging out/in and purging the page
[19:42:18] valhallasw`cloud: tell me what is your labs name
[19:42:27] merlijn van deen
[19:42:58] case sensitive
[19:43:19] Merlijn van Deen, then. Autocomplete, yo.
[19:43:52] mhm... can't find you
[19:44:22] Merlijn_van_Deen? ;D https://wikitech.wikimedia.org/w/index.php?title=User:Merlijn_van_Deen
[19:44:33] Failed to add Merlijn van Deen to toolsbeta.
[19:44:38] I tried all combinations
[19:44:42] even with underscores
[19:44:56] I'm already a member, just not an admin, I think
[19:45:55] mhm
[19:46:09] you actually are
[19:46:28] which explains why you could add yourself to that group
[19:46:38] but sudo doesn't work... *confused*
[19:46:52] because you didn't logoff and back?
[19:47:17] I did a few times
[19:47:23] on linux or wikitech?
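A minimal sketch of clearing the stale known_hosts entries behind the warning above, assuming a stock OpenSSH client; the hostname and IP are taken from the warning itself:

```bash
# Hedged sketch: remove the outdated entries that trigger the ECDSA mismatch warning.
ssh-keygen -R tools-dev.wmflabs.org    # drops the hostname entry from ~/.ssh/known_hosts
ssh-keygen -R 208.80.155.132           # drops the IP entry as well
# The next connection will prompt to accept the new key; compare its fingerprint
# against the one in the upgrade announcement before answering yes.
```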
[19:47:25] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 36%, RTA = 4572.32 ms
[19:47:26] hm :S
[19:47:38] Krenair: you broke it! you fix it!
[19:47:53] petan: both :P
[19:48:02] let me try both again
[19:48:07] valhallasw`cloud: ok... maybe you don't deserve it :P
[19:48:10] :D
[19:48:12] https://wikitech.wikimedia.org/wiki/Special:NovaProxy is not showing me proxies for any project. I logged out and back in with no change.
[19:48:27] petan, no... I didn't break anything
[19:48:50] you aren't in "roots" sudo policy group
[19:48:54] pinging deployment-db2 from deployment-mediawiki01 is showing some really weird responses
[19:48:56] which means you don't have root
[19:49:25] Krenair: ip conflict? :P
[19:50:26] * valhallasw`cloud clicks the 'sudo policies' button
[19:50:33] valhallasw`cloud: I added you there because you clearly are incompetent to fix yourself :P
[19:50:42] petan: love you too <3
[19:50:59] everybody loves me. almost :o
[19:51:31] petan, why would an ip conflict cause such a huge variation in responses to ping?
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=1 ttl=64 time=0.229 ms
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=2 ttl=64 time=4321 ms
[19:51:39] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=3 ttl=64 time=2.06 ms
[19:51:40] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=4 ttl=64 time=50.7 ms
[19:51:40] 64 bytes from deployment-db2.eqiad.wmflabs (10.68.17.94): icmp_seq=5 ttl=64 time=0.396 ms
[19:52:45] well, I can't thnk of anything better right now. it could many things though, even if kernel is loaded too much that it hangs up, the network can be overloaded or failing etc... there could be so many reasons, best thing is to ssh on destination and figure out OR reboot it
[19:53:05] then read logs to figure out why it was
[19:53:35] I personally think that target machine is just fucked up
[19:54:16] deployment-db1 has the same behaviour
[19:54:16] btw there is console window in wikitech that can display errors to you
[19:54:23] if there are some errors
[19:54:34] they would show up on tty1
[19:54:37] or active tty
[19:54:51] which is visible in wikitech
[19:55:02] it works even if you can't use ssh
[19:55:22] I can SSH to them
[19:57:17] ok, what is system load? is it normal? can you ping other vm's fine?
[19:57:45] if everything is normal on system, then the problem is likely on network
[19:59:43] petan: https://wikitech.wikimedia.org/wiki/Labs_infrastructure
[20:00:23] What's up with labs?
[20:00:52] T13|mobile, what are you seeing wrong specifically/
[20:00:54] I'm getting 404s from all tools
[20:01:24] Also, getting timeout trying to restart job
[20:01:26] T13|mobile: https://tools.wmflabs.org/magnustools/multistatus.html works for me?
[20:01:36] *shrug* works for me
[20:02:28] andrewbogott, is ping between labs machines supposed to frequently jump up to a few seconds?
[20:02:44] valhallasw`cloud: curl returns 404 on that.
[20:02:54] curl /where/
[20:03:08] Krenair: it shouldn't, although I see what you mean.
[20:03:19] you're seeing the same thing andrewbogott?
[20:03:20] tools-login
[20:03:45] Krenair: yes — I assume you're talking about how shinken keeps thinking that deployment instances are down, and up, and down?
[20:03:46] ah, okay. That I can reproduce.
[20:03:55] andrewbogott, yes
[20:03:59] I am also looking at ping
[20:04:25] And I know that at least one of the deployment-mediawiki* instances are having trouble connecting to deployment-db* instances
[20:04:39] probably all of them
[20:04:52] valhallasw`cloud: Do you guys have packetloss/error graphs somewhere?
[20:05:19] valhallasw`cloud: I get 404 across the board curling wikiviewstats xtools xtools-ec and xtools-articleinfo
[20:05:20] Hm. Those symptoms are consistent with connection timeouts Betacommand has had with his irc bot
[20:05:29] multichill: http://graphite.wmflabs.org/ doesn't really have packet loss info, I think
[20:05:29] T13|mobile: 404s?
[20:05:36] T13|mobile: *nod*. The proxy is doing something weird
[20:05:41] Yep
[20:06:05] andrewbogott: It does look as though something funky is going on with the network.
[20:06:11] valhallasw@tools-bastion-01:~$ ping tools.wmflabs.org
[20:06:11] PING tools-webproxy (10.68.16.4) 56(84) bytes of data.
[20:06:21] but that's not the internal IP for tools-webproxy-01...
[20:06:22] Run $ sh /data/project/xtools/checkall.sh
[20:06:29] wikitech isn't seeing proxy config for me either. Possibly related? https://wikitech.wikimedia.org/wiki/Special:NovaProxy is not showing me proxies for any project. I logged out and back in with no change.
[20:06:50] andrewbogott: beyond the simple spikes of migration.
[20:07:00] Coren: I've stopped migration so we can get a clean signal.
[20:07:03] You'll see my script that reports the status of all 4 tools
[20:07:29] andrewbogott: did the migration change IP addresses for the hosts?
[20:07:32] T13|mobile: I know YuviPanda has been experimenting with the new DNS server, but I think only on tools-dev
[20:07:41] YuviPanda: can you check out https://wikitech.wikimedia.org/wiki/Special:NovaProxy ? It's possible that it's related to our network problems, but you also just refactore that, right?
[20:07:52] Coren: tools.wmflabs.org resolves to the wrong IP internally. Externally everything is fine.
[20:08:02] valhallasw`cloud: no, it shouldn't. If you see any examples of that I'd like to know though.
[20:08:02] the beta problems have been happening since about 7pm PDT yesterday, weird variations and shinken exploding
[20:08:06] i get same on tools-dev and tools-login atm
[20:08:35] Having traffic and error graphs on underlying infra makes it a lot easier to debug seemingly weird problems....
[20:09:12] multichill: We /do/ have those. They're showing nothing beyond the ~20m spike of instance migration atm
[20:09:30] Also interface error graphs?
[20:10:24] wrt beta stuff andrewbogott tried shuffling around virt hosts, I've been monitoring the virtualmachines themselves nothing seems to be wrong there (some high load excepted), ganglia isn't showing any weird issues with the virt hosts themselves (near as I can tell)
[20:10:44] Coren, I'm comparing https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring to ping times and I don't see any visible spikes that correspond when ping times go up.
[20:10:59] Yeah, neither do I.
[20:11:20] There are big spikes, but they are your instance moves and don't match the actual lag.
[20:11:45] * Coren ponders.
[20:11:55] Coren: where is /etc/hosts defined?
[20:12:02] So labnet1001's interface being saturated isn't the cause.
[20:12:13] Hi
[20:12:17] valhallasw`cloud: Not all projects do it the same way.
[20:12:23] My body doesn't like daylight
[20:12:25] someone is reporting prod network performance issues as well
[20:12:27] For tools, specifically. Sorry :-)
[20:12:30] valhallasw`cloud: etc hosts isn't puppetized
[20:12:42] Coren, Krenair, if you ping something on an older host (e.g. bastion1) do you see the same irregular ping times?
[20:12:52] So far I do not.
[20:12:55] Coren: designate isn't being used on any tools hosts atm
[20:13:08] YuviPanda: Didn't you switch tools-dev to it?
[20:13:19] YuviPanda: tools.wmflabs.org resolved to tools-webproxy-01 originally, right?
[20:13:20] andrewbogott: looking at the novaproxy as soon as my computer decides to start up
[20:13:31] YuviPanda: thanks
[20:13:36] Coren: no, I deleted and recreated that instance since...
[20:13:42] andrewbogott, I logged into bastion1 and pinged deployment-mediawiki01
[20:13:42] valhallasw`cloud: yes
[20:13:45] It should
[20:13:50] andrewbogott, seeing some irregular ping times still
[20:14:09] Krenair: yes, all deployment hosts are on new labvirtxxx hardware though
[20:14:19] I see no interface errors.
[20:14:23] Sorry, it's not obvious to anyone but me where instances are hosted right now, do to an issue with the pages updating.
[20:15:04] Wouldn't horizon have up to date info atm?
[20:15:17] True, it would.
[20:16:38] ping times to the old virt hosts (including HP virt hosts like virt1010) are stable.
[20:16:48] This is specific to the new labvirt hosts, I'm 90% convinced.
[20:17:08] andrewbogott: That would explain why tools users only today started reporting issues.
[20:17:29] yeah. Although only a very small slice of tools is on new hosts
[20:17:54] The IP/hosts mixup for tools-webproxy is unrelated to the current move; the .139 IP has been there for a week at least
[20:17:58] Now I need to verify that it's happening on all new hosts…
[20:18:13] YuviPanda: time to puppetize /etc/hosts?
[20:18:26] or is it different between our hosts?
[20:18:37] (...all the more reason to puppetize it)
[20:18:42] valhallasw`cloud: Time to get /rid/ of /etc/hosts.
[20:18:49] There's a task for that, depending on designate
[20:19:20] valhallasw`cloud: tools.wmflabs.org is resolved via dnsmasq
[20:19:22] not /etc/hosts
[20:19:28] valhallasw`cloud: nova/network.pp
[20:19:34] * YuviPanda is on laptop now
[20:19:55] YuviPanda: so I should kick it out of /etc/hosts?
[20:20:02] valhallasw`cloud: oh it shouldn't be on /etc/hosts at all :|
[20:20:03] yes
[20:20:05] you should
[20:20:15] ok; will do
[20:21:04] valhallasw`cloud: I am not finding it on tools-login
[20:21:11] of course
[20:21:12] YuviPanda: I just removed it :P
[20:21:15] bah
[20:21:17] thanks
[20:21:25] boo unpuppetized /etc/hosts
[20:21:33] hah.
[20:21:41] but tools.wmflabs.org still resolves incorrectly. wth
[20:21:50] for ping at least
[20:21:55] andrewbogott: definitely network issue with deployment-salt
[20:22:04] rtt min/avg/max/mdev = 0.138/45.223/4296.494/349.471 ms, pipe 359, ipg/ewma 0.650/0.349 ms
[20:22:05] valhallasw`cloud: dig gives me correect values
[20:22:09] !log tools removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
[20:22:12] Logged the message, Master
[20:22:28] YuviPanda: yeah, nslookup as well.
[20:22:52] curls doesn't work, neitehr does ping
[20:23:01] !log project-proxy restarted dynamicproxy-api
[20:23:03] Logged the message, Master
[20:23:09] andrewbogott: ^ novaproxy is back up
[20:23:21] valhallasw`cloud: curl works for me (tools-login)
[20:23:43] YuviPanda: thanks
[20:24:17] YuviPanda: I get an empty page from integration-something (which is what 10.68.16.4 is now)
[20:24:33] valhallasw`cloud: tools-dev?
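A minimal sketch of the two resolution paths being compared above: getent follows the nsswitch chain (/etc/hosts, then nscd/DNS), while dig asks the configured DNS server directly, which is why the two can disagree while a stale /etc/hosts entry or cached result is in play:

```bash
# Hedged sketch of the name-resolution checks used in the discussion above.
getent hosts tools.wmflabs.org                     # nsswitch path: /etc/hosts, then nscd/DNS
dig +short tools.wmflabs.org                       # DNS directly, bypassing /etc/hosts and nscd
grep tools.wmflabs.org /etc/hosts || echo "no static entry"   # confirm the stale entry is really gone
```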
[20:24:51] andrewbogott: Can you tell me which virt hosts currently has deployment-salt?
[20:24:53] that one works
[20:25:30] valhallasw`cloud: yeah I see it
[20:25:33] (just on tools-dev)
[20:26:10] Coren: labvirt1001
[20:26:24] valhallasw`cloud: that's because you didn't remove the etchosts entry from tools-dev :D
[20:26:28] i just did. works fine now
[20:26:37] YuviPanda: wut?
[20:26:41] also, fuck this. let me puppetize it
[20:26:47] [Thu Apr 23 16:29:28 2015] br1102: port 25(vnet23) entered disabled state
[20:27:04] YuviPanda: valhallasw@tools-bastion-01:~$ ping tools.wmflabs.org --> PING tools-webproxy (10.68.16.4) 56(84) bytes of data.
[20:27:21] wtf
[20:27:24] it worked 5 minutes ago
[20:27:33] Then 5s later: [Thu Apr 23 18:20:34 2015] br1102: port 25(vnet23) entered forwarding state
[20:27:43] Well, everything is stable now that I'm watching :(
[20:28:00] YuviPanda: and it's not in /etc/hosts, and host tools.wmflabs.org works as expected.
[20:28:04] yeah
[20:28:11] valhallasw`cloud: works now
[20:28:19] odd
[20:28:22] oh well
[20:28:23] !log tools restarted nscd on tools-login and tools-dev
[20:28:26] valhallasw`cloud: ^ that's why
[20:28:26] Logged the message, Master
[20:28:27] I presume
[20:28:47] YuviPanda: Don't restart when you can just invalidate cache
[20:28:53] andrewbogott: It's still very unstable.
[20:28:59] T13|mobile: should be fixed now
[20:29:05] rtt min/avg/max/mdev = 0.135/2.359/1423.753/39.751 ms, pipe 119, ipg/ewma 0.389/0.262 ms
[20:29:24] Not as bad - 1.4s vs 5 earlier, but still ridiculously variable.
[20:29:36] Yeah. I'm pinging instances on all six hosts and watching the times...
[20:29:46] They very by 2 or 3x, not by 5000x like before
[20:29:52] But I'll keep watching
[20:29:58] andrewbogott: The bursts are short enough that you need a flood ping to see them.
[20:30:03] rtt min/avg/max/mdev = 0.135/4.567/2786.208/88.220 ms, pipe 232, ipg/ewma 0.405/0.419 ms
[20:30:07] valhallasw`cloud: +1
[20:30:13] YuviPanda: I puppetized the exim queue report, but I'm unsure how to test it on tools-dev
[20:30:18] YuviPanda: er, toolsbeta
[20:30:52] andrewbogott: zero packet loss though - things are getting stalled not lost. I have suspicion something odd is going on at the networking layer.
[20:31:03] valhallasw`cloud: me neither. scfc is basically the only person actively doing things there
[20:31:33] Coren: At first these boxes didn't have eth2 or eth3 wired up. For a little while I was thinking we just had 1/3 as much bandwidth for the new instances…
[20:31:43] But that doesn't seem right anymore
[20:32:12] Coren: any idea how else we can investigate?
[20:32:49] andrewbogott: I'm trying to locate where the clog is. Knowing this will help the network peeps.
[20:32:58] ok
[20:33:02] vm to vm: rtt min/avg/max/mdev = 0.141/8.612/2758.261/118.247 ms, pipe 230, ipg/ewma 0.453/0.330 ms
[20:34:01] lavbirt1001 <-> labnet1001: rtt min/avg/max/mdev = 0.058/0.076/1.332/0.038 ms, ipg/ewma 0.111/0.068 ms
[20:34:06] So it's not physical.
[20:35:39] host -> instance on host: rtt min/avg/max/mdev = 0.181/1.259/1123.077/25.814 ms, pipe 94, ipg/ewma 0.311/0.225 ms
[20:35:58] Something's wrong with the internal bridging - or the actual vm is stalling.
[20:36:26] So it could be cpu on the host
[20:36:34] Could be.
[20:36:37] * Coren digs further.
[20:36:57] Coren: called it :P
[20:37:14] same (host <-> instance on host): rtt min/avg/max/mdev = 0.175/6.549/4064.094/123.653 ms, pipe 338, ipg/ewma 0.335/0.255 ms
[20:37:20] Really, really bursty.
[20:37:31] * Coren compares to old hardware.
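A minimal sketch of the diagnostics mentioned above; the target host is only an example from the discussion, and flood ping needs root:

```bash
# Hedged sketch: flood ping makes sub-second stalls visible, and the
# rtt min/avg/max/mdev summary line is what is being quoted in the log.
sudo ping -f -c 2000 deployment-salt.eqiad.wmflabs
# And, per Coren's comment above, invalidate nscd's hosts cache instead of restarting the daemon:
sudo nscd -i hosts
```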
[20:37:55] Everything looks so happy on ganglia :(
[20:38:26] andrewbogott: That's normal, ganglia can't see the instances stalling; it only sees the things that /do/ go out.
[20:38:50] sure, I just thought there might be obvious CPU spikes
[20:39:03] andrewbogott: and it's a pipeline stall; the packets aren't lost.
[20:40:06] Bah. The interfaces have absolutely no errors.
[20:42:20] aha!
[20:42:48] aha?
[20:43:07] Looks like complete CPU stalls on the instances themselves. I'm seeing non-monotonal wallclock/cpu time increases.
[20:43:19] In bursts.
[20:43:49] It's not just clock drift from the migration?
[20:44:12] Something is wrong with the qemu running on the new hardware afaict.
[20:44:25] andrewbogott: No, it comes in bursts matching the apparent network stalls.
[20:44:37] oh! cool.
[20:44:48] Turns out the latter are symptoms of the vms simply not running at all for sometimes seconds.
[20:44:59] But the box isn't CPU starved at all.
[20:45:49] Wait. Why do we have both kvms and qemus running?
[20:46:52] That is probably this: https://gerrit.wikimedia.org/r/#/c/205093/
[20:47:00] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232397 (10valhallasw) 3NEW
[20:47:06] andrewbogott: Can you look up the instance ID of deployment-salt for me?
[20:47:23] abb50762-93d0-4c9d-8853-adbbb6b56e00
[20:47:31] Ah, okay, so they are really all qemu
[20:47:36] I meant the i-* name
[20:47:57] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232412 (10demon) There's already puppet compiler which lets you run it on a list of hosts. It's triggered via a Jenkins job.
[20:48:06] But that's pointless - I wanted to know if the issue was only visible on one of the two; they're really the same.
[20:49:21] ec2 id is i-0000015c
[20:49:22] * Coren tries to find a way to diagnose the core issue.
[20:49:23] 6Labs, 7Puppet: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1232425 (10scfc) 3NEW a:3scfc
[20:49:41] Coren: thanks for digging, I'm a bit burned out on this
[20:49:44] I can't really strace a qemu - that'd be... verbose. :-)
[20:49:49] yeah
[20:50:46] andrewbogott: At this time the only thing I can tell you for sure is that the vms themselves stall entirely for seconds at intermitent intervals, but not because they're actually CPU starved on the host. :-(
[20:51:43] dang
[20:52:10] virt1012 should be running the same software on the same hardware. The only difference is that it was an in-place upgrade from precise to trusty rather than a fresh build.
[20:52:19] It does not seem to have the problem
[20:52:33] That... makes no sense to me.
[20:52:43] quick google turns up this on pausing http://porkrind.org/missives/libvirt-based-qemu-vm-pausing-by-itself/
[20:52:52] YuviPanda: Hm.. can you look at bigbrother/service manifest? Something weird is going on for tools.list. I removed all bigbrother files and stopped the service, but it keeps coming back up from bigbrother and then failing because it is already running. How can I stop bigbrother?
In all other tools I just removed bigbrotherrc and restarted the webservice and it was fine.
[20:53:04] Krinkle: yeah, you can't stop bigbrother
[20:53:09] let me restart it and hope that lets it pick it up
[20:53:15] Coren: could be bios settings or something like that
[20:53:46] !log tools restart bigbrother
[20:53:48] YuviPanda: The problem is that because of bigbrother, it keeps overwriting service.manifest with -precise.
[20:53:49] Logged the message, Master
[20:53:50] Eventhough I want trusty
[20:53:55] Krinkle: this is why I haven't announced yet
[20:54:05] Krinkle: I restarted bigbrother, I wonder if that makes it 'forget'
[20:54:09] It was fine for my other tools though...
[20:54:09] andrewbogott: I'd be at a loss to guess which.
[20:54:12] assuming you have no .bigbrotherrc
[20:54:21] YuviPanda: It does.
[20:54:24] I just delted bbrc and restarted webservice and that created a fresh service-manifest
[20:54:34] but this one got stuck someho
[20:54:40] should be gone now
[20:54:46] and thanks for moving things to trusty!
[20:54:47] Coren: yeah, I documented the changes I made, it was just turning on hyperthreading and intel virt.
[20:54:51] It's possible that hyperthreading is off on virt1012...
[20:54:54] thcipriani: That's unrelated - that's instances entering 'paused' state, not just stalling.
[20:55:01] Of course it's hard to check the settings without rebooting :(
[20:55:16] OK. I stopped it and restarted it. it is now without precise. Cool :)
[20:55:32] andrewbogott: HT shouldn't be harmful nowadays. And if it had an effect it would be poor cache behaviour not complete vm stalls.
[20:56:21] 10Tool-Labs: toolsbeta: create tool to apply a gerrit change for testing - https://phabricator.wikimedia.org/T97081#1232473 (10valhallasw) That's https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/ If I understand correctly, that's triggered manually, and it runs on prod and not labs? O...
[20:56:43] The only thing I can think of is if rbsu> SET CONFIG INTEL(R) VIRTUALIZATION TECHNOLOGY has alternatives, like there's SOME OTHER(R) VIRTUALIZATION TECHNOLOGY that's enabled on virt1010-1012
[20:56:49] unlikely
[20:58:22] 10Tool-Labs: toolsbeta: set up puppet-compiler - https://phabricator.wikimedia.org/T97081#1232490 (10valhallasw)
[21:01:15] ^d: wait, puppet-compiler doesn't actually apply the change, right, it just shows the diff?
[21:02:20] <^d> It compiles it for the destination node and shows you what the net change would be, yeah
[21:02:43] andrewbogott: This looks suspiciously like https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473
[21:02:53] 10Tool-Labs: toolsbeta: set up puppet-compiler / temporary-apply - https://phabricator.wikimedia.org/T97081#1232507 (10valhallasw)
[21:03:00] ^d: ok, that's super useful.
[21:04:49] andrewbogott: quoth the report: "213 packets transmitted, 213 received, 0% packet loss, time 211998ms
[21:04:49] rtt min/avg/max/mdev = 0.136/106.283/2651.359/428.403 ms, pipe 3
[21:04:51]
[21:04:51] "
[21:04:59] That looks suspiciously like
[21:05:03] what we are seeing.
[21:05:25] virt1012 is running a slightly newer kernel
[21:05:31] virt1012 3.13.0-46-generic #76-Ubuntu
[21:05:40] labvirt1001 3.13.0-24-generic #47-Ubuntu
[21:06:51] Coren: what do you think, shall I upgrade and reboot one of the labvirts? (Wouldn't be the first time today :( )
[21:07:26] Coren: I could, in fact, start a migration back off of one of the hosts so we can test w/out causing downtime...
[21:07:33] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 suggests "This bug was fixed in the package linux - 3.13.0-33.58"
[21:07:53] fits
[21:07:56] dammit
[21:07:56] So 1012 would have the fix and 1001 not. fits.
[21:08:15] I guess I will evacuate labvirt1001
[21:08:32] At least live migration works reliably. :-)
[21:08:44] with a lot of hand-holding it does
[21:08:58] well...
[21:09:07] you know, I'm not going to evacuate entirely.
[21:09:10] I'm going to move the tools instances off
[21:09:16] and cvn
[21:09:25] and let beta suffer downtime, since it is /already/ suffering downtime anyway
[21:09:34] I have to go eat. Anything you urgently need me to do before?
[21:09:37] thcipriani: ^ any objection?
[21:09:41] Coren: no, you've done lots! Thanks
[21:10:02] hi andrewbogott
[21:10:02] anything I can do to help?
[21:10:33] YuviPanda: I don't think so — it'll just be slow, not that much work.
[21:10:48] andrewbogott: alright
[21:11:04] andrewbogott: certain tools instances can be failed over trivially if it'll save you time from moving them
[21:11:18] but if it's not much time / effort I guess you can just move them
[21:11:39] hm...
[21:11:39] andrewbogott: no objections, beta is already pretty unstable
[21:11:50] moving them back might be hard, actually. Let's see
[21:14:21] andrewbogott: ouch, most of the exec nodes are on labsvirt
[21:14:36] and both the bastions are on the same host. ugh.
[21:14:48] YuviPanda: keep in mind that I don't have to reboot all of them at once
[21:15:27] andrewbogott: yeah, fair enough. if you want to reboot one at a time, we can drain the exec nodes.
[21:15:39] but if just moving them around is going to be simpler, that's definitely going to be simpler
[21:16:46] andrewbogott: let me know if moving back is hard. me / Coren can co-ordinate draining tools
[21:17:25] andrewbogott: Yeah, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 fits our symptoms perfectly.
[21:17:59] andrewbogott: the good news is that linux guests' symptom is stalls in the 2s range, whereas windows guests just bsod. :-)
[21:18:11] YuviPanda: migrating to a mostly-full host is hard. Migrating around within the labvirt nodes should be ok.
[21:18:11] andrewbogott: Hm. so the ping failures continue. Here's a summary https://gist.github.com/Krinkle/8a1ddead6b2f7c7b089b
[21:18:15] Just tedious :)
[21:18:23] andrewbogott: alright :)
[21:18:42] Krinkle: I think we just diagnosed the issue, it'll take several hours to resolve fully though.
[21:18:52] Krinkle: and there will be some reboots
[21:19:20] * YuviPanda debates going to the office
[21:20:02] andrewbogott: I just checked the lkml and the patch is known to be in 3.14 also; what are your upgrade options?
[21:20:54] Coren: I don't know yet, I was just going to accept whatever apt-get dist-upgrade provides.
[21:21:10] I welcome more specific suggestions...
[21:21:49] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64a9a34e22896dad430e21a28ad8cb00a756fefc is the specific page. This has been backported to some 3.13 kernels by Ubuntu and/or Debian but is part of mainline in 3.14 since -rc1
[21:22:00] So if you want to be extra safe, a 3.14 kernel seems best.
[21:22:08] s/page/patch/
[21:23:08] apt wants to install 3.13.0-49
[21:23:19] What's the advantave of .14?
[21:23:58] We know the patch is in the head and thus shpepherded/cared for by the kernel dev team.
[21:24:07] vs "just" a backport.
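A minimal sketch of the kernel comparison and upgrade being discussed; the package names are the ones quoted in the log and would need checking against what the archive actually offers:

```bash
# Hedged sketch: compare running and installed kernels on the virt hosts,
# then pull in the 3.16 image Coren suggests above.
uname -r                                              # running kernel (3.13.0-24 on labvirt1001, 3.13.0-46 on virt1012)
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # installed kernel packages and versions
sudo apt-get install linux-image-3.16.0-34-generic    # package name taken from the log, not verified here
sudo reboot                                           # new kernel only takes effect after a reboot
```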
[21:24:26] * YuviPanda goes afk for food
[21:24:29] I'll be back shortly
[21:24:49] ok, so, to force that I'd just apt-get install linux-image-3.14 ? Or something more specific?
[21:25:07] Also I hear of regressions in 3.13.0-46-generic.
[21:26:04] Ah, Trusty doesn't have a 3.14; it hops from 3.13 to 3.16
[21:28:43] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44)
[21:29:09] we run 3.18 on Jessie, so I see no good reason to eschew the more modern kernels.
[21:29:19] linux-image-3.16.0-34-generic would be my pick.
[21:30:20] ok, great.
[21:30:25] I will try, when I get there...
[21:30:41] andrewbogott: is this ^ shinken-wm alert from you moving it or?
[21:30:51] YuviPanda: it's from…
[21:30:55] something that happened...
[21:31:00] ah :)
[21:31:00] ok
[21:31:14] all the migrations suddenly declared that they received a 'hangup' and then nova shut down two of the migrating instances and resumed the other three.
[21:31:19] ouch
[21:31:21] How did it decide? We will never know
[21:31:24] but I will do one at a time now.
[21:31:39] andrewbogott: Ah, okay. Good to know. I see it's affecting other instances as well, so I guess it started around the virt migration somehow.
[21:31:42] Anyway, that bastion is back up, shinken should notice shortly
[21:31:53] Krenair: kernel bug on the new hardware
[21:32:53] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[21:33:07] don't get too comfortable, shinken!
[21:46:44] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[21:47:51] ^
[21:48:39] Is there something in scrollback?
[21:50:02] a930913: we're working on it
[21:50:48] YuviPanda: Seem to be in now.
[21:50:54] set topic
[21:51:00] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[21:52:50] (03CR) 10Yuvipanda: Return a unique list of channels (remove dupes). (032 comments) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[21:54:57] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:00:14] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:00:20] (03CR) 10Yuvipanda: Return a unique list of channels (remove dupes). (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[22:11:01] (03PS3) 10Subramanya Sastry: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881
[22:11:52] (03CR) 10GWicke: Return a unique list of channels (remove dupes). (031 comment) [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[22:17:30] YuviPanda: can you drain tools-exec-09? That one is turning out to be troublesome.
[22:18:30] Coren, same question ^
[22:19:58] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0]
[22:29:46] andrewbogott: in about 10mins sure...
[22:30:09] YuviPanda: let me know when I'm clear to reboot.
[22:37:10] andrewbogott: attempting to do so now
[22:38:38] !log tools take tools-exec-09 from @general group
[22:38:41] Logged the message, Master
[22:39:03] * Coren catches up
[22:39:39] YuviPanda: There's a MUCH easier way, dude.
[22:39:44] ah :)
[22:39:45] do tell me
[22:40:06] (and I'll add it to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin)
[22:40:07] qmod -d '*@tools-exec-09'
[22:40:25] YuviPanda: That's just duplicating 'man qmod' :-)
[22:40:37] whelp, I was looking at qmod -help
[22:40:47] !log add tools-exec-09 back to @general
[22:40:47] add is not a valid project.
[22:40:52] !log tools add tools-exec-09 back to @general
[22:40:55] Logged the message, Master
[22:41:14] !log tools disabled *@tools-exec-09
[22:41:17] Logged the message, Master
[22:41:44] YuviPanda: I usually follow this up with a 'qmod -r
all of them?
[22:42:12] so that's qmod -rj continuous
[22:42:24] err
[22:42:25] -rq
[22:42:25] No, no, you just want those on the host!
[22:42:26] :-)
[22:42:43] Sorry, shouldn't have been implicit. :-)
[22:43:10] * YuviPanda is still a fairly bumbling idiot about GridEngine
[22:44:00] You can qhost to find them, but I generally just do
[22:44:01] qmod -r $(qhost -j -h tools-exec-cyberbot|sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])
[22:44:08] replacing for the right host
[22:44:13] so… ready for tools-exec-09 to go down?
[22:44:17] That will give you an error for non-restartable jobs.
[22:44:24] But it's harmless.
[22:44:34] yeah but too late because I'm an idiot and I rescheduled them all..
[22:44:43] Ow.
[22:44:50] Well, that'll cause churn but won't harm.
[22:44:57] yeah, just continuous jobs
[22:44:59] andrewbogott: nope
[22:45:12] ok
[22:45:18] YuviPanda: You can remove the tasks after a bit with the same command line, just qdel rather than qmod -r
[22:45:25] yeah
[22:45:35] I didn't know of qhost at all
[22:45:35] Make sure the continuous jobs are all gone with qhost -j -h first
[22:45:46] I should probably sit and read all their man pages today
[22:45:56] 'qhost -j -h' -> your best friend. :-)
[22:46:04] yeah
[22:46:07] I can see that now :D
[22:47:06] Coren: so there are currently three non-restartable, task@ jobs there
[22:47:13] do I just delete them?
[22:47:30] Coren: they've been running for a while
[22:47:31] hmm
[22:47:34] just deleting them seems wrong
[22:47:48] Coren: I guess we can just let andrewbogott reboot now, and gridengine will schedul ethem appropriately?
[22:47:57] Yeah - I usually give 'em a few minutes, but those seem to have been there for a long time so it's not predictable how they'll end. I only do this when it's important to evacuate the node.
[22:48:10] YuviPanda: tasks are not restartable anyways.
[22:48:20] Coren: yeah
[22:48:24] so good to reboot?
[22:48:29] * Coren nods.
[22:48:32] andrewbogott: ^
[22:48:40] Coren: can you document 'draining a node' at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin
[22:48:54] * Coren does so now then heads out.
[22:49:08] Coren: cool :) I created a stub
[22:50:48] YuviPanda: Remember to reenable the queues once the node is back up. (qmod -e)
[22:50:55] yup!
[22:53:29] andrewbogott: let me know when it's back up?
[22:53:33] PROBLEM - Host tools-exec-09 is DOWN: CRITICAL - Host Unreachable (10.68.17.64)
[22:53:42] yep
[22:53:50] as will shinken :)
[23:04:35] Coren_away: are you really away?
[23:04:48] andrewbogott: :D yay shinken
[23:04:57] You caught me, literally, as I was getting up from my chair. :-)
[23:04:59] shinken itself needs redundancy tho
[23:07:25] andrewbogott: ?
[23:07:44] The new kernel can't mount the filesystem. It drops into busybox
[23:07:54] I'm rebooting now, going to see what options grub offers.
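A consolidated, hedged sketch of the drain procedure Coren walks through above, using tools-exec-09 (the node being drained in the log) in place of the example host from his command line:

```bash
# Hedged sketch of draining a grid engine node before a reboot.
qmod -d '*@tools-exec-09'            # disable all queues on the node so nothing new lands there
qhost -j -h tools-exec-09            # list the jobs still running on it
# reschedule the restartable (continuous) jobs that are still on the node:
qmod -r $(qhost -j -h tools-exec-09 | sed -e 's/^\s*//' | cut -d ' ' -f 1 | egrep '^[0-9]')
# non-restartable task jobs either finish on their own or get qdel'd with the same pipeline;
# once qhost -j -h shows the node is empty it can be rebooted, then re-enable the queues:
qmod -e '*@tools-exec-09'
```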
[23:08:18] But if you want to grab the console and have a look, I won't object :)
[23:08:58] That's... how is that even possible? What error message are you getting?
[23:09:20] (You can boot the older kernel with grub in a pinch, mind you)
[23:09:45] https://dpaste.de/Aeer
[23:11:36] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[23:12:30] andrewbogott: You might have to explicitly do an update-initramfs? Did the disk controller need a special driver when you did the original install?
[23:12:57] The original install was a no-hands pxe boot
[23:13:06] Lemme go poke at the box for a minute.
[23:13:26] the .13-49 booted just fine
[23:13:35] so you can ssh now probably
[23:14:04] Did the drac password change?
[23:14:25] not that I know of
[23:14:36] Ah. HP box. admin@ not root@
[23:15:30] no, that's for ciscos
[23:15:30] RECOVERY - Host tools-exec-05 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[23:15:32] its root for hps
[23:15:46] Yeah, misspoke.
[23:16:06] Oh, you rebooted. I wanted to poke at the busybox.
[23:16:36] I rebooted into 3.13.0-49-generic which is working fine.
[23:16:36] But you can boot again if you want to let it fail
[23:17:13] Or we can just agree that 3.13.0-49-generic is the way to go :)
[23:17:38] I think we need this back up sooner rather than later. We'll experiment with 3.16 some other time.
[23:17:53] But do keep an eye on the behaviour before you migrate too many things. :-)
[23:18:03] yep
[23:18:19] * Coren_away goes away for real now unless you need me.
[23:18:23] So, to ensure that it sticks with 3.13.0-49-generic on the next reboot...
[23:18:52] How do I ^ ?
[23:19:21] Coren_away ^
[23:19:28] Oh.
[23:19:40] In our case, just remove the newer kernel entirely.
[23:19:48] No point in keeping it around.
[23:19:57] dpkg --purge?
[23:20:24] Now that I've swapped kernel on reboot, I'm sure I know how to predict what grub will do next time.
[23:20:25] PROBLEM - Host tools-exec-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.31)
[23:20:28] Otherwise, you can edit /etc/default/grub
[23:20:40] Yeah, a purge will redo the grub-setup
[23:20:51] Err, update-grub
[23:21:30] there's nothing about a kernel version in /etc/default/grub
[23:21:51] You have to add an entry. Like I said, in our case, just purge the package and you're golden.
[23:22:10] ok, great.
[23:22:10] Done.
[23:22:11] Thanks — you can safely dine now
[23:22:15] Probably :(
[23:24:15] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms
[23:24:21] RECOVERY - Host tools-exec-02 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[23:26:44] whee!
[23:28:22] andrewbogott: let me know if I should re-enable it. It's ok to just let it be as well, tho
[23:28:26] we have enough capacity atm
[23:28:47] YuviPanda: The next step is to decide if instances on labvirt1001 are no longer stuttering.
[23:28:54] So, I'm not sure… maybe best to wait a bit.
[23:28:55] ah, hmm
[23:28:58] alright
[23:45:52] Coren_away: when you have a chance (tomorrow AM is fine), could you verify that instances on labvirt1001 look right to you now? (They do, to me and Tyler). Here's a sampling: https://dpaste.de/A2pb
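A minimal sketch of the cleanup Coren describes near the end: purge the kernel that would not boot so grub falls back to the working one; the package name assumes the 3.16 image mentioned earlier is the only newer kernel installed:

```bash
# Hedged sketch of pinning the host to the known-good kernel by removing the newer one.
uname -r                                          # confirm the host is currently on 3.13.0-49-generic
sudo apt-get purge linux-image-3.16.0-34-generic  # drop the kernel that could not mount the filesystem
sudo update-grub                                  # regenerate grub.cfg so 3.13.0-49 stays the default
```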