[00:54:25] YuviPanda, Coren_away: If you happen to be around, something seems to have gotten confused. The grid says job 120810 is running on tools-exec-13, which it is (pid 31243), but it's also running on tools-exec-14 (pid 18252). Same for job 120815, grid has it on -14 (pid 23331) but it's also running on -15 (pid 29007). And job 120816, grid has it on -15 (pid 19830) but it's also running on -14 (pid 18251). And job 120818, grid has it on -06 (pid [00:54:25] 12680) but it's also running on -14 (pid 20783). And the grid thinks job 120817 is on -14, but it's not running anywhere I can see. [00:55:19] Err, scratch that last. I forgot it uses a different process name. [01:01:34] PROBLEM - Host tools-login is DOWN: CRITICAL - Host Unreachable (10.68.16.7) [02:45:56] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228) [02:46:59] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [04:39:21] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 10 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1237456 (10Andrew) @aklapper, 1,000,000 thanks for taking this on. I'm happy to review those tasks, although you might have to ping me di... [06:21:16] (03PS1) 10Yuvipanda: Support repositories field being just a dictionary [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206754 [06:30:25] (03PS2) 10Yuvipanda: Support repositories field being just a dictionary [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206754 [06:33:42] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [06:49:10] PROBLEM - Puppet failure on tools-exec-11 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:49:37] (03PS3) 10Yuvipanda: Support repositories field being just a dictionary [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206754 [06:54:36] (03CR) 10Yuvipanda: [C: 032 V: 032] Support repositories field being just a dictionary [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206754 (owner: 10Yuvipanda) [07:02:03] (03PS1) 10Yuvipanda: Support 'repository' in package.json [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206757 [07:03:46] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [07:06:58] (03PS2) 10Yuvipanda: Support 'repository' in package.json [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206757 [07:15:48] (03PS1) 10Yuvipanda: Support repositories without protocol specified [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206758 [07:18:21] (03PS2) 10Yuvipanda: Support repositories without protocol specified [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206758 [07:27:49] (03CR) 10Yuvipanda: [C: 032 V: 032] Support 'repository' in package.json [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206757 (owner: 10Yuvipanda) [07:28:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Support repositories without protocol specified [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206758 (owner: 10Yuvipanda) [07:57:17] 6Labs: /etc/ssh/userkeys/ubuntu or /etc/ssh/userkeys/admin notices for every puppet run on labs instances - https://phabricator.wikimedia.org/T94866#1237571 (10akosiaris) [08:01:01] <_joe_> !log deployment-prep installed hhvm 3.6 on deployment-mediawiki02 [08:01:07] Logged the message, Master [09:35:57] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228) [09:36:57] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 22.50 ms [10:40:12] 10Wikimedia-Labs-wikitech-interface, 6Project-Creators: Migrate shell access request process to Phabricator - https://phabricator.wikimedia.org/T72627#1237818 (10Qgil) The best process is no process at all, so if this works for the Labs team, it works for me as well. :) [11:02:39] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [11:32:43] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [12:19:49] 10Tool-Labs, 5Patch-For-Review: Get rid of jlocal - https://phabricator.wikimedia.org/T95796#1237940 (10valhallasw) 5Open>3Invalid a:3valhallasw [12:20:31] 10Tool-Labs: Puppetize gridengine master configuration - https://phabricator.wikimedia.org/T95747#1237942 (10valhallasw) p:5Triage>3Normal [12:20:45] 10Tool-Labs: Crontabs are not backed up - https://phabricator.wikimedia.org/T95798#1237944 (10valhallasw) p:5Triage>3High [12:21:28] 10Tool-Labs, 3ToolLabs-Goals-Q4: Admin should not rely on NFS to run - https://phabricator.wikimedia.org/T95925#1237945 (10valhallasw) p:5Triage>3Low [12:23:04] 10Tool-Labs, 3ToolLabs-Goals-Q4: Crontabs are not backed up - https://phabricator.wikimedia.org/T95798#1237948 (10valhallasw) [12:23:21] 10Tool-Labs, 3ToolLabs-Goals-Q4: Harmonize VMEM available on all exec hosts - https://phabricator.wikimedia.org/T95979#1237949 (10valhallasw) p:5Triage>3Low [12:24:26] 10Tool-Labs: Trusty instances do not show the motd banners - https://phabricator.wikimedia.org/T85307#1237956 (10valhallasw) p:5Triage>3Normal [12:25:21] 10Tool-Labs: Make puppet cron mails informative - https://phabricator.wikimedia.org/T96122#1237959 (10valhallasw) [12:25:40] 10Tool-Labs: Make puppet cron mails informative - https://phabricator.wikimedia.org/T96122#1208806 (10valhallasw) p:5Triage>3Low [12:25:49] 10Tool-Labs: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#1237962 (10valhallasw) p:5Normal>3Low [12:26:06] 10Tool-Labs, 3ToolLabs-Goals-Q4: Clean up local crontabs - https://phabricator.wikimedia.org/T96472#1237964 (10valhallasw) p:5Triage>3Normal [12:26:45] 10Tool-Labs, 5Patch-For-Review: Webservices get unregistered with proxy randomly - https://phabricator.wikimedia.org/T96625#1237970 (10valhallasw) 5Open>3Resolved a:3valhallasw [12:27:55] 10Tool-Labs: tools-redis broken - https://phabricator.wikimedia.org/T96485#1237972 (10valhallasw) 5Open>3Resolved a:3valhallasw Overcommit was disabled by @YuviPanda and the biggest memory usage was solved by {{T91979}}, so hopefully this is resolved. [12:28:45] 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1237977 (10valhallasw) p:5Triage>3Normal [12:29:12] 10Tool-Labs, 5Patch-For-Review: Provide a clone of cdnjs for toollabs users - https://phabricator.wikimedia.org/T96799#1237979 (10valhallasw) 5Open>3Resolved a:3valhallasw [12:29:33] 10Tool-Labs: Document email setup - https://phabricator.wikimedia.org/T96884#1237981 (10valhallasw) p:5Triage>3Normal [12:30:53] 10Tool-Labs, 5Patch-For-Review: Set up diamond/graphite/shinken for tools-mail exim paniclog - https://phabricator.wikimedia.org/T96898#1237986 (10valhallasw) p:5Triage>3Low [12:31:46] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-mail redundant - https://phabricator.wikimedia.org/T96967#1237988 (10valhallasw) p:5Triage>3Low Mail is fairly redundant by design (it will just be delivered later), so triaging as Low. [12:32:20] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1237990 (10valhallasw) [12:32:40] 10Tool-Labs, 3ToolLabs-Goals-Q4: Support for truly 'generic' webservices via service manifests - https://phabricator.wikimedia.org/T97230#1237992 (10valhallasw) p:5Triage>3Normal [12:32:51] 10Tool-Labs: webservice start / restart shouldn't always need to specify type of server - https://phabricator.wikimedia.org/T97245#1237994 (10valhallasw) p:5Triage>3Normal [12:33:39] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1237996 (10valhallasw) [12:33:41] 10Tool-Labs: Document email setup - https://phabricator.wikimedia.org/T96884#1237997 (10valhallasw) [12:41:06] 10Tool-Labs, 5Patch-For-Review: Move list of proxies to hiera - https://phabricator.wikimedia.org/T91954#1238010 (10valhallasw) @yuvipanda, this means we should add ``` "toollabs::proxies": ['tools-webproxy-01', 'tools-webproxy-02'] ``` to https://wikitech.wikimedia.org/wiki/Hiera:Tools, right? Could you check... [13:03:40] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [13:11:42] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Muehlenhoff was created, changed by Muehlenhoff link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fMuehlenhoff edit summary: Created page with "{{Tools Access Request |Justification=General ops work, so that I can e.g. track down usage on the gridengine installation (I tried logging into tools-login, but that fails) |..." [13:12:52] petan: ^ that link doesn't work :( [13:13:09] I think it shouldn't urlescape the slashes? [13:13:27] hum, and also not the pluses, because this is a URL and not a parameter [13:13:36] the spaces to pluses* [13:17:03] mhm [13:33:40] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [13:37:25] Coren: have time to help me diagnose a few things? [13:37:38] andrewbogott: Sure. [13:39:13] ok… three things (the first two possibly related): 1) tools-bastion-01 keeps sending shinken ‘host down’ alerts. 2) CPU usage on virt1005 is weirdly high 3) lots of shinken puppet failures overnight. [13:39:31] um, sorry, labvirt1005 [13:39:58] tools-exec-04 and tools-bastion-01 are both on labvirt1005. They’re consistently at the top of ‘top’ on that host. [13:40:04] Hm. I take it if you think they are related that bastion-01 is on labvirt1005? [13:41:12] ... clearly [13:41:38] It’s possible that the cpu on virt1005 is fine and we just happen to have some busy instances there… [13:41:47] But I’m surprised it filled up with only 20 instances [13:43:03] Hm. Yuvi may have been a bit generous with the bastion. 4 cores [13:43:30] exec nodes are always ‘large’ right? [13:43:32] And there are users running bots on it; of course. [13:43:41] nowadays they are yet. [13:43:44] yes* [13:43:46] ok [13:44:05] I need to kick some process butt on the bastion. [13:44:18] But I'm a bit annoyed that this would suffice to overload 005 [13:44:22] Still I feel like a couple of busy large instances shouldn’t be enough to push one of these giant servers into the yellow. [13:44:31] I agree. [13:44:39] Yeah, exactly. I suspect that there’s something else going on. But we’ll see after you clean up. [13:45:14] Actually, Ima keep the load because I want to see what's up with labvirt1005 first. There may be some tunable that needs tuning. [13:45:58] ok. Thank you! [13:48:21] Hm. It's actually not unreasonably bad - I see them consistently using between 1 and 2 cores at 100% which looks like it matches what the instances are, in fact, using. [13:49:06] Why shinken is being noisy about it, though, it's not obvious. The boxes are loaded but not sluggish. [13:49:39] Yeah, actually load has dropped a bit. So bastion hitting the top might’ve been just right when I looked. [13:49:54] The cvn instance is often up there as well, but that’s expected; it’s always busy. [13:54:29] There may actually be something odd with the VM itself. [13:54:54] the bastion vm you mean? [13:54:57] [3049391.525080] BUG: soft lockup - CPU#0 stuck for 22s! [puppet:16765] [13:54:59] Yeah. [13:55:05] oh.. [13:55:26] That’s not the same as last week’s thing is it? I keep rechecking the kernel on 1005 and it keeps looking right [13:56:06] Doesn't look like. [13:58:44] Hm. Can you give me another random instance on 1005 that isn't in tools? [13:58:47] or cvn [13:59:58] here’s all of them: https://dpaste.de/ZpNJ [14:00:17] shinken isn’t reporting anything about those deployment instances, fwiw [14:01:06] Yeah, afaict the only instance that has real issues is tools-bastion-01. Something is going odd with it; I see bursts of the watchdog/* kernel processes showing in top which is... insane. [14:01:36] Everything else is responsive. [14:02:24] But now that I'm looking close it's not happening again. [14:03:06] * Coren hates heisenbugs. [14:04:24] puppet in action: https://s-media-cache-ak0.pinimg.com/736x/35/a8/75/35a875b1989518e53af2d776bb8df0c7.jpg [14:08:03] andrewbogott: I'm not trusting the actual VM. A simple reboot won't actually spawn a new qemu will it? [14:08:28] I don’t think so. [14:08:36] We could migrate it — that will. [14:08:48] Or a nova stop/start will as well, I believe. [14:08:59] suspect/restart… might? [14:09:01] I can try that. [14:09:03] Sounds like a plan - a migrate won't actually interrupt users. [14:09:17] ok, stay tuned [14:09:35] Just don't migrate it to wherever -bastion-02 is. :-) [14:10:19] ok, it’s en route to labvirt1006 [14:18:19] Coren: in mysql, given a process id from ‘show processlist’ how can I see the actual query syntax? [14:18:42] there is a table for it in mysql schema [14:18:46] hold on [14:19:43] Coren: if you have time, could you go over a few tools puppet patches in gerrit? [14:20:02] andrewbogott: show processlist does show the query. [14:20:16] valhallasw`cloud: I will, once I feel a bit more confident about the bastion thing. [14:20:28] andrewbogott: there is also "show full processlist" [14:20:33] Oh… [14:20:33] sure! thanks :-) [14:20:39] that will display whole query [14:20:43] not trimmed [14:20:48] Every process save one says ‘null’ in the Info column [14:20:52] which is where the query should be. [14:21:06] What does /that/ mean? [14:21:14] andrewbogott: That means the process is sleeping, idle. [14:21:33] Ah, I see — it’s just someone who has an open mysql session but isn’t using it? [14:21:39] andrewbogott: Right. [14:21:42] Great, then I definitely have nothing to worry about on this front. [14:21:49] thanks [14:21:57] The command column should also be 'Sleep' [14:22:02] yep, it is [14:22:27] I’m just casting about for reasons why virt1000 puppetmaster might be timing out. Clearly mysql usage isn’t one of them, at least not at the moment. [14:24:07] virt1000? That's an unrelated issue to the one I'm looking at now then? [14:24:07] network issues? [14:24:17] Coren: yes, that was 3) on my list above :) [14:24:31] petan: yeah, possible. [14:24:46] there was someone complaining about weird timeouts few days ago [14:24:51] maybe it's related [14:24:59] Oh, ah, okay - I thought your 3) applied to the hosts mentionned in 1) and 2) :-) [14:25:41] Coren: nope, 3) is happening to lots of vms. [14:25:51] Not just ones on 1005, alas [14:26:08] petan: weird timeouts should be fixed, that was the kernel bug. [14:26:14] Hm. At least puppetmaster runs on metal so we know it's not another freak virt thing. :_) [14:28:19] Coren: I’ve been attributing those puppet failures to https://phabricator.wikimedia.org/T96256. But, that number is way down, and the puppet failures stopped for a week, then resumed last night. Naturally they resumed the moment I closed that phab ticket. [14:29:17] The the number of tokens didn't mysteriously grow again? [14:30:42] 1771279 [14:30:49] still decreasing [14:31:04] although, still huge, so maybe I should ignore puppet failures until that number gets into the queryable range. [14:31:14] I know that keystone periodically tries to purge and then the query times out. [14:36:57] PROBLEM - Host tools-bastion-01 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2471.09 ms [14:37:53] andrewbogott: Hows the migration? [14:38:11] Slow but still happening best I can tell. [14:38:25] maybe 1/3 done [14:38:41] PROBLEM - SSH on tools-bastion-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:11] andrewbogott: Yeah, the vm is stalling badly - but it's the only vm that does so on 1005 [14:43:40] RECOVERY - SSH on tools-bastion-01 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:43:52] What do you think — want to regard this as a blocker for further migration, or should I resume the tools migration? [14:44:53] andrewbogott: I think I'll reserve judgement until I see the vm running in a fresh qemu on a different server. [14:45:08] ok [15:22:02] PROBLEM - Host tools-bastion-01 is DOWN: PING CRITICAL - Packet loss = 14%, RTA = 2449.40 ms [15:22:09] andrewbogott: Do you have a ballpark eta for the migration? [15:22:45] It’s slow. Maybe another 30 mins [15:23:28] Coren: did you already follow up on ‘Wikimedia and Genealogics’? [15:24:03] andrewbogott: I'm looking at it now, but I don't think it's going to be possible to locate who did this after the fact. [15:24:29] oh yeah, if it’s not ongoing there’s not much option other than writing to the mailing list and asking, ‘who did this?' [15:24:35] Well, except ask. I'm guessing it's someone snarfing up the data to check against wikidata. [15:24:43] well, guess based on the projects? [15:24:56] if there's a project for genealogy, that could be a good first bet [15:25:13] is that IP a floating IP? or is it part of some general NAT pool? [15:25:33] we can always run a find on the filesystem as well with some filename guesses [15:26:13] paravoid: My first reflex would be to just ask. I doubt this was a bad faith move, just a bonehead maneuver. [15:26:30] Ask + remind people to not do things like that. [15:31:55] paravoid: If you have exact times I might be able to find a specific user by matching against jobs being run, but otherwise all I can tell is 'was run on the grid' [15:32:11] s/have/had/ [15:32:23] Do you want me to contact the dude from genealogics.org? [15:32:37] sure [15:32:50] but we should have a way to handle abuse requests like that, I'd say [15:38:28] Well, I'd think that anything sent to abuse@ should end up in a security phab task for tracking anyways, as a general sop [15:39:51] abuse@ gets lotsofspam [15:40:06] but feel free to file a task for this one specifically [15:41:59] PROBLEM - Host tools-bastion-01 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2903.84 ms [15:43:46] paravoid: Sorry, didn't mean to imply that there should be an *automatic* phab task on incoming email. That would have horrid s/n. :-) [15:44:31] :) [15:44:45] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:46:51] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 470.65 ms [15:50:25] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Muehlenhoff was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=155708 edit summary: [15:50:46] Coren: tools-webgrid-07 can’t fork. [15:51:02] * Coren looks at it [15:51:09] Error: /Stage[main]/Base::Standard-packages/Package[pv]/ensure: change from 1.2.0-1 to latest failed: Could not get latest version: Cannot allocate memory - fork(2) [15:51:23] Ah, running oom [15:52:13] There's a job that's going a bit cray-cray. Killing. [15:54:12] 10Tool-Labs, 5Patch-For-Review: Make puppet cron mails informative - https://phabricator.wikimedia.org/T96122#1238399 (10valhallasw) 5Open>3Resolved a:3valhallasw [15:54:33] andrewbogott: puppet then ran without issue. [15:54:51] great. [15:55:06] bastion-01 seems completely down atm. Migration woes? [15:56:56] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228) [15:57:16] Yes, shinken-wm, I just /said/ that. [16:01:15] is there a tools bastion host which is still running precise? [16:02:43] sitic: Hm, there should be. I hope Yuvi didn't go overboard in his lets-do-trusty enthusiasm. :-) [16:03:19] sitic: The old tools-dev still does. :-) [16:03:32] ah yes thanks [16:03:57] Coren: this is the sync at the end of the migration, should be done soon. [16:09:37] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [16:20:19] Is there an equivilant for usercontibs that work for getting special contribs, eg an ip user? https://en.wikipedia.org/wiki/Special:Contributions/1.2.3.4 [16:21:27] 6Labs, 6Analytics-Engineering: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. - https://phabricator.wikimedia.org/T76075#1238472 (10kevinator) [16:22:31] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. - https://phabricator.wikimedia.org/T76075#789100 (10kevinator) [16:23:19] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#1238477 (10kevinator) [16:23:35] FutureTense: https://tools.wmflabs.org/guc/ ? [16:24:05] using the api [16:24:53] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#789100 (10kevinator) [16:25:16] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#789100 (10kevinator) [16:25:58] andrewbogott: For different values of 'soon'. Something went wrong or is it just being sluggish? [16:26:08] no idea what’s happening now :( [16:26:12] It still says it’s migrating [16:27:18] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#789100 (10kevinator) [16:40:55] Coren: we can keep waiting and see what happens, or murder it. (Or I can explicitly kill it on 1005 and restart it on 1006, which might work or might accidentally murder it.) [16:41:33] andrewbogott: It already has been without a pulse for some time now - Might as well do the murder-and-revive. [16:42:12] ok. btw, we have a second working bastion, why ‘tools-login is having issues’? [16:42:40] Oh, is tools-dev bastion-02? [16:42:44] * Coren nods [16:51:58] RECOVERY - Host tools-bastion-01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [16:52:13] well… that was graceless [16:52:17] But it’s on 1006 now [16:53:55] It shows none of the symptoms that were plaging it on 1005. Given that no other VM on 1005 seemed ill, I'm going to tentatively conclude that the problem was with the VM and not the host. Makes sense to you? [16:54:15] Coren: that's excellent, thanks! :) [16:54:46] Coren: Yeah, makes sense. I’m still concerned that 1005 is so busy, though, makes me suspect something else (or something unrelated) is happening there. [16:55:36] Coren: I’m going to restart the migration but exclude 1005 for the moment. [16:56:28] andrewbogott: May be a perfect storm of too many busy instances. The bastions occasionally get badly misused, and we know cvn is a hog... [16:56:56] Yeah. I guess in a few minutes ganglia will show the change if the bastion was a big slice of the load. [16:57:21] paravoid: Now to write the polite reminder to everyone. [16:58:30] Krinkle|detached: the cvn app nodes are gobbling CPU. Is this a case where a bit of dev work could shave that down, or is it already reasonably well-designed in your experience? [17:01:00] 10Tool-Labs, 3ToolLabs-Goals-Q4: Crontabs are not backed up - https://phabricator.wikimedia.org/T95798#1238648 (10valhallasw) As I understand, /data/project and /home are backed up, while most other places are not. Possible options: - make crontab store a copy of the crontab in $USER/.crontab. That has the a... [17:06:50] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:16] andrewbogott: It is not reasondably well-designed, and the 2 lead developers of that application have left the organisation. [17:22:11] Until resources are available in the community or at the foundation to focus on reasonable counter-vandalism efforts, I've mostly ended up with the task on trying to herd these applications. CVNBot.exe is one of them. I assume that one is the CPU gobbling one. [17:22:40] It's written in .NET / C# [17:23:55] andrewbogott: I've tried reaching out in the past on several occasions for someone with C# to knowledge to give pointers on debugging it or some perf improvements. https://github.com/countervandalism/CVNBot [17:23:58] But no luck so far. [17:24:00] runs on mono? [17:24:03] Yup [17:24:33] neat. first honest to god effort I've seen for the framework. [17:24:44] I made a few minor bug fixes and some compat fixes to make it with on Precise mono (there were some changes since the Lucid mono that didn't work well.) [17:25:17] But I wouldn't know where to start tracing the internals for CPU spikes [17:25:54] andrewbogott: Has it gotten worse since the migration? Or are you watching it more closely? [17:26:24] https://tools.wmflabs.org/nagf/?project=cvn#h_cvn-app4_cpu [17:26:39] I noticed the graphs look different post-migration. Maybe it's measured differently [17:27:11] Krinkle: I’d be surprised if it got worse. But, I don’t know much about how nagf works. [17:27:26] andrewbogott: Graphite.wmflabs / standard Diamond collected by labs [17:27:32] Krinkle: it’s just that I’m noticing it now. [17:28:10] andrewbogott: I did notice (both in cvn and in integration) that since about a month ago, disk access has gotten significantly worse. Quite often it stagnates local disk access for 30 seconds or more. [17:28:19] Not NFS related. [17:28:53] Krinkle: that’s interesting. If you know exactly when it got worse I can see what changed then. [17:29:04] This was most notably exposed in how MediaWiki installs mysql databases in Jenkins, sometimes it woudl take 10 seconds, sometimes 4 minutes and 10 seconds. [17:29:10] All of that went away when I moved mysql into tmpfs. [17:29:15] Possibly instances on the HP nodes (or on Trusty) have different performane. [17:29:44] That was before the labsvirt migration of last week [17:30:03] Krinkle: If you have time to investigate… create three test instances and I’ll move them around so that we have one on precise/cisco one on precise/hp and one on trusty/hp and we can see what the difference is. [17:30:24] We’ve had all three of those options for a couple of months. [17:35:21] Krinkle: another thing that would satisfy my curiosity (but may not be worth the pain): Just shut down cvn-app4 for 10 minutes and see if that makes labvirt1005 chill out. [17:35:44] andrewbogott: Nothing from cvn-app5? [17:36:19] it’s on labvirt1006 which is not especially busy. [17:36:52] Krinkle: I’m looking at https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Virtualization%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false and my question is… what gives, labvirt1005? [17:44:32] andrewbogott: I'll see if I can move the bots to cvn-app5 in a few hours. [17:44:41] Krinkle: great! thanks. [17:45:28] andrewbogott: The standard warning of https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org applies to cvn as well, so I really don't want any more downtime. Else they'll burn me alive :P [17:46:06] Krinkle: agreed! [18:06:27] PROBLEM - Host tools-exec-14 is DOWN: CRITICAL - Host Unreachable (10.68.17.233) [18:08:36] RECOVERY - Host tools-exec-14 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [18:09:11] andrewbogott: ^ [18:09:31] YuviPanda: it just migrated [18:09:36] ah ok [18:12:58] YuviPanda: migration is happening in alphabetical order, so you can guess what’s next. [18:13:10] heh [18:21:39] 6Labs, 6operations: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239133 (10yuvipanda) 3NEW [18:22:01] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239144 (10yuvipanda) [18:32:43] Coren: Did you see the message I left while you were away about some jobs running on two different exec hosts? [18:33:22] anomie: I have, I was waiting to talk to you to clarify thing - I'm in the ops meeting atm, do you have a few minutes in about ~30m? [18:33:35] Yes, I'll be around [18:33:46] I'll ping you once I'm free. [18:45:25] I had received tools access for my account Jeph_paul, I recently changed my key pair & I'm no longer able to login [18:45:45] Jeph_paul: hey! did you update your new key on wikitech.wikimedia.org? [18:45:45] Warning: the ECDSA host key for 'login.tools.wmflabs.org' differs from the key for the IP address '208.80.155.130' [18:45:49] on preferences? [18:45:51] Jeph_paul: oh [18:45:55] I did [18:46:12] Jeph_paul: ah that’s new fingerprints for new machines [18:46:19] Jeph_paul: so you’re ok :) [18:46:47] Jeph_paul: https://lists.wikimedia.org/pipermail/labs-announce/2015-April/000009.html has the new keys [18:48:08] So me not being able to ssh into tool labs has to do with my keys? [18:49:17] Jeph_paul: did you answer’ yes’ to the warning? [18:49:35] Jeph_paul: and if you had changed your keys, then yes, you need to update your public key in wikitech.wikimedia.aorg [18:50:19] I answered yes to the warning & I had changed my keys, let me try updating my keys again then [18:52:43] I tried updating my new key in the preferences. But it still says "Permission denied (publickey)." [18:54:40] Jeph_paul: what is your username? [18:55:40] "Jeph_paul" [18:56:07] that is my user name on wikitech [18:56:18] thats the same I guess I used before [18:57:04] Jeph_paul: I’m tailing the logs now, can you attempt to ssh again? [18:57:18] anomie: Allright. Your original long scrolled out of my buffer; can you point me at the issue again and I'll look into it. [18:57:29] Coren: The grid says job 120810 is running on tools-exec-13, which it is (pid 31243), but it's also running on tools-exec-14 (pid 18252). Same for job 120815, grid has it on -14 (pid 23331) but it's also running on -15 (pid 29007). And job 120816, grid has it on -15 (pid 19830) but it's also running on -14 (pid 18251). And job 120818, grid has it on -06 (pid 12680) but it's also running on -14 (pid 20783). [18:58:04] Yes, sure, ssh ing now [18:59:49] PROBLEM - Host tools-exec-20 is DOWN: CRITICAL - Host Unreachable (10.68.17.251) [19:02:45] RECOVERY - Host tools-exec-20 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [19:02:54] anomie: Looking into it now. [19:03:49] Jeph_paul: hmm, Jeph_paul is an invalid user. can you check what your shell username is? [19:03:55] Jeph_paul: it should be in your preferences on wikitech [19:06:13] cosmiclattes is my shell username, & now it works, I can ssh [19:06:23] Thanks [19:06:33] Jeph_paul: yw :) [19:09:24] anomie: That's really odd. As far as I can tell, those are jobs that were forcibly restarted by the gridengine that had not really properly died. [19:09:51] anomie: I'm not entirely certain how that can have happened in the first place. [19:09:57] * Coren digs deeper in the logs. [19:10:48] I mean, given the timestamps it's pretty clear that this happened as a side effect of the new hardware snafu, but it's the first time that I see that particular odd side effect. [19:17:57] Coren: Whenever you're finished investigating, feel free to kill the rogue processes or do whatever else is necessary to return things to a sane state. [19:20:55] 10Tool-Labs, 5Patch-For-Review: Fix and clean up generation of ssh_known_keys - https://phabricator.wikimedia.org/T92379#1239413 (10valhallasw) 5Open>3Resolved [19:20:57] 10Tool-Labs: Remove redundant ssh host keys from users' known_hosts - https://phabricator.wikimedia.org/T92497#1239415 (10valhallasw) [19:30:04] So, my tool on labs that runs queries against assorted databases is not working right now for things except enwiki. It looks like it can't connect to some of the replica databases. Is that known/expected right now? [19:30:23] ragesoss: no. which databases are being affected? [19:30:29] ragesoss: No, I'va had no report of db outages. [19:30:47] s3.labsdb refuses new connections: "ERROR 1040 (08004): Too many connections" [19:30:55] boo [19:31:05] Coren: YuviPanda: "Cannot connect to auth database" [19:31:06] Coren: do you want to take a look or should I? [19:31:15] YuviPanda: I will; that's an easy one. [19:31:18] cool [19:33:24] tools.sigma is being unreasonable. [19:35:28] so yeah, it's the centralauth database that my tool can't connect to. [19:35:33] Not sure what sandbot3 is trying to do, but >900 connections to the db making the same query is clearly a bug. [19:40:11] The processes are all in killed state, should get better shortly [19:46:13] Hm. [19:47:28] has the 'add instance' link gone somewhere else? i have admin on a few projects, but not seeing an 'add instance' link on any of their Nova_Resource: pages [19:48:12] ebernhardson|lch: 1. have you tried logging out and back in? 2. also, the add instance link was always on Special:NovaInstance and I’m unsure if it was ever on Nova_Resource [19:48:57] YuviPanda: i always forget that wikitech wants me to log out and back in :) Also that's where Help: [19:49:06] Help:Instance says it should be there i mean, its probably the log out and back in ... trying [19:49:20] all the Help pages on wikitech about labs should be killed at some point [19:49:21] and rewritten [19:49:27] :) [19:50:20] YuviPanda: still not at the documented one, but its at Special:NovaInstance. htanks! [19:50:35] cool [19:51:26] i imagine there are some sort of resource limits i should follow? basically these phantomjs tests for cirrus take >1hr on my laptop so wanted to boot up a 4 or more core machine i could run them on in labs [19:55:09] ebernhardson|lch: don’t hit NFS :) and you can enable the srv role in Special:NovaInstance configure to have more storage in /srv if you want [19:55:16] ebernhardson|lch: outside of that, we have quotas in place so.. [19:55:27] ok, sounds good [20:00:20] YuviPanda: Turns out the db is ill in some unspecified manner. Working on it. [20:01:39] (03PS4) 10Awight: Use block style YAML [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 [20:03:32] It was stuck on a futex in some sort of deadlock. [20:03:48] All better now. [20:05:19] sitic: looks like your gsoc project was picked up :) [20:08:07] YuviPanda: yep =) [20:10:36] ragesoss: Should be all better now. [20:11:42] hello folks :) [20:18:26] o/ hashar [20:18:39] anomie: Can you stop all your jobs for a little bit? [20:19:20] Coren: Stop as in qdel, or stop as in signal them all to pause until they get the unpause signal? [20:19:38] anomie: Stop as in qdel and not restart for a few minutes. [20:19:57] Coren: I don't have any automatic restart. qdeled [20:20:27] Well, not actually qdeled. [20:20:47] But sent the "halt at the earliest opportunity" signal. [20:20:59] Coren: All seem to have exited. [20:21:48] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239554 (10yuvipanda) 3NEW [20:22:53] anomie: afaict, everything is clean. You can restart them. [20:23:02] * anomie restarts [20:26:19] 10Wikimedia-Labs-Infrastructure: Set ulimits on bastion - https://phabricator.wikimedia.org/T47670#1239576 (10yuvipanda) [20:27:15] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239582 (10Andrew) The reason for human interaction is to prevent a bot (or malicious human) from suddenly allocating 100 accounts and using us to mine bitcoin... [20:27:35] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239583 (10yuvipanda) [20:27:37] 10Wikimedia-Labs-Infrastructure: Set up ulimits on bastion - https://phabricator.wikimedia.org/T56719#611594 (10yuvipanda) [20:31:38] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239591 (10yuvipanda) @Andrew yeah but the people adding people to new projects is different from the people granting shell access, so it requires two manual i... [20:33:00] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239593 (10yuvipanda) @Andrew we can also put in the same roadblocks in place against bot account creation that prod has (Captcha, AbuseFilter, Rate Limiting) [20:43:12] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239613 (10Legoktm) >>! In T97334#1239593, @yuvipanda wrote: > @Andrew we can also put in the same roadblocks in place against bot account creation that prod h... [20:50:19] 10Wikimedia-Labs-wikitech-interface: Automatically grant shell user right to everyone who signs up on wikitech - https://phabricator.wikimedia.org/T97334#1239663 (10yuvipanda) We do not get much account spam atm with no captchas at all, and you just use abusefilter after the fact. And remember what happened when... [20:58:41] PROBLEM - Host tools-exec-24 is DOWN: CRITICAL - Host Unreachable (10.68.17.255) [21:03:10] RECOVERY - Host tools-exec-24 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [21:08:20] Krinkle: did you change anything about app4? labvirt1005’s cpu graph is suddenly normal. [21:08:30] andrewbogott: Nope, not yet [21:08:34] curious! [21:09:18] So app4 runs CVNBot5-10 https://github.com/countervandalism/stillalive/blob/master/localSettings-cvn.json#L29-L34 which is what powers #cvn-sw [21:09:24] The most busy channel we have [21:09:37] it monitors several hundred so called "small" wikis [21:09:44] [[m:SWMT]] [21:09:49] (on Freenode) [21:10:20] It basically just listens to irc.wikimedia.org, filters it based on data in its sqlite db, and then outputs it to a Freenode channel. [21:11:24] Maybe the wikis themselves calmed down [21:13:54] Krinkle: could be, or it could be that the activity on that host is nothing to do with cvn. [21:14:26] andrewbogott: Yeah.. [21:14:54] andrewbogott: I don't know how queriable our cpu metrics are on graphite, but maybe you can correlate it with another instance/project somewhere? [21:15:39] limited to the instance names hosted on that virt host [21:16:41] I’m pretty much limited to total-server usage and running ‘top’ to see what’s busy. [21:18:05] which, the top user isn’t cvn anymore. [21:21:20] 6Labs, 10Tool-Labs, 6operations, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1239797 (10yuvipanda) To begin with, how about we just have one for when toollabs home page is up and showing sensible things? [21:25:30] 10Wikimedia-Labs-Other, 6Phabricator: phab-01.wmflabs.org test instance's statuses are out of date - https://phabricator.wikimedia.org/T76943#1239808 (10Aklapper) 5Open>3Resolved a:3Aklapper Got admin. Updated https://phab-01.wmflabs.org/config/edit/maniphest.statuses/ with the latest values, hence closi... [21:25:37] (03CR) 10Merlijn van Deen: [C: 032] "Sorry for letting this lie for so long, and thank you for your patience in rebasing!" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 (owner: 10Awight) [21:28:20] (03Merged) 10jenkins-bot: Use block style YAML [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/196852 (owner: 10Awight) [21:31:00] [13intuition] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/Krinkle/intuition/commit/03f9c3919de12b5c535fe45c5f2d8cfd538946f0 [21:31:01] 13intuition/06master 1403f9c39 15Timo Tijhof: build: Update devDependencies and use JSCS preset [21:34:00] YuviPanda: if you have time, there's a few more tool labs infra patches that could use a lookover / +2 [21:34:36] valhallasw`cloud: oh sure. link? [21:35:36] valhallasw`cloud: am looking at the exim collector now. [21:35:43] valhallasw`cloud: you should probably contribute that upstream too :) [21:35:50] YuviPanda: there's a few older ones from scfc as well [21:36:07] but I forgot to give them a lookover and +1 where applicable [21:36:56] e.g. https://gerrit.wikimedia.org/r/#/c/203667/ and https://gerrit.wikimedia.org/r/#/c/202363/ [21:37:29] valhallasw`cloud: I -1’d https://gerrit.wikimedia.org/r/#/c/148917/2 on which the first patch depends [21:38:27] valhallasw`cloud: I left a comment on the exim one [21:39:03]