[06:46:53] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [07:16:51] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [07:25:45] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make list.php not rely on portgranter - https://phabricator.wikimedia.org/T93197#1202289 (10scfc) [07:25:46] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Provide a status page (list) of all active proxy definitions - https://phabricator.wikimedia.org/T88216#1202288 (10scfc) 5Open>3stalled [07:36:55] Using the uwsgi-python webservice, is there a built-in way to get some basic usage stats? I'm thinking hit counts, maybe a 30-day history of hit count, binned into intervals of a couple of minutes. [07:37:02] Or is it necessary to roll your own? [07:37:24] Sorry, talking tools labs here. [07:38:47] GoldenRing: none yet, sadly. hopefully in the future, though [07:39:00] you can probably parse uwsgi.log for it easily enough tho [08:48:27] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1202334 (10scfc) I was looking at one of the transient Puppet failures and found this on `tools-exec-09`'s `/var/log/syslog.1` (Precise): ``` Apr 13 06:25:01 tools-exec-09 CRON[17813]: (root) CMD (test -x /... [08:51:46] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1202340 (10scfc) [10:04:18] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1202418 (10scfc) (Filed https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=782501 for the underlying issue.) 
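The uwsgi.log-parsing suggestion above ("you can probably parse uwsgi.log for it easily enough") could be sketched like this. This is a minimal sketch, assuming uwsgi's default request-log style where each request line carries a bracketed timestamp such as `[Mon Apr 13 07:36:55 2015]`; the actual log format of a given tool may differ.

```python
import re
from collections import Counter

# Assumes uwsgi's default request-log style, where every request line
# contains a bracketed timestamp like "[Mon Apr 13 07:36:55 2015]".
# The exact format of your tool's uwsgi.log is an assumption here.
TIMESTAMP = re.compile(r"\[\w{3} (\w{3}) +(\d+) (\d+):(\d+):\d+ (\d{4})\]")

def hit_counts(lines, bin_minutes=5):
    """Count hits per time bin; keys are (year, month, day, hour, minute_bin)."""
    counts = Counter()
    for line in lines:
        m = TIMESTAMP.search(line)
        if not m:
            continue  # skip startup banners and other non-request lines
        mon, day, hour, minute, year = m.groups()
        minute_bin = int(minute) // bin_minutes * bin_minutes
        counts[(year, mon, int(day), int(hour), minute_bin)] += 1
    return counts

# Usage (path assumed): hit_counts(open("uwsgi.log"))
```

For a 30-day history this would be run over the rotated logs as well; anything fancier (persistence, graphs) is indeed roll-your-own, as the chat says.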
[10:37:38] 10Tool-Labs: Convert updatetools.pl into a puppetized Python service with monitoring - https://phabricator.wikimedia.org/T94858#1202504 (10scfc) a:3scfc [10:59:01] 10Tool-Labs: Improve & force sudo lecture - https://phabricator.wikimedia.org/T95882#1202539 (10valhallasw) 3NEW [11:01:30] 10Tool-Labs: Improve & force sudo lecture - https://phabricator.wikimedia.org/T95882#1202546 (10valhallasw) [11:20:49] 10Tool-Labs: Improve & force sudo lecture - https://phabricator.wikimedia.org/T95882#1202579 (10scfc) Users use `sudo` disguised as `become` to access their tools. You don't want to show them a warning every time, I assume? I'm not sure if the problem doesn't lie with the warning side of it; i. e. the informat... [11:22:34] 10Tool-Labs-tools-Other: Migrate http://toolserver.org/~Emijrp to Tool Labs - https://phabricator.wikimedia.org/T62887#1202580 (10TTO) 5Open>3Resolved a:3TTO https://en.wikipedia.org/w/index.php?title=User_talk:Emijrp&diff=656243483&oldid=656242154 [11:22:36] 10Tool-Labs-tools-Other, 7Tracking: [tracking] toolserver.org tools that have not been migrated - https://phabricator.wikimedia.org/T60865#1202583 (10TTO) [11:22:53] 10Tool-Labs-tools-Other: Migrate http://toolserver.org/~Emijrp to Tool Labs - https://phabricator.wikimedia.org/T62887#1202585 (10TTO) a:5TTO>3None [11:50:11] 10Tool-Labs-tools-Quentinv57's-tools: SUL Info results incorrect for specific user - https://phabricator.wikimedia.org/T85642#1202595 (10Aklapper) [11:51:44] 10Tool-Labs-tools-Quentinv57's-tools: editcount tool gives a significant different/wrong number of edits - https://phabricator.wikimedia.org/T67741#1202598 (10Aklapper) @Cyberpower678: Still working on this (as you are set as assignee)? Still high priority? 
[11:52:21] 10Tool-Labs-tools-Quentinv57's-tools: Lost connection to MySQL server at 'reading authorization packet', system error: 0 - https://phabricator.wikimedia.org/T63469#1202603 (10Aklapper) p:5Triage>3Normal [11:52:23] 10Tool-Labs-tools-Quentinv57's-tools: Fatal error: Call to a member function fetch_assoc() on a non-object in /data/project/quentinv57-tools/public_html/tools/globalcontribs.php on line 280 - https://phabricator.wikimedia.org/T73069#1202600 (10Aklapper) p:5Triage>3Normal [11:52:25] 10Tool-Labs-tools-Quentinv57's-tools: Global Sysop Statistics are not working - https://phabricator.wikimedia.org/T72185#1202599 (10Aklapper) p:5Triage>3Normal [11:52:27] 10Tool-Labs-tools-Quentinv57's-tools: Local Stewards Log's hyperlink that links to an account (who's userrights are changed) is broken - https://phabricator.wikimedia.org/T67226#1202602 (10Aklapper) p:5Triage>3Normal [11:52:29] 10Tool-Labs-tools-Quentinv57's-tools: Empty query in /data/project/quentinv57-tools/public_html/tools/sulinfo.php on line 207 - https://phabricator.wikimedia.org/T72874#1202601 (10Aklapper) p:5High>3Normal [11:52:36] 10Tool-Labs-tools-Quentinv57's-tools, 7I18n: SUL info tool fails with unicode usernames - https://phabricator.wikimedia.org/T67144#1202605 (10Aklapper) p:5Triage>3Low [12:41:29] 10Tool-Labs: Register labs-announce at Gmane - https://phabricator.wikimedia.org/T94647#1202663 (10scfc) The list is now up at http://news.gmane.org/gmane.org.wikimedia.labs.announce; I've asked Lars to import the missing message. [12:44:47] 10Tool-Labs: Clean up list of projects on Tool Labs home page and add Tomcat tools - https://phabricator.wikimedia.org/T51937#1202664 (10Ricordisamoa) Also, don't uWSGI-based tools need to be added too? 
[14:09:39] 10Tool-Labs, 5Patch-For-Review: Convert updatetools.pl into a puppetized Python service with monitoring - https://phabricator.wikimedia.org/T94858#1202788 (10scfc) Table `users`: - All current users in the Tools project (excluded users are removed) who: -- have a username not starting with `tools.`, -- have `... [15:37:14] YuviPanda: Ping me once you're around. [15:39:39] andrewbogott: Okay to merge Add mholloway to bastion-only and researchers. (71a71cd)? [15:39:55] Coren: yes! sorry [15:42:19] andrewbogott: Do you have a random precise instance in test I can reboot without issue? [15:43:12] sure, reboot util-abogott [15:43:20] it’ll kick me off irc but I can live with that :) [15:43:58] You have your bouncer in labs? Wouldn't that make it hard to communicate in case of labs issues? :-) [15:44:43] Coren: yes, but it means I notice them right away [15:45:09] !log testlabs Rebooting util-abogott (idmap fix test) [15:45:11] Logged the message, Master [15:48:17] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precice instances - https://phabricator.wikimedia.org/T95555#1203229 (10coren) Idmap disabling works, pending a reboot or remount of all NFS filesystems (T95556). Next phase is turning off... [15:48:38] * Coren welcomes andrewbogott back. 
[15:48:52] looks fine to me [15:49:17] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1194417 (10coren) [15:49:19] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precice instances - https://phabricator.wikimedia.org/T95555#1203239 (10coren) [15:49:53] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precice instances - https://phabricator.wikimedia.org/T95555#1194404 (10coren) [16:01:01] 6Labs, 10Tool-Labs: Local hacks in /data/project/admin/toollabs on the web proxies - https://phabricator.wikimedia.org/T95821#1203290 (10Andrew) thank you! [16:19:16] 6Labs: Abolish use of ec2 ids - https://phabricator.wikimedia.org/T95910#1203350 (10Andrew) 3NEW a:3Andrew [16:19:54] 6Labs: Remove puppet and salt keys on instance deletion - https://phabricator.wikimedia.org/T95911#1203362 (10Andrew) 3NEW a:3Andrew [16:21:15] 6Labs: Abolish use of ec2 ids - https://phabricator.wikimedia.org/T95910#1203350 (10Andrew) [16:21:28] 6Labs, 10Continuous-Integration: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1203374 (10Krinkle) [16:21:59] 6Labs, 10Continuous-Integration: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1194719 (10Krinkle) >>! In T95569#1194807, @yuvipanda wrote: > Should I just delete all the data under the integration project, and let it start again from s... 
[16:22:13] 6Labs: Abolish use of ec2id - https://phabricator.wikimedia.org/T95480#1191859 (10Andrew) [16:22:50] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1203384 (10coren) This order now demonstrably works with Precise, with both the unmount-everything and the just-reboot methods allowing NFS-by-uid. Next up is schedule a resta... [16:26:16] andrewbogott: The only difference is that NFS now uses user ids instead of user names - it's essentially a noop unless you have more than one instance accessing the same files with diverging uids - and that can only happen for apt-managed users that aren't part of base. [16:26:41] yep, should be better [16:26:49] especially if we can use normal user auth on labstore* now :) [16:28:13] Not yet - there are a few steps left before that. But it's the necessary first step. [16:30:01] Now I have to run through every precise instance and either (a) umount all NFS mountpoints or (b) reboot them. The former is unlikely to work on any instance being actually used, but I'm thinking it's worth a salt run to attempt it if only to reduce the number of instances that need a reboot. [16:32:16] Coren: I saw your comment on the wikiviewstats no db connection ticket. Based on that, I'm guessing that tool needs new credentials; how do I go about getting some or can I use the creds for xtools itself? [16:32:48] T13|mobile: Can you point me at the ticket so I can refresh my memory? [16:33:22] @link [[phab:T91320]] [16:33:22] https://wikitech.wikimedia.org/wiki/phab:T91320 [16:35:15] T13|mobile: Ah, yes, it's not immediately clear what the issue was because I don't know the code, but I saw at least two sets of db credentials that contradicted each other and, as far as I could tell, the code used the wrong ones.
[16:35:50] T13|mobile: You can be sure you have the right credentials by deleting the replica.my.cnf file (it'll get regenerated) and making certain that the code uses that one. [16:38:33] Okay, I'll work on that in the next day or so. Figured I'd ask about it since I saw you responding now. :) [16:45:04] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration: Diamond metrics for cpu.system suddenly up 100% after a reboot - https://phabricator.wikimedia.org/T95912#1203449 (10Krinkle) 3NEW [18:00:47] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:48] Bleh. [18:00:49] I think that's my fault. [18:00:49] Big flush in progress. [18:00:54] I wonder if I can increase the IO priority of nfs. [18:00:56] poor grrrit [18:00:57] !ping [18:00:57] !pong [18:00:58] Yeah, NFS is starved out of I/O bandwidth atm. Should return shortly. [18:00:59] Is what you are doing why wm-bot isn't responding? [18:01:02] It's not what I'm doing - it's what labstore1001 is doing. :-( It decided to sync all to disk and it has a LOT of cache. [18:01:02] ... or it lost a shelf again. [18:01:03] * Coren checks. [18:01:03] Doesn't look like it. *phew* [18:01:04] Ah, looks like it's recovering now [18:01:04] (gradually) [18:01:04] is this why I can't ssh into android-build.eqiad.wmflabs? [18:01:04] bearND: Certainly. [18:01:05] That server has way too much writeback buffering. [18:01:05] * Coren needs to tune that. [18:01:05] Coren: ok, thanks. I guess I'll wait a bit then [18:01:06] Coren: any ETA on the labs recovery? [18:01:06] <[gifti]> what's up with it anyway? [18:01:06] Betacommand: Not sure - it should be running out of buffers to flush but I'm not seeing a clear way of knowing where it's at. [18:01:07] I did just up the NFS I/O priority so that will probably improve things. [18:01:07] A simple grep command gets stuck and I can't break it. [18:01:08] The problem with NFS. Robust, but scales very poorly.
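The replica.my.cnf advice at [16:35:50] above has two halves: delete the stale file (it gets regenerated by a maintenance job) and make sure the code reads the canonical file rather than a copied-out password. The second half could be sketched like this; the `[client]` section layout below is the usual MySQL option-file shape, assumed rather than quoted from Tool Labs documentation.

```python
import configparser

# Sketch: always load credentials from the regenerated replica.my.cnf at
# runtime, instead of hard-coding a (possibly stale or contradictory) copy.
# The [client] user/password layout is an assumption based on the usual
# MySQL option-file format.
def read_replica_cnf(path):
    cfg = configparser.ConfigParser()
    with open(path) as f:
        cfg.read_file(f)
    user = cfg["client"]["user"].strip("'\"")
    password = cfg["client"]["password"].strip("'\"")
    return user, password

# Usage (path hypothetical):
# user, pw = read_replica_cnf("/data/project/mytool/replica.my.cnf")
```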
[18:01:08] What was the cause of this issue? [18:01:08] I'm seeing a bit of NFS activity now; but it's a trickle. [18:02:14] Mjbmr: not sure what the root cause was, only the immediate issue. The NFS server just decided to flush all of its buffers to disk at once. [18:02:15] toollabs web requests, and shell commands like 'qstat' on the tools bastion are timing out. [18:02:15] That starved NFS out of I/O bandwidth. [18:03:50] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:08:17] NFS is working-ish now, but it has very little bandwidth. :-) [18:08:18] So very slow. [18:08:19] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:08:19] !ping [18:08:19] !pong [18:08:19] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [18:12:08] /bin/sh: execle: Cannot allocate memory [18:19:24] Mjbmr: Side effect of things piling up. [18:19:24] I see. [18:19:24] I think I'm having a repeat of the previous issue with the one shelf. [18:19:24] Mjbmr: stop and wait, once everything settles down then start your bot [18:19:25] it's not a bot. [18:19:25] I also use the tools project to contribute to MediaWiki. [18:19:27] This always happens to me while I'm working on something. But I'm not blaming people this time. [18:19:27] I would like to understand how it's fixable to avoid it happening again but unfortunately I have no clue on that. [18:19:37] Mjbmr: until Coren finishes troubleshooting there is nothing we can do [18:21:05] Mjbmr: some things are just out of our control and as such test our virtuosity. [18:21:05] well, if it was decided by more people, there would be more options to avoid it. [18:22:18] Mjbmr: odds are this is a hardware issue or other factor that isn't human dependent [18:22:32] ^^^ [18:23:07] yeah I know, but there's more than one option.
[18:24:27] Mjbmr: in some cases there is only one option. it's not the optimal answer but rather a fact of life [18:24:41] Is there an ETA on stability? Just curious if it's 10 minutes or 10 hours type thing. [18:24:58] hopefully 10mins :) we don’t know yet, mostly [18:25:02] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0] [18:25:44] (Core.n and Giuseppe are on it) [18:27:13] uh oh [18:27:14] YuviPanda: Erhm, did something just break? [18:27:14] multichill: NFS [18:27:32] !log tools.gerrit-reviewer-bot Down, NFS broke [18:27:39] see also /topic [18:27:41] Logged the message, Master [18:27:42] Oh, that bot is down too ;-) [18:27:45] multichill: should pick up automagically afterwards [18:28:08] multichill: as it reads emails to get recent changesets [18:28:32] was it a sync command? [18:49:49] Eloquence: Dude! [18:49:51] :) [18:49:52] Was just curious because I'm getting spammed to death with emails from xtools and can't get in to disable mail... lol [18:49:52] multichill: about 50 minutes ago [18:49:53] Can't we just get more servers in case one goes down? Is that so hard? [18:49:55] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:49:55] Mjbmr: yes, because then you need to keep those servers in sync [18:49:56] Mjbmr: after all, an empty NFS server wouldn't do much good either [18:49:56] no, separate. [18:49:57] valhallasw`cloud: though there are two nfs servers which use the same array so :) [18:49:58] technically there are more than one ;) [18:49:59] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:59] Mjbmr: are you offering a donation for equipment and wages to purchase and maintain it? [18:49:59] That's not the point. [18:49:59] JohnFLewis: yeah, but the issue is probably in the hardware itself. we’re investigating it [18:50:00] YuviPanda: stop replying to IRC and investigate and fix!
:p [18:50:00] T13|mobile: two others are already on it :) I’m just relaying info :) [18:50:00] Touché [18:50:01] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:50:01] YuviPanda: I didn't say it wasn't, just saying there are two systems so buying more won't help :) [18:50:01] YuviPanda: I wrote a manifest *o*o [18:50:02] oh totally :) [18:50:02] valhallasw`cloud: \o/ [18:50:02] without that second o [18:50:02] oh well [18:50:03] valhallasw`cloud: webservice should already do that for you tho [18:50:03] oh, a puppet manifest :D [18:50:03] or whatever it's called [18:50:03] YuviPanda: we should also find a way to make all those mails to root go to phab, I think [18:50:04] valhallasw`cloud: ah :D [18:50:04] valhallasw`cloud: nice! I left a comment! [18:50:04] but maybe in a smart way so we don't get 100s of tasks on issues [18:50:04] valhallasw`cloud: root@ can be sensitive (Maybe?) so need to be careful [18:50:05] dem tabs be on purpose [18:50:05] yeah [18:50:05] valhallasw`cloud: also SO MUCH CRONS [18:50:05] but they should all be tabs :< [18:50:06] (sudoers uses tabs, so I'm staying with that convention) [18:50:06] oh fair enough [18:50:06] I wonder if there’s a define you should use instead. let me look [18:50:06] there’s a sudo module [18:50:06] :{ [18:50:07] but that's 3rd party, right? [18:50:07] and it says "This module will purge your current sudo config" :P [18:50:07] valhallasw`cloud: no, everything in modules/* is ours :) [18:50:07] oh :D [18:50:08] heh [18:50:08] oh, it's another then maybe [18:50:08] let me check [18:50:08] valhallasw`cloud: well, if it’s tabs you should make them all tabs. I see a mix [18:50:08] yeah [18:50:09] working on it [18:50:09] hmyeah that's a cleaner thing to use maybe [18:50:09] extend that to options [18:50:09] but... do we use that? because then we wouldn't have error emails to begin with [18:55:13] YuviPanda: how do I figure out/set my sudo password on tools-dev?
[18:55:14] you can’t :) [18:55:15] :/ [18:55:16] T13|mobile: we don’t support passwords for sudo [18:55:17] I see. [18:55:18] NFS is recovering now. [18:55:18] Who's in charge of adding wikibugs to a channel again? I forget... [18:55:18] Would like to get Tools-Labs-xTools project phab stuffs in #xTools [18:55:35] T13|mobile: *waves* [18:55:36] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:59:06] T13|mobile: or send a change request [18:59:07] valhallasw`cloud: when you have time. I can put in a phab ticket if needed. [18:59:07] T13|mobile: it's a simple change [18:59:08] Or request on GitHub [18:59:08] I know, I just couldn't remember who/where to ask. lol [18:59:09] (03PS1) 10Merlijn van Deen: Tools-Labs-xTools to #xtools [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203877 [19:04:02] valhallasw`cloud: bad time but if you're up for merging stuff: https://gerrit.wikimedia.org/r/#/c/203564/ ;) [19:04:03] YuviPanda: also can we disable emails from tools-submit to one user? as I keep getting cronspam x2 because one user has an email that doesn't route [19:04:04] PROBLEM - Puppet failure on tools-exec-21 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [19:04:04] valhallasw`cloud: my bad 'Tool-Labs-xTools' [19:04:04] No s on tool [19:04:04] doh [19:04:05] (03CR) 10Merlijn van Deen: [C: 032] wmt: change tag name [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203564 (owner: 10John F. Lewis) [19:04:06] (03Merged) 10jenkins-bot: wmt: change tag name [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203564 (owner: 10John F. 
Lewis) [19:04:06] (03PS2) 10Merlijn van Deen: Tools-Labs-xTools to #xtools [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203877 [19:04:06] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:04:07] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:04:13] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [19:04:38] (03CR) 10Merlijn van Deen: [C: 032] Tools-Labs-xTools to #xtools [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203877 (owner: 10Merlijn van Deen) [19:04:51] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 94494 bytes in 3.237 second response time [19:05:07] (03Merged) 10jenkins-bot: Tools-Labs-xTools to #xtools [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/203877 (owner: 10Merlijn van Deen) [19:05:20] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [19:05:38] !log tools.wikibugs Updated channels.yaml to: 4500101f021b8eec83899848932edaee98bd680a Merge "Tools-Labs-xTools to #xtools" [19:05:42] Logged the message, Master [19:06:52] TIL: there's a command 'timeout' that does exactly what you think it does [19:11:41] 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1204203 (10Cmjohnson) The disk has been replaced. The tracking n... 
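The `timeout` command mentioned at [19:06:52] above is part of GNU coreutils and does exactly what the name suggests: run a command and kill it if it exceeds a time limit, which is handy for cron jobs and shell commands that might hang on a stalled NFS mount.

```shell
# GNU coreutils 'timeout': run a command, kill it after the limit.
# Exit status 124 means the time limit was hit.
timeout 2 sleep 10
echo "status: $?"    # 124: sleep was killed after 2 seconds

# A command that finishes in time passes its own exit status through.
timeout 5 true
echo "status: $?"    # 0
```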
[19:12:41] 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1204204 (10Cmjohnson) 5Open>3Resolved [20:15:04] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:06] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:07] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:07] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:08] RECOVERY - Puppet failure on tools-exec-21 is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:09] Is bastion.eqiad.wmflabs sick? I'm having trouble getting to several different vms in deployment-prep [20:15:09] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:10] Confirmed with other projects as well. Error is "fork failed: Resource temporarily unavailable" [20:15:10] YuviPanda: ^ Am I using the wrong bastion host name? "ProxyCommand ssh -a -W %h:%p bastion.eqiad.wmflabs" [20:15:11] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:11] * bd808 sees the topic now [20:15:17] so what's the status? [20:15:28] NSF back up yet? [20:15:29] Or NFS [20:15:31] T13|mobile: that will be announced here whenever it is up [20:15:31] not much more to do but waiting [20:15:34] My tool has been down for 50 minutes. Anyone know if the current issues will soon be sorted out? [20:15:34] danmichaelo: Coren is working on it. There will be an announcement when it's ready.
[20:15:35] 6Labs, 10Continuous-Integration: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1204249 (10Krinkle) p:5Low>3High [20:16:00] Thanks! Btw. perhaps it would be a good idea to move the IRC logs off Tool Labs, since it would be very useful to be able to check them when Tool Labs is down to see what's going on :) [20:17:06] danmichaelo: you mean wm-bot's logs? [20:17:18] yeah [20:18:24] While you're normally right, it wouldn't help in this case. The alternative location would be bots and that is affected as well. [20:19:16] I might be able to be a bit optimistic now... [20:19:36] 6Labs, 10hardware-requests, 6operations: eqiad: (6) labs virt nodes - https://phabricator.wikimedia.org/T89752#1204471 (10RobH) 5stalled>3Resolved [20:19:51] Oh yeah Coren? YuviPanda suggested it would be 10 more minutes 90 minutes ago. [20:20:04] :p [20:20:08] I also said I didn't know :) [20:20:27] I said "suggested", didn't I? :p [20:22:08] Things now seem to be waking up slowly. [20:22:27] !ping [20:22:27] !pong [20:22:41] Much better there. :) [20:26:03] things are coming back up to normal [20:26:05] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, 10Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1204487 (10mmodell) We discussed this in the phab meeting today and it's on the radar. @chasemp has built a tool that can... [20:31:04] (spoke too soon!) [20:32:08] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, 10Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1204499 (10Technical13) Oh my. This is quite a mess. I wasn't aware of this ticket when I created Tool-Labs-xTools and st... [20:35:09] Why does my name have a space after it? Odd. [20:35:26] Oh, looks like they all do. [20:47:48] YuviPanda: -submit crumbled over the load. Ima restart it.
[20:49:23] ok [20:49:24] it’s also a SPOF for data (cron!) [20:49:24] Best practice sez: "Don't rely on the spooled crontab, always use 'crontab' on a file" but yeah. [20:49:24] I think cron info should live in the service manifests too [20:49:41] Well, I'd stuff the crontabs there for sure. [20:49:53] Not sure I'd want to replicate the whole functionality. [20:50:26] oh no, it’ll still use cron [20:50:39] just not the current crontab / ssh hack [20:51:18] you should just edit your service.manifest file and it should be ok [20:52:15] 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1204625 (10Andrew) labvirt1004 is working fine now. Thank you! [20:54:27] YuviPanda: My food just got here. Ima go eat real quick then brb. [20:54:34] Coren: alright [20:56:14] 6Labs, 10Continuous-Integration: Purge graphite data for deleted integration instances and nonexistent metrics - https://phabricator.wikimedia.org/T95569#1204631 (10yuvipanda) 5Open>3Resolved a:3yuvipanda All done now. FTR, the way to do this is: # Move the appropriate metrics (found in `/srv/carbon/w... [21:11:09] !log tools restart portgranter on all webgrid nodes [21:11:12] Logged the message, Master [21:12:27] Coren: all webservices are up and running :) [21:25:53] Coren: poke when back? [21:30:30] 10Tool-Labs: Make admin www relocatable - https://phabricator.wikimedia.org/T95808#1204824 (10yuvipanda) I (finally) took a look at the current admin code, and I think it should basically be cleaned up heavily - much better PHP or move to python. [21:41:16] YuviPanda: Almost back - done eating now need the smoke and tea.
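The crontab best practice quoted at [20:49:24] ("always use 'crontab' on a file") looks like this in practice: keep the schedule in a version-controlled file and reinstall it wholesale, so the cron spool is never the only copy. The tool name, paths, and `jsub` flags below are illustrative assumptions, not the actual setup of any tool.

```
# mytool.crontab -- keep this file in the tool's repository; install with:
#   crontab mytool.crontab
# so the spooled crontab can always be reconstructed from the file.
# (tool name, paths and flags are hypothetical examples)

# m  h  dom mon dow  command
*/10 *  *   *   *    jsub -once -quiet /data/project/mytool/bin/update.sh
0    3  *   *   *    jsub -once -quiet /data/project/mytool/bin/daily-report.sh
```

Editing the file and re-running `crontab mytool.crontab` replaces the whole spool atomically, which also makes the crontab easy to carry along in a service manifest as discussed above.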
I'll be there in ~5m [21:41:24] Coren: cool [21:46:02] * Coren grumbles [21:46:16] Minimize window != close window [21:47:25] heh [21:49:53] I see the graphs are actually settling down somewhat [21:51:40] should we do the incident report? https://etherpad.wikimedia.org/p/labs-nfs-20140413-nfsoutage [21:51:46] Coren: all webservices that should be up seem up! \o/ [21:54:57] Coren: I have to go see HR again, I’ll be back in 5mins :( [21:57:55] YuviPanda: I have like a bazillion house things to do - I'll spend time on the incident report later tonight, in the meantime Ima keep an eye on it. [22:01:46] Coren: alright! [22:04:45] Coren: okay. I rm replica.my.cnf like you said, how long does it usually take to be recreated? [22:09:53] T13|mobile: A minute or two, but I haven't restarted that job while we were waiting for labstore to revive properly. It's good enough now, so lemme start it [22:10:30] Thanks. :) [22:11:39] * creds for tools.wikiviewstats (s51359) added to /srv/project/tools/project/wikiviewstats/replica.my.cnf [22:12:41] Cool. I'll work on that now. [22:23:18] PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147) [22:25:28] YuviPanda: Is that you? ^^ [22:25:38] um, I thought we had deleted it ages ago [22:25:46] * YuviPanda checks [22:26:22] and I can’t find it on NovaInstance anymore [22:26:26] so I'm not fully sure what that’s from [22:29:05] YuviPanda: phantom instance! [22:45:34] 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 3Continuous-Integration-Isolation: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205215 (10hashar) p:5Normal>3Low Lowering priority, at the start I guess we can afford having our sm...
[22:53:35] 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 3Continuous-Integration-Isolation: Support dedicating a specific virt node to a specific nova project - https://phabricator.wikimedia.org/T84989#1205248 (10Andrew) Just now chase and I have confirmed that the proper mechanism to direct particular VMs... [23:00:35] Coren, YuviPanda: I want to call your attention to the new ‘wmf_scheduler_hosts_pool’ setting that I added in https://gerrit.wikimedia.org/r/#/c/203969/2/modules/openstack/templates/icehouse/nova/nova.conf.erb [23:01:08] ah, nice [23:01:11] That lets us mark compute nodes as still running but refusing future instances. [23:01:15] nice! [23:01:19] I’ll be twiddling that a lot in the coming days. [23:01:44] OpenStack has several built-in features that should do that properly in a future release, but we couldn’t get reliable behavior out of them in icehouse so I wrote a custom filter. [23:01:54] :D [23:02:25] So, if you get an uneasy sense of doom around a particular virt host, best to yank it out of that list :) [23:05:17] andrewbogott: That certainly got my attention, because that could have saved our bacon before. :-) [23:06:09] Maybe — it has limited utility, mostly I’m hoping it will let me live-migrate away from a host without it filling right back up [23:06:23] assuming live-migration works reliably :/ [23:24:03] Coren: given a filesystem on virt1010, can we deduce the mkfs command that I ran to create it? (So I can duplicate it on the new hosts, and document) [23:24:35] /dev/sdb1 on /var/lib/nova/instances type xfs (rw) [23:25:16] so probably that’s just ‘mkfs -t xfs /dev/sdb1’ — any more complex than that? [23:26:29] andrewbogott: You can also use dumpe2fs to see any options that may have been used [23:27:28] dumpe2fs says ‘Couldn't find valid filesystem superblock.' [23:27:43] ...what? [23:28:52] Needs root, and gotta be run on the device not the mountpoint. Otherwise, that's a really scary error [23:29:37] Want to check my work?
[23:29:45] (also, -h makes the output more useful) [23:30:00] Where and what device? [23:30:06] Oh, wait. Duh [23:30:09] "xfs" [23:30:18] oh, dumpe2fs doesn’t like xfs? [23:30:22] So yeah, dump*e2*fs won't work. :-) [23:31:04] ok, so I’m back to thinking I should just run mkfs -t xfs /dev/sdb1 [23:31:10] any reason to hesitate? [23:31:16] The xfs defaults tend to be reasonable [23:31:21] ok [23:31:39] now, just as long as I don’t type this in the wrong window... [23:43:53] andrewbogott: Thankfully, mkfs will complain loudly if the device is mounted. :-) [23:44:04] yep [23:44:16] which it did — turns out the fs was there and puppet was just trying to mount the wrong device. [23:44:31] thanks to a hiera setting I missed. All working now. [23:47:24] That's the one thing I dislike about hiera; state is now sometimes hidden outside the manifest and dependencies are not as clear as they once were [23:51:54] 6Labs, 6operations, 10ops-eqiad: labvirt100x boxes 'no carrier' on eth1 - https://phabricator.wikimedia.org/T95973#1205369 (10Andrew) 3NEW a:3Cmjohnson [23:54:18] bmansurov: now just say outload what you need and one of the watchers (Coren / andrewbogott) can help you with access :) [23:54:26] *out loud [23:54:32] JohnFLewis: thanks! [23:55:12] Coren: I agree, although maybe I’ll get in the habit of looking there eventually. [23:55:24] Coren, andrewbogott: hello, could you please grant me access to tools-login.wmflabs.org so that I can set jouncebot up to remind the MobileFrontend folks to cut branches? [23:55:39] bmansurov: what’s your username on wikitech? [23:55:44] andrewbogott: bmansurov [23:55:56] well, that’s sumple enough [23:56:01] *simple [23:56:09] yes, easier for me to remember [23:56:43] bmansurov: try it now? [23:57:01] andrewbogott: thank you! [23:57:15] bmansurov: I’m about to go for dinner — is that all you need for now? [23:57:31] andrewbogott: yes, enjoy your dinner
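A footnote on the dumpe2fs confusion at [23:27:28]: dumpe2fs only understands ext2/3/4 superblocks, which is why it reported "Couldn't find valid filesystem superblock" on an XFS device. The XFS-side equivalents of the commands discussed above are `xfs_info` and `mkfs.xfs`. This is an administrative sketch, not something to run casually; the device and mountpoint are the ones named in the chat.

```
# Inspect the geometry/options of an existing (mounted) XFS filesystem:
xfs_info /var/lib/nova/instances

# Recreate it elsewhere with the defaults, the equivalent of the
# 'mkfs -t xfs /dev/sdb1' discussed above:
mkfs.xfs /dev/sdb1    # destructive; refuses to run on a mounted device
```

As noted in the chat, the XFS defaults are generally reasonable, so capturing non-default options with `xfs_info` before re-running mkfs is usually all the documentation needed.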