[01:59:35] PROBLEM - Puppet failure on tools-webproxy-jessie is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:10:57] Wikimedia-Extension-setup, Wikimedia-Labs-wikitech-interface, Wikimedia-DNS: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#995691 (Glaisher) [06:19:28] (CR) Krinkle: [C: 1] Un-flow the config file... [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/186905 (owner: Awight) [06:36:44] Wikimedia-Extension-setup, Wikimedia-Labs-wikitech-interface, Wikimedia-DNS: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#995765 (Dzahn) Yes, let's add the DNS entry. I once suggested to generate the .m. entries in DNS from a template for all wikis in wikimedia.org but it was... [07:00:12] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [07:01:23] Wikimedia-Extension-setup, Wikimedia-Labs-wikitech-interface, Wikimedia-DNS: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#995807 (Dzahn) [07:25:07] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [09:45:24] Tool-Labs: webservice2 not starting - https://phabricator.wikimedia.org/T87641#995883 (Magnus) NEW [11:43:10] Tool-Labs: Cannot start java processes using the grid engine - https://phabricator.wikimedia.org/T69588#995944 (Nosy) any news here? my workaround is not working any longer so i have to use sge and dont know whats wrong [11:46:11] Tool-Labs: Cannot start java processes using the grid engine - https://phabricator.wikimedia.org/T69588#995953 (Nosy) These are my current SGE parameters: #SGE-stuff: #$ -q task #$ -l h_vmem=4000M [15:27:59] I have problems when running a continuous bot... [15:28:32] when I write "jstart -continuous phenny.py" it returns "Your job 7657687 ("phenny") has been submitted" [15:28:46] but the bot doesn't do anything...
[17:35:14] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:36:07] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:36:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:37:15] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:38:05] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:38:48] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:39:32] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:07:52] anomie, can I borrow you for a second? [18:08:43] Cyberpower678: Maybe [18:09:00] I need access to an OAuth Consumer. [18:09:19] Hedonil published the X!'s Tools OAuth Consumer and I need to modify it. [18:09:33] Since Hedonil no longer maintains OAuth [18:09:37] *xTools [18:10:02] anomie, I was wondering what can be done in those cases. ^ [18:10:45] anomie, which does bring me to a feature suggestion for OAuth. Allow other users to access it if the user wishes. [18:14:10] Cyberpower678: In general, the solution is to create a new consumer. Many modifications require a new consumer anyway. [18:15:26] !log deployment-prep upgrading libc6 on all instances from deployment-salt [18:15:40] anomie, what happens to duplicate consumers? [18:15:49] Logged the message, dummy [18:16:53] anomie, can you remind how to create another consumer? 
[18:23:44] Cyberpower678: Define "duplicate" [18:23:44] Cyberpower678: https://www.mediawiki.org/wiki/Special:OAuthConsumerRegistration/propose [18:24:15] Well I'm creating another X!'s Tools consumer v1.1 [18:27:14] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0] [18:27:24] why I'm getting 'Error: no "print" mailcap rules found for type "text/x-python"' on labs now? [18:27:27] anomie, what's the IP Tool Labs has. I want to add that to a restriction [18:29:30] Cyberpower678: That's a good question, and probably one for Coren. It may depend on which exec host, which datacenter once codfw comes online, and so on. [18:29:31] Coren, are you here? [18:30:01] I assume you are since you are not marked away. [18:30:12] Wikimedia-Labs-Infrastructure: Internal DNS look-ups fail every once in a while - https://phabricator.wikimedia.org/T72076#996328 (yuvipanda) a:coren>yuvipanda [18:30:50] !log rebooting tungsten [18:30:50] rebooting is not a valid project. [18:31:40] Cyberpower678: I'm here-ish. Is it urgent? [18:32:26] I have 2 things. 1st, about wikiviewstats... and 2nd what is the IP the toollabs web service has? [18:32:33] Coren, ^ [18:32:34] why everything is wrong with labs? [18:32:44] Mjbmr: for example? [18:32:54] read above [18:33:39] Mjbmr: ? [18:33:45] How much above? [18:33:51] Mjbmr: why are you using 'print' in bash? [18:33:52] andrewbogott, Mjbmr> why I'm getting 'Error: no "print" mailcap rules found for type "text/x-python"' on labs now? [18:33:53] Mjbmr: why I'm getting 'Error: no "print" mailcap rules found for type "text/x-python"' on labs now? [18:33:53] Oh I see [18:33:56] python [18:34:16] I moved the same script to a private server and it runs ok.
[18:34:23] Mjbmr: that's the error you get when you 'print x.py' *in a shell* [18:34:32] you probably want to do python x.py instead ;-) [18:34:36] yeah [18:34:55] no [18:34:57] no no [18:35:02] no I didn't do that [18:35:19] oh yeah [18:35:24] I'm retarded [18:35:47] (fwiw, labs is going to restart everywhere shortly, to deal with some security patches) [18:36:01] maybe I'm just angry for other issues on labs. [18:36:08] Mjbmr: and next time, please google the error. The first hit for 'no "print" mailcap rules found' is https://stackoverflow.com/questions/14116748/need-to-assign-the-contents-of-a-text-file-to-a-variable-in-a-bash-script which immediately shows what's happening [18:36:20] https://phabricator.wikimedia.org/T87646 [18:36:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [18:36:41] Cyberpower678: do you mean the proxy (i.e. the server that runs the tools.wmflabs.org frontend) or the actual webservices? [18:36:44] see this https://fa.wikinews.org/w/index.php?title=Special:Recentchanges&limit=500&hidebots=0&uselang=en [18:36:45] ok, brief labs network outage coming up in 5. (sorry!) [18:37:24] valhallasw`cloud, what ever is visible to the end user that got serviced. [18:37:37] Or server if it's making a request to it. [18:37:43] Cyberpower678: labs ip range 2620:0:860:0:0:0:0:0/46 [18:37:50] Cyberpower678: that would be tools.wmflabs.org, so 208.80.155.131 [18:38:07] Cyberpower678: but I have the feeling I [18:38:14] I'm misunderstanding what you're looking for [18:38:41] Suppose my script on tools queries another server. What IP will that server see. [18:38:50] Cyberpower678: ah! that depends. [18:39:02] I need that IP to set the OAuth Consumer restriction [18:39:16] To only allow IPs from labs but no where else. 
[18:39:16] Cyberpower678: some exec hosts have their own public IP, some use the NAT [18:39:28] Cyberpower678: and OAuth is special, as it'd use an internal network [18:39:43] Cyberpower678: I think that's the 10.x.x.x network range [18:40:32] YuviPanda: it's db error, check db or my link [18:40:36] Cyberpower678: try curl -v https://www.mediawiki.org/wiki/Special:MyPage [18:40:38] That's a big range [18:56:48] Cyberpower678: sure, but 10.x.x.x is all internal, so that should be OK? [18:56:48] * Cyberpower678 is confused. [18:56:48] is rebooting? [18:56:49] valhallasw`cloud, I'm just going to go without restrictions [18:56:49] Mjbmr: andrewbogott just announced that [18:56:49] Mjbmr: short network outage [18:56:49] Sorry about the unannounced reboots - security fixes [18:56:50] oh, also reboots. I see :-) [18:56:51] Cyberpower678: that's what I did for gerrit-patch-uploader; makes debugging locally also easier [18:56:51] am I allowed to run bnc for my self on labs? [18:56:52] anomie, I'm getting a 502 when trying to propose a consumer. [18:56:52] tools: The fingerprint for the RSA key sent by the remote host is c7:18:d8:74:b8:6a:d3:65:3e:d8:2f:57:7b:52:e8:4e [18:56:52] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [18:56:53] Correct? [18:56:53] Nemo_bis: /which/ tools host? :P [18:56:53] Nemo_bis: also, the fingerprints are on wikitech somewhere, which is safer than trusting random irc users :D [18:56:54] Mjbmr: nope [18:56:54] valhallasw`cloud: yes, but they just changed, apparenlty [18:56:54] no bouncers, and no long running processes on tools-login / bastions please [18:56:54] ok… labs network should be back -- working ok for everyone? [18:56:55] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:56:55] Coren, I know you're incredibly busy and are hard at work in your Jacuzzi, but could you spare a minute for wikiviewstats tool? 
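[editor's note] The address ranges quoted in the exchange above (the 10.x.x.x internal network, the IPv6 range 2620:0:860:0:0:0:0:0/46, and 208.80.155.131 for tools.wmflabs.org) can be checked with a short script. This is only a sketch: reading "10.x.x.x" as 10.0.0.0/8 is an assumption, the name in_quoted_ranges is hypothetical, and nothing here reflects how OAuth consumer IP restrictions are actually evaluated.

```python
import ipaddress

# Ranges as quoted in the chat. Interpreting "10.x.x.x" as 10.0.0.0/8 is an
# assumption; the log never states an exact prefix length.
QUOTED_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("2620:0:860::/46"),
    ipaddress.ip_network("208.80.155.131/32"),
]


def in_quoted_ranges(address: str) -> bool:
    """Return True if the address falls inside any range quoted in the log.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(address)
    # Membership tests between IPv4 and IPv6 objects simply return False,
    # so mixed address families in the list are safe.
    return any(ip in network for network in QUOTED_RANGES)


if __name__ == "__main__":
    for candidate in ("10.68.16.10", "208.80.155.131", "192.0.2.1"):
        print(candidate, in_quoted_ranges(candidate))
```

As noted in the conversation, the address a remote server sees depends on whether the exec host has its own public IP or goes through the NAT, so a fixed list like this may be incomplete.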
[18:56:55] * Nemo_bis trying [18:56:55] YuviPanda: is there anywhere? [18:56:56] More breakage coming up in an hour or so :) [18:56:56] Mjbmr: not on labs, no. You should run your personal services on your own servers :) [18:56:56] Cyberpower678: You'll have to be a bit more precise about what it is you need before I can tell you if I cna do it [18:56:56] damn [18:56:57] I need access to that tool. [18:56:57] is there anything I should not pay for wmf? [18:56:58] You said to give you a couple of days, it's been 25 [18:56:58] Coren, ^ [18:56:58] Mjbmr: donations? [18:56:58] * Nemo_bis still has stuck connection [18:56:58] Cyberpower678: Try again in a little bit, I hear ops is doing something at the moment. [18:56:58] Nope, failes. [18:56:59] no, paying for private server to run a bnc or bots. [18:56:59] https://wikitech.wikimedia.org/w/index.php?title=Special:Search&search=c7%3A18%3Ad8%3A74%3Ab8%3A6a%3Ad3%3A65%3A3e%3Ad8%3A2f%3A57%3A7b%3A52%3Ae8%3A4e&fulltext=Search&profile=all [18:56:59] Cyberpower678: Yeah, I know. Things have been hectic because of the all-hands, etc. I can actually corner one of the right people physically where I am now, so I'll try that. [19:00:44] CP678: I don't know how to add maintainers yet. [19:00:56] Nvm then. [19:01:03] :) [19:01:29] T13|detached: go to https://tools.wmflabs.org, scroll to the tool and click the 'edit maintainers' link [19:03:04] valhallasw`cloud: if it works. The list is empty. [19:03:47] CP678: doh! that probably has to do with the server reboots [19:04:00] valhallasw`cloud: can it be done from command line in bash? [19:04:18] T13|detached: no, needs to be done via wikitech (which the link on tools.wmflabs.org deeplinks to) [19:04:24] Coren: whats going on? 
Im seeing my bots doing something odd [19:04:25] T13|detached: but I can never find it in the actual wikitech interface :D [19:04:31] Betacommand: reboots for security updates [19:04:35] Betacommand: see /topic [19:04:39] and labs-l [19:04:49] the bots should have died then [19:04:53] (http://www.openwall.com/lists/oss-security/2015/01/27/9 is the vulnerability) [19:05:00] we rebooted the network node [19:05:06] Im getting spammed about them authing on freenode [19:05:24] Betacommand: the magic grid fixes all the things! [19:05:37] WTF??? [19:05:42] Hrmm Then we'd have to be able to log into wikitech as xtools for that to work, which doesn't exist as far as I know. [19:06:27] T13|detached: no, everyone in the 'xtools' user group has access to all groups the 'xtools' group has access to [19:06:46] I haven't rebooted any instances yet. Just the network service. [19:06:48] Oh. Okay.. trying [19:07:25] T13|detached: but, as all other times I've tried, I can't find the correct special page on wikitech [19:08:32] whats the ETA for the last of the reboots? [19:09:47] andrewbogott: ^ [19:10:07] T13|detached: https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup , then enter 'tools' as project name, click 'set filter' [19:10:13] T13|detached: and find the tool there [19:10:18] valhallasw`cloud: if a tool has access to another tool. Then users have to become tool1 and then become tool2 [19:10:35] You can't become tool2 directly. [19:10:55] CP678: dunno. The wikitech interface states Beware! Adding a service user to this group will allow all members of that group to access this group. It will also allow access to members of any group added to that group, and so on. Make sure you trust everyone and know what you are doing before selecting anything in the "Service user" section. [19:11:11] so I assume it should work on wikitech [19:11:11] And I connected again [19:11:36] I'm talking about the shell. Don't know about Wikitech [19:12:25] CP678: Sure. 
Didn't know that -- good to know. [19:12:26] andrewbogott: I’ve increased conntrack limit on labnet1001 as well. Have to set notrack soon. [19:16:26] andrewbogott: reboots done? [19:18:06] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [19:18:23] Betacommand: the networking reboot is done, instance reboots haven’t started yet [19:20:18] Hi. Sorry for the dumb question. If I want a public MediaWiki instance on Labs, how do I do that? [19:20:23] Are there docs somewhere? [19:20:56] Something similar to what http://mm-ch.wmflabs.org/ is/was. [19:24:01] Fiona: labs-vagrant [19:24:14] Fiona: https://wikitech.wikimedia.org/wiki/Labs-vagrant [19:24:24] Cool, will take a look. [19:26:54] Fiona: what will it host? [19:27:12] Betacommand: Just a test MediaWiki instance to showcase/prototype a new skin [19:27:15] . [19:27:26] T13|detached, did I miss anything? [19:27:34] Gerrit-Patch-Uploader: Gerrit Patch Uploader answers with "400 - Bad Request" after submit - https://phabricator.wikimedia.org/T87663#996480 (Fomafix) Open>Resolved a:Fomafix Works again [19:27:36] anomie, can you approve my new consumer? [19:27:44] ok, wikitech outage coming up [19:27:58] Cyberpower678: Busy right now [19:27:58] andrewbogott: let me silence shinken [19:28:04] anomie, :( [19:28:07] Cyberpower678: no idea. Bouncing around [19:32:37] YuviPanda: whats the ETA to get the exec nodes rebooted? [19:33:13] hello [19:33:26] Betacommand: andrewbogott would know. [19:33:38] andrewbogott: are you restarting project by project/ [19:33:40] or host by host? [19:33:45] Betacommand: host [19:33:49] um… YuviPanda host [19:33:56] andrewbogott: ok. [19:34:01] so things will be a bit staggered, all exec nodes shouldn't be out at the same time. [19:34:05] Cyberpower678: you can test with your own user account even if it's not approved [19:34:09] andrewbogott: hmm, ok [19:34:19] maybe it’s a good idea to have shinken *up* :) [19:35:55] ok, wikitech is back.
Sessions are almost certainly ruined, so folks will want to log out and in again [19:36:32] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:36:34] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:36:54] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [0.0] [19:36:56] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:37:05] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:37:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:37:55] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:38:00] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [19:38:06] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:39:47] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:39:47] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:39:53] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [19:40:11] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:40:13] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [19:40:30] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] 
[19:40:32] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [19:41:22] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [19:41:33] YuviPanda: do you know why we're getting so many "failed: Can't connect to MySQL server" messages? That shouldn't be anything that I've broken [19:41:57] andrewbogott: on where? wikitech? [19:42:01] emails [19:42:23] valhallasw`cloud, I need to get it up ASAP. I'm getting nagged everywhere that OAuth isn't working. [19:42:39] * Cyberpower678 switches over to Windows. [19:42:56] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:43:00] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:45:19] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:49:21] * Cyberpower678 has switched to Windows [19:50:01] PROBLEM - Host tools-shadow is DOWN: CRITICAL - Host Unreachable (10.68.16.10) [19:50:11] PROBLEM - Host tools-mail is DOWN: CRITICAL - Host Unreachable (10.68.16.27) [19:50:39] PROBLEM - Host tools-exec-12 is DOWN: CRITICAL - Host Unreachable (10.68.17.166) [19:51:59] Cyberpower678: how do you feel now? [19:52:11] Mjbmr, ? [19:52:20] Windows? [19:52:28] What about it? [19:52:38] does it feel good? [19:52:50] It's windows, so not really. [19:52:58] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0] [19:53:12] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:54:29] well I feel a life on windows. [19:55:02] RECOVERY - Host tools-shadow is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [19:56:04] andrewbogott: is it safe to start launching tools again? 
[19:56:29] Betacommand: no, I've only just begun rebooting hosts. It'll be a couple hours before things are all done [19:56:39] So far, tools-exec-12 is the only safe one :) [19:56:57] RECOVERY - Host tools-mail is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [19:57:10] andrewbogott: why not just reboot them all and be done with it? [19:57:28] Betacommand: when a host reboots the instances are switched to 'shutoff' state by default. [19:57:36] And then I have to launch them by hand. But only the ones that were up before. [19:57:55] And also I need to eat lunch in five minutes because I can't concentrate :( [19:58:17] RECOVERY - Host tools-exec-12 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [20:00:16] I'm trying to understand what was the bug cause of rebooting. [20:00:25] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:28] Mjbmr: http://www.ubuntu.com/usn/usn-2485-1/ [20:00:37] we emailed labs-l. you should subscribe! [20:01:41] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:01:53] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [20:01:58] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [20:02:36] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [20:02:58] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0] [20:03:08] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [20:03:12] I saw the link, I said the bug, more details about the bug. [20:03:49] Mjbmr: there's a CVE reference at the bottom of that link.... 
[20:04:01] ok [20:04:54] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [20:05:14] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:06:22] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [20:06:32] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:06:51] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:07:15] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [20:08:00] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [20:08:04] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [20:09:43] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:09:47] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:05] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:15:35] YuviPanda: Any ETA on tools-db being back up? [20:17:38] Any "timetable" (hours, days, ArbCom-case; that's used as a timeframe now) on when tools-db being back up? [20:21:22] anomie: yeah looking for / at it now [20:21:26] No eta yet [20:22:06] Ok. I'll try to make due with external tools for the time being then. [20:25:35] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [20:25:41] andrewbogott: 10.64.20.3 [20:25:54] ae2-1118.cr2-eqiad.wikimedia.org [20:27:01] ... "[...] failure [...] is OK [...]" ... [20:30:17] OK… /now/ I am starting to reboot some things. [20:31:11] heh [20:31:32] actually… I think I'll wait for Coren to resurface since he'll want to much with tools. 
[20:44:05] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [20:46:41] PROBLEM - Host tools-exec-06 is DOWN: CRITICAL - Host Unreachable (10.68.16.35) [20:46:53] PROBLEM - Host tools-exec-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.31) [20:53:17] RECOVERY - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [20:53:25] anomie: tools-db fixed for no w [20:53:29] YuviPanda, andrewbogott: isn't bastion.wmflabs.org still supposed to respond to ping? [20:53:31] YuviPanda: Thanks [20:53:35] ssh: connect to host bastion.wmflabs.org port 22: Network is unreachable [20:53:37] PING bastion.wmflabs.org (208.80.155.129): 56 data bytes [20:53:39] Request timeout for icmp_seq 0 [20:54:09] Krenair: Krinkle we’re rebooting all instances, see /topic [20:54:53] alright so tools-db is up, yay, but no bastion [20:55:01] So first irc.wikimedia.org goes down, and kills all countervandalism tools. Bots been down for over an hour. and still can't access the server to repair services. [20:55:21] Well orchestrated. [20:55:26] Krinkle, Krenair, yep I just rebooted the bastion. [20:55:43] I need to reboot the hosts that the labs instances live on. So I don't have much choice about what order they go. [20:55:50] Their organization on the virt hosts is largely random [20:58:55] Hm.. ideally bastion would go first so that instances are responsive as long as possible on both sides of the reboot [20:59:25] Yes, but the bastions are (wisely) distributed amongst the virt hosts. And since I have to reboot the entire virt host, they're going down in arbitrary groups. [20:59:36] It's not really possible to reboot all of virt1001 but keep a single VM untouched :( [23:31:59] andrewbogott: YuviPanda: CVN is back up. Thanks :) [23:32:33] <^d> YuviPanda: Where are we on virt1009? [23:33:21] ^d: andrewbogott would know [23:33:47] ^d: it's broken [23:34:46] <^d> Well shitz. 
[23:35:16] YuviPanda: I restarted all of wikibugs [23:35:24] legoktm: is it working now? [23:35:29] legoktm: is tools-redis up? [23:35:56] apparently not. [23:36:13] tools-redis:6379> ping [23:36:13] PONG [23:36:43] ^d: it's broken in a very interesting way! [23:37:06] * ^d takes a hammer to virt1009 [23:37:13] <^d> I'll break it in some interesting ways! [23:37:51] Actually right now it's stuck at a grub prompt with no keyboard access. But if we get past that, /then/ it will be broken in an interesting way