[00:13:48] !log Upgraded elasticsearch to 1.3.2 on deployment-logstash1 [00:13:48] Upgraded is not a valid project. [00:13:55] !log deployment-prep Upgraded elasticsearch to 1.3.2 on deployment-logstash1 [00:13:57] Logged the message, Master [00:33:26] where is the documentation about how to create a new project? [00:36:55] dMaggot: There is a page on wikitech somewhere .... looking [00:37:26] dMaggot: https://wikitech.wikimedia.org/wiki/Category:New_Project_Requests [00:38:07] how was I supposed to know I needed to add a page in that category to request a new project? [00:38:17] Magic! -- https://wikitech.wikimedia.org/w/index.php?title=New_Project_Request/&action=formedit [00:38:27] I just searched for "new project" [00:39:03] we should put a redirect in just "New_Project_Request" [00:39:05] bd808: I did that too, I'm asking where is it documented as "Step 1: create a page in this category" [00:39:07] that's where i looked as well [00:39:11] because all the other pages are below it [00:39:43] https://wikitech.wikimedia.org/wiki/Help:Contents#Requesting_A_New_Project [00:39:44] dMaggot: https://wikitech.wikimedia.org/wiki/Help:Contents#Requesting_A_New_Project [00:40:26] mutante: that, thanks [00:40:26] eh, Special:FormEdit/New_Project_Request is different from New_Project_Request/&action=formedit [00:40:28] that needs better SEO [00:41:07] putting it two clicks away from the front page is hiding it [00:41:16] hehe, yea, i have no shortage of SEO people mailing us who promise "more organic searches" for Wikipedia [00:41:21] ^d: Around? How can we boost the search ranking of https://wikitech.wikimedia.org/wiki/Help:Contents#Requesting_A_New_Project for "new project" ? [00:41:38] put it in the side bar [00:41:43] under Tools [00:41:59] MediaWiki:Sidebar or whatnot [00:42:03] mutante: that makes too much sense :) [00:42:45] also, the category is still kind of full [00:42:51] old requests? 
[00:43:13] looks [00:43:30] I'm trying to remember what was the migration that killed my old project, what was that migration about that killed projects in old bastion? [00:43:47] I apparently don't have rights to edit https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar [00:43:55] moving out of the Tampa datacenter i suppose [00:43:59] dMaggot: [00:44:07] mutante: yes, I think it was about that [00:44:39] dMaggot: what do you need? [00:44:44] what project was it [00:44:52] That was in April or there abouts. I think the old projects that weren't actively migrated were archived [00:45:15] do you need old files or just the project created again under the same name [00:45:38] mutante: https://wikitech.wikimedia.org/wiki/New_Project_Request/Wiki_Loves_Monuments_Jury_Tool_2014 [00:46:01] mutante: I got the old files from someone here a couple of months ago, so that's covered [00:46:44] i'm not sure about the spaces in project names.. i guess it's ok? [00:46:50] looks for rules [00:47:20] mutante: if no spaces, can I have wlmjurytool2014? if you need me to move the request let me know [00:48:00] i would prefer that, lookin at existing names.. yes [00:48:05] nah, hold on [00:48:28] Is the 2014 important? Can you just have a long running project for this task? [00:48:49] bd808: my long running project last year died in April so I'm giving up on thinking these things go for long :) [00:48:55] heh [00:49:18] * bd808 migrated 4 projects and half of deployment-prep [00:49:26] it was a busy couple of weeks [00:49:27] wlmjurytools ? 
:p [00:49:35] it's true about the year [00:49:35] mutante: wlmjurytool then [00:50:40] !log wlmjurytool - project created - DMaggot is admin [00:50:41] Logged the message, Master [00:50:54] dMaggot: https://wikitech.wikimedia.org/wiki/Nova_Resource:Wlmjurytool [00:51:57] please click that red link [00:52:14] there's another form then at [00:52:17] https://wikitech.wikimedia.org/wiki/Special:FormEdit/Nova_Project_Documentation/Nova_Resource:Wlmjurytool/Documentation [00:52:36] also just recently saw the new version of it [00:54:49] mutante: thanks, added at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wlmjurytool/Documentation [00:56:43] :) [00:57:04] dMaggot: i love WLM, just the Android app i have installed from last year always says the servers currently have issues [00:57:29] mutante: there was a thread recently about the app, I think they are taking care of that [00:58:00] that sounds promising, cool [01:01:03] bd808: eh.. that stuff used to be easier back in old Mediawiki [01:01:05] The toolbox, which appears under the search bar, is a dynamic element and cannot be easily customized without the use of skinning extensions [01:01:18] JavaScript to put it inside the toolbox.. [01:01:34] but it can be a new section .. [01:04:03] I changed my instance's puppet config to be webserver::php5-mysql, do I need to wait 'til puppet applies that config? or should I do that manually? [01:04:09] mutante: Can you add to the Help section? [01:04:29] bd808: reload :) [01:04:35] dMaggot: You can `sudo puppet apply --test --verbose` to force it [01:04:42] https://wikitech.wikimedia.org/w/index.php?title=MediaWiki%3ASidebar&diff=124522&oldid=71445 [01:05:21] dMaggot: have patience, i see puppet breakage on the horizon [01:05:30] (also recently tried to use those webserver classes :p) [01:05:49] mutante: Nice.
I purged the main page so it would show up [01:05:56] mutante: couldn't possibly be: when I tried configuring the other project (wikiviajesve) we came across every possible error so they should all be fixed [01:06:01] * dMaggot said his famous last words [01:06:29] dMaggot: :) [01:06:40] mysql is mariadb now btw :) [01:08:30] bd808: ouch, we also have this broken link for admins: [01:08:44] "Project requests" which would show pending requests but [01:08:50] https://wikitech.wikimedia.org/wiki/Special:Ask/-5B-5BCategory:New-20Project-20Requests-5D-5D-5B-5BIs-20Completed::No-5D-5D/-3FProject-20Name/-3FProject-20Justification/-3FModification-20date/format%3Dbroadtable/sort%3DModification-20date/order%3Dasc/headers%3Dshow/searchlabel%3DOutstanding-20Requests/default%3D%28No-20outstanding-20requests%29/offset%3D0 [01:09:00] a semantic search that doesnt find stuff [01:09:12] "nice" URL btw :) [01:09:34] maybe that is why it's a category now :p [01:11:17] The "Is Completed" attribute doesn't seem to exist until it is set to true [01:11:42] https://wikitech.wikimedia.org/w/index.php?title=New_Project_Request&oldid=124523 [01:11:46] and did that too [01:14:09] completed requests are also in the category.. but there are only 58 total ? [01:14:29] bd808: if I run that command I get Error: Could not parse application options: invalid option: --test [01:14:35] bd808: without --test it just hangs forever [01:14:48] dMaggot: puppet agent -tv [01:14:49] bd808: well, tbh I didn't wait forever, just several minutes [01:15:25] dih. 
wrong command `sudo puppet agent --test --verbose` agent, not apply [01:15:31] *doh [01:17:03] bd808: got this http://pastebin.com/t9Lzbf3H [01:17:13] bd808: curiously that's the same error I got a couple of months ago when setting up the other project [01:17:44] dMaggot: it actually finished the run though, that's pretty good :p [01:17:54] about the mounting thing, i got the same thing today [01:17:58] That's the nfs server fighting you [01:18:20] It *usually* will fix itself in 10-15 minutes [01:18:40] andrew said "race condition" [01:18:44] and rebooting instance [01:18:49] I'm surprised you could log in [01:18:55] :) [01:19:12] it's just the /data/project [01:19:17] Yeah. It's been a problem since the move to eqiad. C.oren tried to track it down several times [01:19:19] you can survive without it [01:19:26] if you dont have multiple instances [01:19:42] The log shows /home failing too [01:19:52] arr, then.. [01:19:56] which is why I was surprised he was logged in [01:20:10] eh, yes [01:20:17] oh. do you still have to check the stupid hidden checkbox for new projects [01:20:22] to enable shared storage [01:20:32] when i made another one today and was asked the same [01:20:35] it was already checked [01:20:41] without me selecting it [01:21:25] I hate that hidden screen [01:36:53] I remember whoever fixed that for me the last time had to check a box [01:39:03] bd808: shared storage is now checked by default, I think [01:39:07] and isn't the purpose of the puppet classes to automatically install things? I still don't have /var/www [01:39:37] that would be created by the package [01:39:46] which in turn should be installed from puppet, yes [01:41:28] ok, I have /etc/php5 but I don't have /etc/apache2 or /etc/http* [01:41:31] i'm afraid webserver:: classes are broken [01:41:36] so I don't know what was created [01:42:04] dMaggot: you can try apache::site, that is from the newer module [01:42:22] but it should be added to puppet groups ..
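The corrected command from the exchange above, wrapped in a small throwaway helper (the function name is my own invention; only the `puppet agent` line itself comes from the log):

```shell
# Force an immediate Puppet run on a labs instance. Note "agent",
# not "apply": apply expects a local manifest and has no --test
# option, which is why the earlier attempt failed with
# "invalid option: --test".
force_puppet_run() {
  sudo puppet agent --test --verbose   # commonly abbreviated: -tv
}
```

`puppet agent -tv`, as suggested a few lines earlier, is the same invocation in short form.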
[01:42:28] dMaggot: mutante are you trying to create a LAMP stack in labs? [01:42:30] mutante: well, before trying to change my classes, I would like to know how to fix the mount error [01:42:33] YuviPanda: no [01:42:36] oh [01:42:40] ok then, carry on :) [01:42:46] (that'd be labslamp) [01:42:58] YuviPanda: i rest my case :p [01:43:04] <^d> bd808: If it's got a template we can boost that template. [01:43:06] the puppet groups that are there dont work [01:43:12] and no others have been added [01:43:25] YuviPanda: I am trying to create a LAMP stack, yes [01:43:43] dMaggot: tried role::labs::lamp? [01:43:45] err [01:43:49] role::lamp::labs [01:43:50] apparently [01:44:33] ^d: It uses "{{#formlink:form=New Project Request|link text=Requesting a New Project}}" [01:44:44] <^d> That's not a template :p [01:44:51] smw stuff [01:44:56] oh I see, I guess the changes to webserver:: broke labslamp [01:45:01] lol [01:45:30] * bd808 nominates YuviPanda to fix all the neglected roles [01:45:40] And/or kill many of them off [01:45:49] mutante: heh, easier than reading backscroll? :) [01:46:03] bd808: yeaaah, some need killing (mediawiki_singlenode), some need fixing (LAMP) [01:46:15] YuviPanda: no, i did not even try that role.. it's just that.. none work [01:46:18] YuviPanda: ok, role::labs::lamp did the trick for many things [01:46:25] dMaggot: w00t [01:46:28] YuviPanda: it didn't install php5-cli though, but I guess that's another class [01:46:34] dMaggot: yeah, that's not a default [01:46:41] and I'm still getting the mount errors [01:46:54] well, I was getting mount errors as well, but everything *was* mounted... [01:46:58] i guess the actual solution is to just use apache::site now [01:47:02] in your own module [01:47:12] and then apply your own role, and only that [01:47:18] was the link to create a new project added to the toolbar? [01:47:22] yes [01:47:22] should I be seeing it now? [01:47:32] yes, minus caching [01:47:40] oh oh, it is Request Project in Help, isn't it? 
[01:48:19] yeah, it is [01:52:18] dMaggot: yes, adding into tools turned out to be more complicated [02:02:32] I created a web proxy for that instance at wlmjurytool2014.wmflabs.org and I get timeouts when trying to access it from my browser [02:02:42] the odd thing is that the server answering is an nginx server [02:02:48] makes any sense? [02:03:29] dMaggot: it does make sense, the proxy is nginx [02:03:31] YuviPanda: ^ [02:03:59] dMaggot: are you sure Apache is running ? the proxy thing worked just fine for me earlier today [02:04:02] yeah dMaggot, that means nginx (which is *.wmflabs.org) is trying to hit your instance and getting a timeout [02:05:04] mutante: apache is running, but I am also getting timeouts from bastion to that instance, so probably a misconfiguration in my apache [02:06:45] dMaggot: check the error log [02:07:05] dMaggot: aaah, also check your 'security groups' on the left sidebar on wikitech, and open 'port 80' to tcp traffic on the default one [02:07:16] YuviPanda: that should be, yes [02:07:28] wait, i didnt do that either [02:07:58] that would only be needed if it was NOT behind the proxy? [02:08:26] if it was [02:08:41] mutante: since proxy is in a different project in labs, and by default wouldn't have access [02:08:49] you can open it up to 10.0.0.0/24 [02:09:23] ah, right [02:09:36] wow, adding a rule is the most cryptic thing I have found in wikitech [02:09:37] of course i'm still in the same project.. so that's that [02:09:39] but it did the trick [02:10:05] indeed, i see a default page [02:10:35] do you need a gerrit project now ?:) [02:10:57] mutante: if people wanted to contribute code to this tool, I would be a happier person indeed [02:11:37] dMaggot: gotta make them see they can contribute, i can help adding it later..
[02:11:50] make it so that puppet installs the apache config that is [02:12:01] and then the code itself if you wanted to [02:12:09] mutante: the code is hosted on gitorious [02:12:36] that's what i meant to change then, and get it on gerrit :) [02:12:44] cool that it's already git though [02:20:12] mutante: yeah, probably after this round it should go to gerrit [02:20:34] well, this was enough bumping cars for a day, I'll keep working on this tomorrow [02:20:37] thanks for all the help! [02:21:01] dMaggot: yw, cu later then [02:32:15] YuviPanda: eh, earlier i tried to switch the backend of my 'wikistats' proxy [02:32:20] and it appeared to work just fine [02:32:27] now i see it's down [02:33:03] oh, no, it's me :) [02:33:35] i was using socks proxy/ssh tunnel.. nevermind [02:33:42] mutante: :) [02:39:50] 3Wikimedia Labs / 3wikistats: Make a stats table for 85 W3C wikis - 10https://bugzilla.wikimedia.org/41023#c9 (10Daniel Zahn) 3Wikimedia Labs / 3tools: tools-db is down. - 10https://bugzilla.wikimedia.org/69828 (10bgwhite) 3UNCO p:3Unprio s:3critic a:3Marc A. Pelletier I'm getting this error for the past ~30 minutes: Could not connect to database: Host '10.68.17.174' is blocked because of many connection errors; unblock w... [08:02:06] good morning. [08:12:15] I have a labs account now, created a tool account, but become lalm (where lalm is the tool I created) fails after login. Whats wrong? [08:12:57] it says "sorry, a password is required to run sudo" as a response to "become lalm" [08:13:31] try logging out and log back in? [08:14:24] thanks. tried that before, but didn't work. Perhaps I was too fast ;) [08:14:29] works now, thanks [08:40:18] do I need additional access rights to be able to access the osm databases? 
[08:41:38] I don't see anything about OSM on https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help, so Coren is the person to ask [08:41:54] I followed the howto on https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Configuring_PGAdmin_for_OSM_access [08:42:07] but connecting to the server fails [08:42:20] (03PS1) 10Merlijn van Deen: Add 'log out and in again' when sudo fails [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155506 [08:42:22] (03CR) 10jenkins-bot: [V: 04-1] Add 'log out and in again' when sudo fails [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155506 (owner: 10Merlijn van Deen) [08:42:36] Coren: any idea what's wrong? [08:43:32] wait, there already is a note in become? [08:43:51] (03Abandoned) 10Merlijn van Deen: Add 'log out and in again' when sudo fails [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155506 (owner: 10Merlijn van Deen) [08:45:18] Coren: scfc added a more user-friendly error to become a month ago, but it's not on tools-login yet -- maybe puppet needs a kick/ [11:37:50] 3Wikimedia Labs / 3tools: tools-db is down; need to flush hosts - 10https://bugzilla.wikimedia.org/69828 (10Andre Klapper) [12:55:37] 3Wikimedia Labs / 3tools: tools-db is down; need to flush hosts - 10https://bugzilla.wikimedia.org/69828#c1 (10metatron) 5UNCO>3NEW It is node: tools-webgrid-04 which is currently blocked by tools-db. All other nodes seem to be ok atm. A db-admin has to issue the FLUSH HOSTS command: But prior to that,... [14:31:32] !log integration rebased puppetmaster [14:31:35] Logged the message, Master [16:40:05] hi, how can I add a debug port to my tool. So that the tool work on that port even when I disable the webservice. :-) [18:07:56] wants the puppetlabs mysql module in labs [18:28:32] Coren: Mind pushing the button on this issue ? 
https://bugzilla.wikimedia.org/show_bug.cgi?id=69828 [18:40:22] 3Wikimedia Labs / 3tools: tools: Grid spawned multiple instances of a "-once -continues" type job - 10https://bugzilla.wikimedia.org/69867 (10Krinkle) [18:40:25] 3Wikimedia Labs / 3tools: tools: Job queue spawned multiple instances of a "-once -continues" type job - 10https://bugzilla.wikimedia.org/69867 (10Krinkle) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier Both on tools.ecmabot and tools.wmfdbbot, the main application process was running multiple copies... [18:41:57] (03PS1) 10Tim Landscheidt: Cut a clean release 1.0.10 [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155595 [18:44:21] 3Wikimedia Labs / 3tools: Add locking to jstart so simultaneously started jstart / jsub -once calls don't create duplicated tasks - 10https://bugzilla.wikimedia.org/60862#c1 (10Tim Landscheidt) *** Bug 69867 has been marked as a duplicate of this bug. *** [18:44:22] 3Wikimedia Labs / 3tools: tools: Grid spawned multiple instances of a "-once -continues" type job - 10https://bugzilla.wikimedia.org/69867#c1 (10Tim Landscheidt) 5NEW>3RESO/DUP *** This bug has been marked as a duplicate of bug 60862 *** [18:53:07] (03CR) 10Tim Landscheidt: [C: 032 V: 032] Cut a clean release 1.0.10 [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155595 (owner: 10Tim Landscheidt) [19:26:47] !log deployment-prep Deleted salt keys for retired apache0[12] minions [19:26:50] Logged the message, Master [19:27:15] !log deployment-prep Killed rogue salt-master process on deployment-bastion [19:27:18] Logged the message, Master [19:27:49] !log deployment-prep Restarted salt-minion on deployment-jobrunner01 & deployment-videoscaler01 [19:27:52] Logged the message, Master [19:28:31] !log deployment-prep Deployed cherry-pick of Iea7217a for scap [19:28:33] Logged the message, Master [19:29:50] 3Wikimedia Labs / 3tools: Add locking to jstart so simultaneously started jstart / jsub -once calls don't create duplicated tasks - 
10https://bugzilla.wikimedia.org/60862#c2 (10Krinkle) Hm.. if bug 69867 is a dupe of this, how come it was jstarted twice in such a short period of time? It runs every 2 or 5 mi... [19:33:27] Hi, whoever modified /usr/local/bin/become has made a typo on line 27 $prefiX [19:33:59] hi there ! I’m afraid I have a problem accessing my tool instance in Labs [19:34:03] when I type « become totoazero » (my toolname), I get « become: no such tool 'totoazero’ » [19:34:18] Toto_Azero: new tool? if so, have you logged out and in again? [19:34:19] though there still is the home folder in /data/project/totoazero/ [19:34:24] valhallasw`cloud: /usr/bin/become is broken :( [19:34:29] valhallasw`cloud: no no [19:34:41] oh that’s the reason :) [19:35:04] Toto_Azero / jimmyxu: try /home/valhallasw/become [19:35:14] that's a version I was playing with this morning [19:35:54] sudo -niu tools.* works [19:36:29] valhallasw`cloud: doesn’t seem to work for me, looks like it runs but doesn’t do anything [19:36:55] * jimmyxu had the feeling of losing all that's not backed up [19:36:58] jimmyxu: yes, this one works fine thx :) [19:38:08] (03PS1) 10Merlijn van Deen: become: fix typo in existence check [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155611 [19:38:26] Coren / scfc_de / etc ^ [19:40:06] 3Wikimedia Labs / 3tools: Add locking to jstart so simultaneously started jstart / jsub -once calls don't create duplicated tasks - 10https://bugzilla.wikimedia.org/60862#c3 (10Tim Landscheidt) I assume (= my reasoning for flagging bug #69867 as a dupe) that network congestion and/or load on the client, the... [19:40:57] valhallasw`cloud: Sorry, will do that immediately. [19:41:24] thanks :-) [19:41:56] (03CR) 10Tim Landscheidt: [C: 032 V: 032] become: fix typo in existence check [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155611 (owner: 10Merlijn van Deen) [19:42:11] valhallasw`cloud: Need to cut another release, though, so takes a few minutes. 
(Though I don't understand how I could miss that, I tested that.) [19:42:50] bitrot! [19:43:19] (and although I mean that jokingly, it /is/ just a single bit that's wrong, so it could have a hardware cause) [19:47:15] (03PS1) 10Tim Landscheidt: Cut release 1.0.11 [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155613 [19:47:37] !log deployment-prep Killed hung salt-call and started salt-minion on deployment-cache-bits01 [19:47:39] Logged the message, Master [19:47:57] !log quarry upgraded all text and varchar columns to utf8 [19:47:58] Logged the message, Master [19:50:14] !log deployment-prep Killed dozens of grain-ensure calls and started salt-minion on deployment-cache-mobile03 [19:50:16] Logged the message, Master [19:52:20] (03CR) 10Tim Landscheidt: [C: 032 V: 032] Cut release 1.0.11 [labs/toollabs] - 10https://gerrit.wikimedia.org/r/155613 (owner: 10Tim Landscheidt) [19:52:57] !log deployment-prep Salt minions are broken all over beta. Hung grain-ensure calls, hung test.ping calls, downed minions [19:53:00] Logged the message, Master [19:53:49] twentyafterfour: Do you have some time to walk through the beta hosts and fix the messed up state of salt? [19:54:04] ok [19:54:07] bd808: sure [19:55:04] I've been doing: ps ax|grep grain-ensure|awk '{print $1}'|sudo xargs kill; ps ax|grep salt-call|awk '{print $1}'|sudo xargs kill; sudo service salt-minion start [19:55:16] boring [19:55:31] !log deployment-prep Fixed salt on deployment-memc02 [19:55:33] Logged the message, Master [19:55:50] pulls up my handy dandy hah [19:56:06] handy dandy whiteboard of beta [19:56:19] or is there a list of hosts I should work from? [19:56:21] twentyafterfour: If you run `sudo salt-run manage.down` on deployment-salt it will give you a list of hosts that are broken [19:56:42] But unhelpfully this list is the openstack node names [19:57:21] so you have to translate them.
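A possible tidier form of the per-host cleanup bd808 quotes above, as a sketch (the function name is invented, and `pkill -f` is swapped in for the `ps|grep|awk|xargs` pipeline since it matches full command lines and avoids grep matching its own process):

```shell
# Sketch of the repeated per-host repair: kill hung salt helper
# processes, then restart the minion. "|| true" keeps going when
# no matching process exists (pkill exits 1 in that case).
fix_salt_minion() {
  sudo pkill -f grain-ensure || true   # hung grain-ensure calls
  sudo pkill -f salt-call    || true   # hung salt-call processes
  sudo service salt-minion start
}
```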
I paste them into the url https://wikitech.wikimedia.org/wiki/Nova_Resource:I- [19:57:44] Then you can see the Instance Name on the wikitech page [19:57:49] I'm sure there is a better way [19:58:25] valhallasw`cloud: No, that wasn't just a bit that flipped: https://gerrit.wikimedia.org/r/#/c/147096/4..6/misctools/become [19:59:22] scfc_de: ah :-) [20:04:04] Okay, I ran Puppet manually on -dev and -login, and it seems to work now. [20:05:01] scfc_de: indeed. thanks [20:05:14] !log tools Deployed release 1.0.11 of jobutils and miscutils [20:05:17] Logged the message, Master [20:12:05] !log deployment-prep List of broken salt minions can be obtained with `sudo salt-run manage.down` on deployment-salt [20:12:07] Logged the message, Master [20:15:50] Anybody who is using mediawiki vagrant here? [20:16:44] bd808: http://fab.wmflabs.org/P152 [20:16:47] !vagrant [20:18:17] bd808: you wanna start at one end and I'll start at the other? [20:18:40] bd808: or should I further automate it so we don't have to be bored? [20:19:51] bd808: I don't have access to some of (most of?) these nodes [20:20:08] twentyafterfour: what!!! [20:20:14] I can fix that [20:21:09] twentyafterfour: you are a project admin for the whole project [20:21:21] I don't know ... [20:21:35] and in the root sudoers group [20:23:49] how do I tell? I'm looking in the wiki ... [20:24:33] I'm "under_NDA" [20:24:47] yeah. that's the root sudoer group [20:25:36] project membership is on https://wikitech.wikimedia.org/wiki/Special:NovaProject and you have the "projectadmin" role [20:27:43] bd808: my fault, agent forwarding wasn't set up right [20:32:07] twentyafterfour: excellent [20:32:09] bd808: salt-minion failed to start [20:32:42] twentyafterfour: debug it! Look for other hung python/salt commands [20:32:56] yep that's what I'm doing [20:33:13] should salt master be running as well? [20:33:29] on say deployment-memc03 [20:34:34] no.
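The translation step bd808 describes (minion ID pasted into a wikitech URL) can be scripted; this helper is hypothetical, but the URL scheme is the one quoted above:

```shell
# Build the wikitech resource URL for an i-xxxxxxxx minion ID as
# printed by `sudo salt-run manage.down`; the wiki page titles use
# a capital "I-" prefix, so strip the lowercase one and re-add it.
minion_url() {
  printf 'https://wikitech.wikimedia.org/wiki/Nova_Resource:I-%s\n' "${1#i-}"
}

minion_url i-00000103.eqiad.wmflabs
# → https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000103.eqiad.wmflabs
```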
Just salt-minion [20:35:07] but I found some other hosts that were running rogue salt-master instances [20:35:37] I think I may have typoed a command to restart minions at some point and made everything crazy [20:35:49] some point being god knows how long ago [20:36:19] twentyafterfour: are you working from the top of the list down? [20:36:25] I can do a few from the bottom [20:36:49] bd808: yeah top down [20:37:04] ok so stop the master? that seems to be the problem, the minion is unable to authenticate the master [20:37:14] local master is overriding the normal one I guess [20:37:17] twentyafterfour: Yeah kill the master with fire [20:37:42] master should only be on the deployment-salt host [20:39:07] bd808: if you care: the script I used to generate the list: http://fab.wmflabs.org/P153 [20:39:22] nice [20:40:49] ok the master public key is still not authenticating [20:40:56] should I remove it as I'm told in the logs? [20:41:48] ah. an always broken node. Just found one on my end of the list too [20:42:06] twentyafterfour: sudo rm /etc/salt/pki/minion/minion_master.pub [20:42:25] Often missed step of https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Converting_a_host_to_use_local_puppetmaster_and_salt_master [20:43:29] bd808: that's what the log suggested, just wanted to be sure that was ok [20:44:11] salt-minion start/running, process 9995 [20:44:15] yay [20:46:58] !log deployment-prep Started salt-minion on deployment-cxserver01 [20:47:00] Logged the message, Master [20:47:44] !log deployment-prep Started salt-minion on deployment-memc03 [20:47:47] Logged the message, Master [20:48:41] !log deployment-prep Started salt-minion on deployment-cache-text02 [20:48:43] Logged the message, Master [20:48:57] !log deployment-prep Started salt-minion on deployment-db2 [20:48:59] Logged the message, Master [20:49:51] !log deployment-prep Started salt-minion on deployment-memc05 [20:49:54] Logged the message, Master [20:50:54]
bd808: deployment-parsoid04 is timing out.. I guess the node is down completely [20:51:17] It may be. [20:51:35] * bd808 looks in wikitech [20:52:25] twentyafterfour: console looks bad -- https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=deployment-prep&instanceid=22e66278-971f-4ebd-bf89-aad6ac4ebae5&region=eqiad [20:52:52] No messages since March? [20:53:25] wth [20:53:50] twentyafterfour: I got logged into it [20:54:22] !log deployment-prep Started salt-minion on deployment-parsoid04 [20:54:25] Logged the message, Master [20:54:46] !log deployment-prep Started salt-minion on deployment-memc04 [20:54:48] Logged the message, Master [20:55:18] !log deployment-prep Started salt-minion on deployment-cache-upload02 [20:55:20] Logged the message, Master [20:56:59] !log deployment-prep Started salt-minion on deployment-analytics01 [20:57:01] Logged the message, Master [20:57:43] !log deployment-prep Started salt-minion on deployment-elastic04 [20:57:46] Logged the message, Master [20:58:17] !log deployment-prep Started salt-minion on deployment-elastic03 [20:58:19] Logged the message, Master [20:58:47] !log deployment-prep Started salt-minion on deployment-elastic02 [20:58:49] Logged the message, Master [20:59:16] !log deployment-prep Started salt-minion on deployment-eventlogging02 [20:59:18] Logged the message, Master [20:59:25] !log deployment-prep Started salt-minion on deployment-elastic01 [20:59:27] Logged the message, Master [21:00:00] !log deployment-prep Started salt-minion on deployment-db1 [21:00:03] Logged the message, Master [21:00:46] !log deployment-prep Started salt-minion on deployment-fluoride [21:00:48] Logged the message, Master [21:01:15] !log deployment-prep Started salt-minion on deployment-upload [21:01:16] Logged the message, Master [21:01:46] !log deployment-prep Started salt-minion on deployment-redis01 [21:01:49] Logged the message, Master [21:01:59] bd808: that's it I think [21:02:38] no more nodes reported
as down [21:02:43] want to run your list generator again and get the list of ones that didn't really start right? [21:03:52] hmmm... just dead hosts left in the list? [21:04:24] I'm not getting anything in the list [21:05:24] There are 8 i-* names left but they may all be hosts that we have deleted [21:05:56] Yeah -- https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000103.eqiad.wmflabs [21:06:27] I can clean those keys out from the master [21:08:10] ok cool [21:08:47] twentyafterfour: Thanks for your help :) [21:09:01] bd808: no prob [21:09:17] good learning, and wrote a slightly useful script [21:23:02] Those salt-minions are supposed to run on all instances? I. e., they should be monitored? [21:35:16] scfc_de: Probably, yes [21:35:24] yes they should be running [21:35:32] probably we should monitor [21:36:02] labs monitoring is a work in progress I think [21:36:24] beta is its own special hell [21:36:55] I noticed because we have been having issues doing trebuchet deploys in beta [21:37:08] i guess instead of checking for minion process on each node, it would be more effective to run something on * from the salt master [21:37:12] and then check for errors [21:37:30] bd808: scfc_de yeah graphite for labs awaiting some network config by someone else. We can also setup icinga checks for projects with NDA roots easily [21:38:18] mutante: I discovered today that `sudo salt-run manage.down` will tell you the minions that the master can't reach [21:38:42] bd808: :) cool, sounds ideal for monitoring [21:38:53] But it seems to be flaky [21:39:05] prone to network issues or something [21:39:22] One run will say all of the hosts are dead [21:39:28] the next will only list one [21:39:35] such is salt [21:39:58] what if we run an actual command instead.. like "date" [21:40:04] or uname or whatever [21:45:46] mutante: The thing is that you don't get errors from hosts that don't respond.
You just get silence and salt is ok with that [21:46:12] I think you can ask what hosts should have responded though [21:46:35] which I believe is what salt-run manage.down is doing [21:46:57] comparing the list of who should respond to the list of who did respond [21:49:17] !log deployment-prep Trebuchet happier after all the salt-minion restarts; still have deleted hosts showing in the expected minion list for scap deploys [21:49:20] Logged the message, Master [23:32:55] How can I find out why the webservice was stopped for my tools? Is there an error log? [23:35:49] Found it. PHP memory error. :|
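To close the loop on the `salt-run manage.down` discussion above: the "who should respond vs. who did respond" comparison bd808 describes is just a set difference, illustrated here with made-up minion IDs and throwaway file names:

```shell
# expected = minions that should answer (accepted keys);
# responded = minions that actually answered a ping.
printf '%s\n' i-00000101 i-00000102 i-00000103 | sort > expected.txt
printf '%s\n' i-00000101 i-00000103 | sort > responded.txt
# comm -23 prints lines unique to the first (sorted) file,
# i.e. the minions that are down.
comm -23 expected.txt responded.txt
# → i-00000102
```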