[09:33:29] !log deployment-prep rebased puppetmaster
[09:33:31] Logged the message, Master
[10:57:03] starts to decom virt5-11
[10:57:15] removed from puppet, stopping agent
[13:06:46] * YuviPanda looks for a Coren
[13:07:11] I'm pretty sure I saw a Coren right around here.
[13:08:38] Coren: let's look for him together!
[13:09:12] Coren: back from athens? when are we going to flip the switch to nginx today? If you tell me I can move my schedule around that
[13:10:12] !log deployment-prep rolling restart of Elasticsearch nodes in beta to make super sure it picked up new plugins
[13:10:19] Logged the message, Master
[13:10:21] YuviPanda: I have no specific preference. How are we going to go about it? The easiest way, it seems to me, is to add the class to the extant tools-webproxy then manually turn off apache/turn on nginx.
[13:10:40] Coren: sounds good. Want to do it now?
[13:11:03] Coren: oh, wait. ssl certs
[13:11:19] Coren: we'd have to update the class with pointer to the ssl cert on tools-webproxy first.
[13:11:52] * Coren nods.
[13:12:11] The cert is already on the webproxy box, in /etc/ssl
[13:12:19] right. just needs a name.
[13:12:25] and the dynamicproxy class has a param for the ssl cert
[13:13:21] !log deployment-prep done
[13:13:22] Logged the message, Master
[13:15:24] In /etc/ssl/private/tools.wmflabs.org.key and /etc/ssl/certs/tools.wmflabs.org.pem
[13:15:37] The chain is /etc/ssl/certs/RapidSSL_CA.pem
[13:16:08] Coren: hmm, it looks for .chained.pem
[13:16:11] Coren: I think I'll fix that
[13:30:52] brb phone
[13:32:24] the chain is your certificate + RapidSSL + GeoTrust
[13:32:28] puppet does a cat
[13:32:54] https://gerrit.wikimedia.org/r/#/c/126008/1
[13:33:19] https://gerrit.wikimedia.org/r/111386 3 certs to combine there to get the right chain
[13:33:38] i'd let puppet do it, there is already so much confusion around them
[14:29:30] !log project-proxy removed some users who may not have signed an NDA
[14:29:32] Logged the message, dummy
[14:34:07] andrewbogott: ... what? "may not have"
[14:34:43] Coren: I'm just being lazy, mostly.
[14:35:13] That project used to be totally different… back in the day volunteers made the pmtpa-proxy style proxies in that project
[14:35:17] now it hosts yuviproxy.
[14:37:20] Hi Coren.
[14:37:46] CP678|iPhone: hello.
[14:38:10] Coren, did you ever see that Bugzilla I opened up?
[14:38:46] CP678|iPhone: I see lots of bugs.
[14:39:43] Coren the user fields in archive and revision aren't indexed very well. They can take up to 10s of seconds when running a query. However a second run is significantly faster.
[14:41:01] Have you been using revision_userindex?
[14:41:05] Yes.
[14:41:07] and not just plain 'revision'?
[14:41:17] And archive_userindex
[14:41:58] There aren't various degrees of indexing. You're just seeing the effects of working set shift and heavy use.
[14:43:37] Not really.
[14:44:05] Coren, if I used rev_user_text the query runs faster.
[14:47:15] Coren, if you did a query on archive_userindex, you'll see the query speeds more easily.
[14:47:36] You may wish to discuss performance issues with springle
[14:47:48] @seen springle
[14:47:48] Cyberpower678: springle is in here, right now
[14:47:54] springle, ping
[16:13:23] Can someone help me with WorkBench for Mac
[16:14:50] Cyberpower678: What is that?
[16:15:12] marktraceur, never heard of MySQL Workbench?
[16:15:39] There were multiple results for "WorkBench for Mac" in DDG
[16:15:45] Cyberpower678: What issues are you having
[16:15:46] ?
[16:16:11] Cannot open SSH Tunnel: Error connecting SSH tunnel: Delete entries for the host from the ~/.ssh/known_hosts file
[16:18:22] Cyberpower678: Does normal ssh work for you to tools-login.wmflabs.org? That error message sounds like you have the fingerprints for the pmtpa hosts still in known_hosts, but then your normal ssh should croak as well.
[16:19:10] normal ssh works fine
[16:19:30] As a matter of fact I just installed MySQL Workbench on my Mac software.
[16:28:01] scfc_de, ^^^
[17:06:21] hi bd808 it's somewhat subtle but I think I might be seeing a performance problem in beta labs. ganglia seems to be toast, but icinga is showing disk space problems everywhere http://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=deployment-prep&style=detail . does any of that ring a bell?
[17:10:15] chrismcmahonbrb: The icinga alerts are saying that the monitoring plugin isn't installed, which seems like something that could/should be fixed.
[17:24:19] thanks bd808 (I got called away for a sec)
[18:25:13] YuviPanda: Sorry for the delay. Being away for >1week does that.
[18:25:22] YuviPanda: I'm ready for the switcharoo whenever you are.
[18:25:23] Coren: I understand, 'tis ok :)
[18:25:27] Coren: I am now!
[18:25:30] Coren: copy cert?
[18:25:54] Yeah, doing so now. Where exactly is your config expecting it? (And Ima copy the key, not the cert of course)
[18:26:17] Coren: moment
[18:26:37] Coren: ssl_certificate_key /etc/ssl/private/<%= @ssl_certificate_name %>.key;
[18:26:45] Coren: so /etc/ssl/private/star.wmflabs.key
[18:26:46] err
[18:26:48] star.wmflabs.org.key
[18:29:14] {{done}}
[18:29:23] Shall I reenable puppet and run?
[18:30:08] Coren: yeah!
[18:32:05] Hm. I see we'll run into issues with logs on this instance unless I tweak something in the long run.
[18:32:20] Coren: logs?
[18:32:28] Not quite related.
[18:32:32] Coren: ah, ok!
[18:32:33] Just something I noticed.
[18:32:50] you're talking about space, I guess
[18:32:54] * Coren nods.
[18:33:35] All right. Now comes the last funky issue: how do we handle currently running webclients? I /could/ just restart them all after pointing them at the right spot, but that seems annoyingly disruptive to me.
[18:33:46] s/clients/servers/
[18:33:53] Coren: I can't think of any other way, though :(
[18:34:07] Coren: only other way is to have this a different machine and then switch DNS
[18:34:19] Coren: if not we'll have to do the client restart every time we switch (if we switch again)
[18:34:25] Yeah, doesn't sound much better.
[18:35:45] Coren: so are we gonna switch this or build a new machine?
[18:35:55] I say switch.
[18:36:04] To the currently existing tools-webproxy
[18:36:15] Coren: do we have a system in place to restart the currently running lightys?
[18:36:28] I can just have the grid do that.
[18:37:20] Coren: ok!
[18:37:25] Coren: let's switch then! do you want to switch?
[18:37:57] Yeah, I should be okay. All I need to do is switch what class the webproxy is to yours, in theory.
[18:38:26] Coren: indeed. and kill apache
[18:38:30] Coren: so nginx can take up 80
[18:38:40] Coren: you should switch roles, stop apache and then force puppet run
[18:39:41] I'm actually going to run puppet first, make sure everything is okay, before I bring down apache -- that'll just make nginx fail until I start it by hand.
[18:40:23] Coren: won't installing nginx try to start it, and that step will fail because apache has 80, and that'll cause a failed dependency preventing the rest of the role from working?
[18:41:04] Possibly. I'll just bring down apache a bit earlier then. It'll still be a faster puppet run
[18:41:32] Coren: ok!
[18:45:10] 2014/04/16 18:44:59 [emerg] 8385#0: SSL_CTX_use_PrivateKey_file("/etc/ssl/private/star.wmflabs.org.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)
[18:45:13] Dafu?
[18:46:13] Coren: hmm, that makes no sense :|
[18:46:41] * Coren grabs the cert off of the general proxy.
[18:47:59] Identical. What the?
[18:49:34] * Coren curses.
[18:50:34] * YuviPanda has no idea about ssl certs, so keeps quiet
[18:50:48] Coren: hmm, isn't puppet the one that's going to munge and install the cert?
[18:50:57] It has, and it did.
[18:51:15] ah, hmm
[18:52:35] And I can confirm that I have the same cert and key that is currently on dynamicproxy-gateway
[18:52:42] Which is our general proxy.
[18:52:50] andrewbogott: mutante|away ^
[18:52:59] Coren: hmm, nginx version?
[18:53:09] should be 1.5.0, picked up from labsdebrepo
[18:53:16] What's up?
[18:54:12] i A 1.5.0-1~ppa1~precise 1500
[18:54:26] Coren: that sounds right
[18:54:39] andrewbogott: do you remember how you got the certs to behave on dynamicproxy?
[18:54:59] No, but I remember it was easy.
[18:55:00] andrewbogott: I'm hitting a brick wall on the tools-webproxy; I'm using the same cert and key as dynamicproxy-gateway but nginx refuses to start complaining of a key mismatch.
[18:55:12] Like, waited until puppet was throwing a consistent error, then copied a file over from the old proxy
[18:55:17] I don't know if I copied a cert or a key
[18:55:27] I'll have a look
[18:55:29] andrewbogott: It would have been a key.
[18:56:25] !log deployment-prep Migrating memc04 and memc05 to self master/salt {{bug|64010}}
[18:56:32] Logged the message, Master
[18:56:56] hm, puppet is still happy on dynamicproxy-gateway, so it's not a puppet regression...
[18:57:25] Well, we wouldn't know that for a fact until we restarted nginx actually.
[18:57:37] But I doubt it'd have randomly munged the certificate.
[19:00:33] I just looked at the actual key material, and it definitely does /not/ match the certificate.
[19:00:52] Beware: restarting nginx on the general proxy will probably have it fail.
[19:01:18] I'm guessing there's a mixup caused by the pre- vs post- heartbleed certs.
[19:01:34] yep, broken on the general proxy too
[19:01:38] It looks like puppet is installing the old cert.
[19:02:02] And both have the (correct, manually installed) new key.
[19:02:43] So, explain to me why you think it's the key and not the cert that was copied over?
[19:02:56] andrewbogott: Because puppet never touches the key.
[19:03:23] Coren, tools died
[19:03:36] Cyberpower678: see /title
[19:03:38] Absolutely no connection to server.
[19:03:39] Cyberpower678: I also emailed labs-l
[19:03:41] Also because -rw-r--r-- 1 root ssl-cert 1679 Apr 9 13:26 /etc/ssl/private/star.wmflabs.org.key
[19:03:52] Apr 9 = when RobH replaced.
[19:03:54] Oh. Ok
[19:04:00] * Cyberpower678 whistles.
[19:04:22] But the new cert apparently never made it into puppet.
[19:05:09] So if RobH copied over, and restarted nginx; the next puppet run will have munged it back (but nginx wouldn't care until restart)
[19:06:08] So you're saying that it /should have/ been the key, not that it /was/ the key, right?
[19:06:21] Because you're describing a process where I (or rob or whoever) might have copied over the cert and it would appear to work
[19:06:23] until a reboot
[19:07:00] andrewbogott: No, I'm saying that the key is probably okay (from the date) but that the new cert derived /from/ that key got overwritten by puppet.
[19:07:20] ok...
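Editor's aside on the "key values mismatch" error discussed above: the following is a minimal sketch, not what was run in the channel, of one way to check that a certificate and a private key belong together before restarting nginx. It uses `SSLContext.load_cert_chain` from Python's standard `ssl` module, which asks OpenSSL to load the pair and raises `ssl.SSLError` when they do not match. The function name `pair_loads` is an illustrative assumption, and the sketch assumes an unencrypted key file.

```python
import ssl


def pair_loads(cert_path: str, key_path: str) -> bool:
    """Return True if OpenSSL accepts cert_path and key_path as a matching pair.

    Returns False on any SSL-level failure, which covers a key/cert
    mismatch but also an unparsable PEM file. A missing file still
    raises OSError, so a mistyped path is not mistaken for a mismatch.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Assumes the key is not passphrase-protected.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError:
        return False
    return True
```

Called with the chained certificate and the key path quoted in the log, a `False` result would have flagged the problem without taking the proxy down.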
[19:07:45] I definitely haven't touched any of this post-heartbleed, so anything I know about how it was set up is probably not useful
[19:07:54] Does the puppet run copy a cert or generate it locally from the key?
[19:10:01] Reasonator is down as well
[19:10:06] should it be?
[19:10:39] GerardM: yes, see /topic
[19:10:52] GerardM: all of tools is down atm.
[19:11:15] andrewbogott: puppet can't sign a cert. :-) But it does build the chain.
[19:11:26] Ah! I got the right cert from Rob, and it's definitely different.
[19:11:36] * Coren fixes.
[19:11:44] So, given that we don't want to ever put the proper cert in puppet...
[19:12:00] change the puppet code to only copy the cert if the file doesn't exist, maybe?
[19:12:59] ... why would we care about putting the cert in puppet?
[19:13:21] certs are public, by definition. They are sent on every SSL negotiation. Only the key is secret. :-)
[19:13:30] ok
[19:13:35] my ignorance on this subject is boundless
[19:14:21] A cert is just the public part of the keypair, signed by some CA.
[19:15:08] * YuviPanda should read up about ssl at some point in the future
[19:16:25] Ah; puppet doesn't regenerate the chained cert if it already exists.
[19:16:56] ... and, nginx is up on tools.
[19:17:29] * Coren fixes the general proxy.
[19:18:06] hmm, 503
[19:18:51] YuviPanda: Normal; I hadn't yet restarted the webservices.
[19:18:57] Hi Yuvi
[19:19:19] fixed on dynamicproxy-gateway
[19:19:39] ah ok!
[19:19:41] * Coren restarts the webservices
[19:19:42] Qcoder00: hi
[19:20:04] YuviPanda : Can I have a word with you in en on something?
[19:20:11] Qcoder00: sure!
[19:20:25] YuviPanda: the daemon that listens to the clients for granting ports doesn't seem to be running?
[19:20:45] Coren: puppet should've auto started it. service proxylistener start?
[19:21:24] It doesn't look like there's an ensure => running because it didn't try. Starting it by hand seems to have worked, though.
[19:21:36] ah, guh. right.
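Editor's aside on the chained-cert pitfall noted above (the chain is built by concatenating the certificate with its intermediates, and is not rebuilt once the chained file exists): a hypothetical illustration in Python, not the Puppet code in use. `build_chain` and its arguments are made-up names, and the rule it applies, rewriting the chained file whenever its contents differ from the concatenation of its parts, is only an assumption about what a fix could look like.

```python
from pathlib import Path
from typing import Sequence


def build_chain(parts: Sequence[Path], chained: Path) -> bool:
    """Concatenate PEM files into `chained`; return True if it was (re)written."""
    # Make every part end with exactly one newline so that PEM blocks
    # from different files never run together on one line.
    wanted = b"".join(Path(p).read_bytes().rstrip(b"\n") + b"\n" for p in parts)
    if chained.exists() and chained.read_bytes() == wanted:
        return False  # chained file already matches its inputs
    chained.write_bytes(wanted)
    return True
```

With a content comparison instead of an "only if missing" check, replacing the server certificate would cause the chained file to be rebuilt on the next run.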
[19:25:38] Coren, you restarted the web server right?
[19:26:46] Cyberpower678: It's in progress. Should return shortly.
[19:27:34] The qmaster seems to have taken issue with my massive restart. :-)
[19:30:46] The webservices are gradually getting started (gridengine seems to be pacing the restarts) but those that have are working
[19:30:53] \o/
[19:31:00] https://tools.wmflabs.org/paste/
[19:31:18] Coren: hmm, so http://tools.wmflabs.org/paste isn't work
[19:31:19] ing
[19:31:28] Note the trailing /
[19:31:37] Coren: yeah, is that something I should fix at the nginx level?
[19:31:43] Coren: or was this previous behavior too?
[19:31:50] either way it shouldn't be returning a 503
[19:31:51] for that
[19:32:07] YuviPanda: It may be more complicated than you expect. It's only returning a 503 because the admin webservice isn't back up yet.
[19:32:14] ah, right
[19:32:15] (That's the one that handles "everything else")
[19:32:16] let's wait it out hten
[19:32:22] *then
[19:32:39] 100 webservices left to restart.
[19:32:51] (out of 251)
[19:33:07] woo
[19:34:59] 49 left
[19:35:18] Of /course/ admin will be the last. :-P
[19:35:22] hehe
[19:35:47] * Qcoder00 hears the sound of turbines starting
[19:35:49] ;)
[19:36:18] Restarting some 250 jobs is still an ordeal.
[19:36:28] Qcoder00, sounds more like jet engines. :p
[19:36:36] 8 left, admin still not there. Hah!
[19:37:00] 2 left. Guess who's missing.
[19:37:01] :-)
[19:37:09] *finally*
[19:37:59] Hm. Start*ing* not start*ed*. :-P
[19:38:06] Are we up yet?
[19:38:33] Qcoder00, my tools are. :D
[19:39:07] Ah; on the plus side this had the side effect of rebalancing the webservices between the two webgrid nodes.
[19:39:22] and so is https://tools.wmflabs.org/
[19:39:54] tools is up but running at reduced power
[19:42:19] Qcoder00: what do you mean reduced power?
[19:42:33] Slow
[19:43:39] is it slow for others too?
[19:43:52] the first ever hit to a tool might be slow, but should pick up after that
[19:43:56] Qcoder00: Some lingering slowness is expected given that every webservice has been restarted and is currently setting up. Also, first query to a lighttpd running fcgi incurs the overhead of firing up PHP itself, which is often nontrivial.
[19:44:27] https://tools.wmflabs.org/csbot/foo.php is pretty much instantaneous, as expected.
[19:51:56] Seems to be working fine.
[19:52:02] \o/
[19:56:09] Coren, dschwen wma.wmflabs.org may require some kicking still
[19:56:44] Eloquence: Hm, I haven't touched the general proxy myself. andrewbogott, did you play with it while we were sorting the certs?
[19:56:46] * Coren goes check.
[19:57:40] Coren, a few minutes ago when you said 'fixes the general proxy'...
[19:57:56] andrewbogott: Ah, post fix then; it should be okay.
[19:57:58] * Coren debugs.
[19:58:46] http://eqiadproxytest.wmflabs.org/wiki/Main_Page is working, which makes me think the proxy is ok.
[19:58:51] Eloquence: AFAICT, the issue isn't proxy-related; the proxy answers but the backend server doesn't.
[19:59:08] Also beta works.
[19:59:09] http://tools.wmflabs.org/reasonator/ still broken as well
[20:00:24] mh ok .. dschwen may be able to take a look at WMA when he's around .. it's embedded in the chrome for articles with geocoords so significant user-facing breakage
[20:00:45] Eloquence: Ah; interesting. That webservice isn't started for some reason -- and the count was 251 before restarting so it had already died for another reason. Restarting it.
[20:01:40] Just restarting the webservice gets /something/ up, but clearly not fully working.
[20:02:14] Eloquence: Lemme go see if I can see something obviously wrong with wma.
[20:04:22] Does anyone know what project this is?
[20:04:48] GerardM: ^
[20:05:20] Also, I think there's a dependency between it and reasonator, which might explain why the latter is also ill.
[20:05:24] Coren: it's trying to load http://tools.wmflabs.org/magnustools/resources/dist3/css/bootstrap.min.css
[20:05:26] and 404ing
[20:05:30] magnustools is started?
[20:05:47] YuviPanda: Doesn't look like.
[20:05:52] Coren: that might be it?
[20:06:03] it's loading a lot of JS from there too
[20:06:14] Coren, wma = wikiminiatlas
[20:06:37] YuviPanda: Starting it seems to have solved the reasonator issue.
[20:06:45] Coren: woot!
[20:06:50] GerardM: reasonator should be back up now
[20:08:19] http://tools.wmflabs.org/reasonator/?&q=16 shows pretty data, so it definitely looks back to full health.
[20:09:40] YuviPanda: Is there an easy way to query your proxy to figure out the instance behind an url?
[20:10:10] Coren: redis-cli, keys * should give you list. then you can view each key to figure out
[20:11:59] YuviPanda: Naice.
[20:12:15] Coren: :D
[20:12:57] Eloquence: Kicking the apache behind wma seems to have done the trick.
[20:13:45] Ah, the underlying problem seems to be a full /var
[20:15:03] Which, in turn, is caused by an exploding error log.
[20:15:11] Coren: as an added bonus, http://spdycheck.org/#tools.wmflabs.org :)
[20:15:58] Booo! "Out-of-Date SPDY Protocol Support" :-P
[20:16:16] Coren: yeah, we can easily fix that by updating the deb to 1.5.10
[20:42:45] huh
[20:42:50] Hi, I'm here!
[20:43:22] WMA works fine for me
[20:43:39] oh, slowly catching up
[20:44:27] so was wma broken or not?
[20:44:30] * dschwen is confused
[20:44:40] dschwen: it was broken, then Coren fixed it
[20:44:58] my instance was broken, or the proxy in front of it?
[20:45:37] dschwen: instance. /var was out of space
[20:46:26] ugh, shit
[20:46:34] must have been the log files
[20:46:39] yeah
[20:46:44] thanks a bunch!
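Editor's aside on the full /var above, which went unnoticed until the web server behind wma stopped answering: as a rough sketch of an early warning, assuming nothing about the monitoring actually deployed, Python's standard `shutil.disk_usage` can report how full a partition is. The helper name `percent_used` is an arbitrary choice for illustration.

```python
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use.

    Computed as used / total, so the figure can differ slightly from what
    df prints when the filesystem reserves blocks for root.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

A periodic job could call `percent_used("/var")` and complain once the value passes a threshold of its choosing, for example 90.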
[20:46:56] I've already dialed down apache logging big time
[20:47:41] dschwen: :)
[20:54:50] I see that I also have to work on my php errors
[20:55:10] I get megabytes of warnings about undefined indexes in arrays/hashes
[20:55:20] time for some if array_key_exists
[21:02:22] dschwen: :)
[21:03:03] I should probably log to /data/project/... anyhow (at least stuff like per tile rendering logs)
[21:03:43] the /var partition is too small to log that amount of data there
[21:04:13] "Web server down for a few minutes for maintenance" ..... Hmmm. Few minutes, huh?
[21:05:00] Josve05a: should be back up a while ago?
[21:07:15] Hmmm....On 2 different tools I have I have problems... https://tools.wmflabs.org/checkwiki & https://tools.wmflabs.org/citations-dev/doibot.php?edit=toolbar&slow=1&user=Josve05a&page=List_of_roads_named_after_Mahatma_Gandhi but not on any other....
[21:08:07] Josve05a: try restarting them?
[21:08:10] Josve05a: webservice restart
[21:08:51] they are not my tools, and i don't know "squat" about anything...
[21:09:44] and https://tools.wmflabs.org/xtools/ (which has all the great edit counters)
[21:10:17] Coren: ^