[09:04:11] $ ssh jzerebecki-test2.eqiad.wmflabs
[09:05:00] 2 lines: if you have problems\nPermission denied (publickey).
[09:05:08] worked fine yesterday
[09:05:14] 2nd time this happens,
[09:05:40] that an instance works for me for a day and then is broken the next day.
[09:06:25] no need to salvage them as I don't have anything important on them, but
[09:07:15] Coren, andrewbogott_afk: any idea what causes that and how to not have that happen in the future?
[09:08:15] both were puppetmaster-self; also i made sure this time that puppet was disabled on that instance.
[09:41:39] !log bots upgrading wm-bot to latest version, let's hope it's gonna work :o
[09:41:39] Logged the message, Master
[09:52:28] <_joe_> !log deployment-prep cherry-picked I6ec53da483bebfa375eba2383cbf60123ff1ce26, it work
[09:52:31] Logged the message, Master
[09:52:33] <_joe_> *s
[09:52:35] <_joe_> damn
[09:59:34] <_joe_> btw, beta has several broken symlinks, I won't touch anything there
[12:52:43] jzerebecki: Did you let puppet run correctly from the normal tree /before/ you switched them to puppetmaster::self?
[13:07:47] Coren: yes, for both instances. and logging in with ssh worked still after many puppet runs with the -self variant.
[13:10:21] jzerebecki: AFAICT, it's your LDAP settings that get mangled.
[13:11:31] Specifically, your nslcd gets pointed to uri ldap://127.0.0.1/
[13:12:06] Though forcing a puppet run now has fixed it.
[13:13:34] (puppetmaster::self is a finicky and brittle thing that really shouldn't be used)
[13:18:05] nscd caches passwd entries from LDAP up to an hour; broken ldap may not show immediately.
[13:35:03] Coren: thank you. yep according to puppet reports that is exactly what happened. though i wonder how that happened. especially as another run with the same state of the puppet.git fixed it. for things that puppet-compiler can't do I don't know any better way than that to test patches for operations/puppet.git, do you?
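A minimal sketch of a check for the symptom described at 13:11:31 (nslcd pointed at `uri ldap://127.0.0.1/`). It assumes the configuration is plain text with `uri <URI> [<URI> ...]` lines; the function names and the sample text are hypothetical, and the log does not say which file or path was inspected on the instance.

```python
from urllib.parse import urlsplit

# Hosts treated as "pointing at the local machine" for this check.
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def nslcd_uris(conf_text):
    """Return every URI listed on `uri` lines of an nslcd.conf-style text."""
    uris = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0] == "uri":
            uris.extend(parts[1:])
    return uris


def loopback_uris(conf_text):
    """Return the configured URIs whose host part is a loopback address."""
    bad = []
    for uri in nslcd_uris(conf_text):
        host = urlsplit(uri).hostname  # None when the value has no host part
        if host in LOOPBACK_HOSTS:
            bad.append(uri)
    return bad


if __name__ == "__main__":
    sample = "# hypothetical sample\nuri ldap://127.0.0.1/\nbase dc=example,dc=org\n"
    print(loopback_uris(sample))
```

Because nscd may keep serving cached passwd entries for a while, a check like this looks at the configuration itself rather than at whether lookups still succeed.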
[13:37:12] according to the reports puppet didn't change that file to localhost and I didn't do that manually.
[13:37:44] And yet, something clearly did. Huh.
[13:51:40] <_joe_> the puppet oompah loompahs probably
[13:52:53] <_joe_> btw, I do remember having issues with an nscd package update in labs previously
[13:55:05] nscd just caches, it wouldn't be messing with the LDAP config. For that matter, it wouldn't even know about it strictly speaking; there's the entire resolver between the two.
[13:57:43] <_joe_> Coren: it was pointing to the wrong ldap master
[13:57:48] <_joe_> or something
[13:57:59] <_joe_> I don't honestly remember
[14:22:48] YuviPanda: ping! having some issues with your mwoauth lib
[14:23:11] YuviPanda: https://tools.wmflabs.org/paste/view/3dae6474
[14:24:31] YuviPanda: "requests.utils.to_native_string is available since requests 2.0.0" -> currently installed: ii python-requests 1.2.3-1 Python HTTP for Humans.
[14:27:56] !bot What's the Answer to The Ultimate Question of Life, the Universe, and Everything?
[14:27:56] http://meta.wikimedia.org/wiki/WM-Bot | troubleshooting bots -> https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation#Troubleshooting
[14:28:16] no
[14:28:46] !What's the Answer to The Ultimate Question of Life, the Universe, and Everything?
[14:31:27] 42
[15:12:59] Wikimedia Labs / deployment-prep (beta): Broken links make bits answer with 403s in beta - https://bugzilla.wikimedia.org/70445 (Giuseppe Lavagetto) NEW p:Unprio s:normal a:None On beta, after a cache purge in bits, all static assets respond with 403 forbidden, that is due to the files bein...
[15:17:36] hedonil: lol I think it gave you reasonable answer haha
[15:20:53] petan: yeah, from bot's perspective an arid, straightforward answer
[15:21:21] which makes him look like a ci ;-)
[15:31:36] petan: btw. Do you have any performance indicators for your numerous bots? messages per bot per day or something
[15:31:55] you mean wm-bot?
[15:32:25] petan: all bots you herd
[15:32:28] no
[15:35:00] petan: too bad, as they don't make any wiki-edits, I have no counter for them atm -> http://tools.wmflabs.org/directory/?view=bot
[15:35:12] * hedonil will find some metrics, though
[15:37:45] !channels
[15:37:45] | #wikimedia-labs-nagios #wikimedia-labs-offtopic #wikimedia-labs-requests
[15:38:02] hmm
[15:38:58] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-labs.htm
[15:38:58] @info
[15:39:17] ??
[15:39:57] is ssh to labs instances currently a bit laggy?
[15:41:22] half frozen most of time from Europe
[15:41:24] jzerebecki: It's snappy from here (North-American East coast)
[15:41:41] mosh does tend to help a lot from EU
[15:42:40] oh, labs has mosh installed?
[15:42:55] that just was more than a 10s freeze. anyway is fine again.
[15:42:58] greg-g: All the labs bastions do.
[15:43:11] Coren: neat, but not prod, 'cuz security.
[15:43:22] usually it is very snappy for me from eu
[15:47:04] greg-g: also not prod because mosh doesn't support proxycommand or ssh agent forwarding
[15:48:40] greg-g: also, did you get the icinga alert? :)
[15:48:48] am going to make puppet flap on bastion, see if there's an alert
[15:49:56] !log deployment-prep deliberately fucking up puppet to see if icinga complains
[15:50:00] Logged the message, Master
[15:50:35] I think mosh is on the prod bastion...
[15:50:46] bd808: it is, but all you can do is get to bastion
[15:50:48] and that's it
[15:51:12] bd808: also see my log about betalabs puppet
[15:51:23] I saw and I approve
[15:51:26] cool
[15:51:32] doing a git reset would fix it, fwiw
[15:51:36] I just added random text into site.pp
[15:52:54] bd808: wtf, puppet is stupid
[15:53:00] it doesn't count syntax errors as 'fail'
[15:55:56] wat
[15:56:03] now it doesn't write last_run_summary properly
[15:56:09] if it failed on a syntax error?!
[15:57:50] ok that is useless
[15:57:51] YuviPanda: I did!
[15:58:01] doesn't give me errors on fails due to conflicting resources either!
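A rough sketch of the gotcha complained about above: a run that dies on a syntax error may leave a summary with no failure counts at all, so "no failures reported" is not the same as "ok". The sketch assumes the summary has already been parsed into a dict and that a healthy summary carries `resources` and `events` sections with `failed` and `failure` counters; the function name and sample dicts are hypothetical, and this is not the check that was actually deployed.

```python
def summary_status(summary):
    """Classify a parsed last_run_summary mapping as 'ok', 'failed' or 'unknown'."""
    if not isinstance(summary, dict):
        return "unknown"
    resources = summary.get("resources")
    events = summary.get("events")
    # Assumption: a run that never applied a catalog (syntax error, resource
    # conflict) leaves these sections out, so their absence must not read as "ok".
    if not isinstance(resources, dict) or not isinstance(events, dict):
        return "unknown"
    if resources.get("failed", 0) > 0 or events.get("failure", 0) > 0:
        return "failed"
    return "ok"


if __name__ == "__main__":
    # Hypothetical parsed summaries, not taken from a real run.
    print(summary_status({"resources": {"failed": 0}, "events": {"failure": 0}}))
    print(summary_status({"time": {"last_run": 1410278400}}))
```

A monitoring check built this way would alert on both `failed` and `unknown`, which covers the case where the summary is written but incomplete.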
[15:58:05] * YuviPanda kicks puppet
[15:58:52] greg-g: cool. is kinda useless now, tho. I need to add another metric now
[15:59:11] :)
[16:00:10] !log deployment-prep unfuck puppet on deployment-salt, puppet is stupid and does not properly report failed events on last_run_summary.yaml if there's a syntax error or a resource conflict. So I've to read last_run_report and do things with *that* instead now
[16:00:13] Logged the message, Master
[16:00:27] greg-g: other checks you wanted that would be easier to add?
[16:00:33] do you have that bug handy?
[16:00:55] * greg-g looks
[16:01:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=70141
[16:01:13] YuviPanda: ^
[16:02:13] Wikimedia Labs / deployment-prep (beta): Setup monitoring for Beta cluster - https://bugzilla.wikimedia.org/51497 (Greg Grossmeier) a:Yuvi Panda
[16:46:58] Wikimedia Labs / wikitech-interface: Wikitech: Performing content actions results in PHP strict warning by MWSearch outputted on the page - https://bugzilla.wikimedia.org/70436#c1 (Krinkle) Got the same warning just now when deleting a page.
[16:47:56] Wikimedia Labs / wikitech-interface: Wikitech: Performing content actions results in PHP strict warning by MWSearch outputted on the page - https://bugzilla.wikimedia.org/70436#c2 (Krinkle) Created attachment 16385 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16385&action=edit Screenshot of e...
[16:56:23] YuviPanda: Oh. I thought you would already know about those puppet status report gotchas.
[16:56:58] YuviPanda: There is something in the prod checks that deals with that. That's what causes the EPIC FAIL icinga notices
[16:57:18] But I've never looked to see how it is handled
[16:57:20] bd808: yeah, I should check.
This uses graphite/diamond to check since icinga isn't directly running on them
[16:58:06] I just fell down a different rabbit hole and I'm trying to climb back out now
[16:58:40] It was multi-facepalm worthy -- https://bugzilla.wikimedia.org/show_bug.cgi?id=70446
[17:05:57] Wikimedia Labs / deployment-prep (beta): Broken links make bits answer with 403s in beta - https://bugzilla.wikimedia.org/70445 (Greg Grossmeier) p:Unprio>High
[17:54:11] Wikimedia Labs / deployment-prep (beta): Broken links make bits answer with 403s in beta - https://bugzilla.wikimedia.org/70445#c5 (Bryan Davis) PATC>RESO/FIX Change to docroot synced and varnish cache purged with `sudo varnishadm ban req.url '~' /` on deployment-cache-bits01
[17:54:38] !log deployment-prep Purged varnish cache on deployment-cache-bits01 -- sudo varnishadm ban req.url '~' /
[17:54:41] Logged the message, Master
[17:55:55] what's wrong with geohack? https://tools.wmflabs.org/geohack/geohack.php?params=48.59_N_10.66_E
[17:59:27] Coren, your time to shine
[18:00:04] You have several options, including unblocking me
[18:09:16] Dispenser: ??
[18:14:35] Dispenser: ??
[18:33:06] There's a redirect in .lighttpd.conf (since Nov 23 2013!): ""^/geohack/(.+?)/(.+?)\?(.+)$" => "/geohack/geohack.php?language=$1&params=$2&$3"" that's not in sync with the URL. Riofstou, where did you get that URL from?
[18:34:57] Problem also without parameters: https://tools.wmflabs.org/geohack/
[18:35:27] Riofstou: No, I mean where on wiki was that link?
[18:36:16] Doesn't matter... just take the link from an article of your choice.
[18:36:29] https://en.wikipedia.org/wiki/New_York_City -> https://tools.wmflabs.org/geohack/geohack.php?pagename=New_York_City&params=40.7127_N_74.0059_W_type:city_region:US-NY
[18:36:41] Riofstou: Okay, I see.
[18:37:51] The url.redirect shouldn't match anyhow, so that's not the problem.
[18:39:13] At 15:08Z, lighttpd dumped core. I'll stop the webservice, move logs & core aside, and start anew.
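The 18:37:51 remark that the url.redirect "shouldn't match anyhow" can be checked with a short sketch. It replays the pattern quoted at 18:33:06 using Python's `re` module, with the target read as `language=$1&params=$2&$3` and lighttpd's `$1`/`$2`/`$3` written as `\1`/`\2`/`\3`. The helper name is hypothetical, the second sample URL is invented purely to show a matching case, and Python's regex engine is only assumed to agree with lighttpd's for a pattern this simple.

```python
import re

# Pattern and target copied from the redirect quoted in the log.
PATTERN = re.compile(r"^/geohack/(.+?)/(.+?)\?(.+)$")
TARGET = r"/geohack/geohack.php?language=\1&params=\2&\3"


def apply_redirect(path_and_query):
    """Return the rewritten URL if the rule matches, else None."""
    match = PATTERN.match(path_and_query)
    return match.expand(TARGET) if match else None


if __name__ == "__main__":
    # URL from the log: no second "/" after /geohack/, so the rule cannot match.
    print(apply_redirect("/geohack/geohack.php?params=48.59_N_10.66_E"))
    # Invented URL in the /geohack/<language>/<params>?<rest> shape the rule expects.
    print(apply_redirect("/geohack/en/48.59_N_10.66_E?pagename=Example"))
```

The rule requires a second path segment after `/geohack/`, which neither URL pasted in the channel has, so the redirect is not what produced the error.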
[18:39:32] access.log = 29G. Wow.
[18:40:57] YuviPanda: Before I go any further, the webservice is already stopped, but the URL says 404. Shouldn't -webproxy make that "Webservice is not enabled"?
[18:45:07] scfc_de: good point, but that might be because lighty has stopped serving but the socket is still being held open by whatever holds it open?
[18:45:21] "HGETALL prefix:geohack" => "tools-webgrid-02:4030"; so probably one of the old entries. Still, that should make it gateway error.
[18:45:27] Let's see if that port still answers.
[18:46:01] Indeed, there's a lighttpd listening there.
[18:46:50] What's the magic word to see what process is listening on port 4030 again?
[18:49:03] "sudo lsof -i :4030" => tools.copyvios
[18:49:39] Betacommand, YuviPanda: He'd sooner resign than admit any fault and reinstate my account. So it's tongue in cheek.
[18:49:49] err/
[18:49:50] ?
[18:49:53] he resigned from what?
[18:50:22] Dispenser: What in blazes are you talking about?
[18:50:47] !log tools geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"
[18:50:51] Logged the message, Master
[18:51:00] Riofstou: Does it work again for you?
[18:51:12] yes, thank you!
[18:53:29] Really Coren you're on a ban first ask questions later. Anyway I don't have time for this
[18:53:59] o_O. Yet time enough to come here and... I'm not sure what you did. *shrug*
[18:54:41] Only
[18:54:53] ...to help Riofstou with GeoHack
[18:56:11] Also, IIRC there were a lot of questions and plenty of opportunity for you to fix that before your ban.
[18:57:50] Coren: On tools-webgrid-02, as tools.geohack, I run "gdb /usr/sbin/lighttpd core.bak" => "Reading symbols from /usr/sbin/lighttpd...(no debugging symbols found)...done." => "warning: Can't read pathname for load map: Input/output error." Does that mean that no meaningful diagnosis is possible?
[18:59:26] scfc_de: Probably, though I don't remember having ever seen that error.
It might be a truncated core.
[19:30:42] Wikimedia Labs / tools: Tool Labs: Node.js and npm broken due to outdated certificate (install minor update to fix certificate) - https://bugzilla.wikimedia.org/70120#c9 (Ed Summers) Can tools labs easily be upgraded to Trusty?
[22:20:00] !log tools Stopped 12 webservices for tool "meta" and started one
[22:20:04] Logged the message, Master
[22:22:25] !log tools Deleted stale nginx entries for "rightstool" and "svgcheck"
[22:22:28] Logged the message, Master
[23:37:49] beta now also has disk space monitoring
[23:37:53] bd808:
[23:38:08] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labmon1001&service=Monitor+for+low+disk+space+on+%2Fvar+for+beta+labs
[23:38:28] oh, btw, i _could_ make icinga-wm join this channel
[23:38:33] and just output the beta stuff
[23:44:12] mutante: Maybe to #wikimedia-qa instead of here? chrismcmahon what do you think?
[23:45:02] mutante: also that's awesome and will help diagnose many dumb problems. \o/
[23:45:10] bd808: also, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dsh
[23:45:17] that is re: servers being out of sync :)
[23:45:33] that link might be useful for deployers
[23:45:57] * bd808 shakes fist at mw1163
[23:46:11] * bd808 bookmarks that search too
[23:46:17] hah, but it has a scheduled downtime
[23:46:33] and thanks for pointing it out, i wanted to see an actual broken one :)
[23:46:52] better than search for string would be proper service group
[23:47:10] but this works :p
[23:57:45] bd808: mutante go for it. I would not mind seeing icinga on any IRC channel