[00:57:27] Wikimedia Labs / tools: Upgrade awk to 4.1.1 - https://bugzilla.wikimedia.org/71273 (nejuje6tpztluvolq) UNCO p:Unprio s:normal a:Marc A. Pelletier Upgrade GNU Awk to latest 4.1.1 - many new features. Current installed version is 3.1.8 from 2010.
[01:15:15] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[01:15:15] 9338 ganglia 20 0 99368 29m 3436 S 2 0.2 1689:55 /usr/sbin/gmond --pid-file /var/run/gmond.pid
[01:15:17] Is that normal?
[01:15:24] (for a labs instance, like cvn-app5)
[01:16:00] it seems to be consistently floating at the top with a lot of vmem usage
[01:16:06] though the unit is missing
[01:17:05] USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
[01:17:06] ganglia 9338 2.0 0.1 99368 29952 ? Ssl Jul30 1689:57 /usr/sbin/gmond --pid-file /var/run/gmond.pid
[02:41:18] wikitech admins available?
[02:42:26] See recentchanges and you'll find SAL (prod) became a mess
[02:43:42] I tried to move them, but failed
[03:09:12] Hey, would someone be able to help me change my full name/username on Labs/Code Review/&c.? I changed my surname (back) a while ago.
[03:34:35] iirc you need to file a bug
[07:53:52] Wikimedia Labs / Infrastructure: Replica MySQL: Wiki ViewStats databases completely missing! - https://bugzilla.wikimedia.org/71043#c7 (Sean Pringle) We have a full backup of p50380g50769__wvs2 and p50380g50769__wvs2ds. The loading processes were paused and adjusted to avoid the blocking table locks,...
[08:50:00] still nobody with wikitech admin flag?
[10:21:36] Wikimedia Labs / tools: Upgrade awk to 4.1.1 - https://bugzilla.wikimedia.org/71273#c1 (Andre Klapper) s:normal>enhanc Thanks for taking the time to report this! Once all servers are upgraded from Ubuntu Precise to Trusty, there will be version 4.0.1 instead of 3.1.8 (see bug 63899) and we would...
[11:29:38] I will let you know when I see andrewbogott around here
[11:29:38] @notify andrewbogott
[11:36:52] <_joe_> !log deployment-prep updated hhvm to fix most bugs, also cherry-picked https://gerrit.wikimedia.org/r/#/c/162839/
[11:36:55] Logged the message, Master
[12:40:40] hi who should I ask to approve an OAuth application? One of my users has submitted a bug to bugzilla about it - https://bugzilla.wikimedia.org/show_bug.cgi?id=71257
[12:40:44] It's this one: https://www.mediawiki.org/w/index.php?title=Special:OAuthListConsumers/view/22282bcadc4a854a9d8bf5b350b8e4e9&name=&publisher=&stage=0
[12:42:43] Reedy: ^?
[12:44:53] Wikimedia Labs / tools: Can't send email from tools-exec-07, -14 or -15 - https://bugzilla.wikimedia.org/71097#c1 (Brad Jorsch) This is still going on, and I discovered tools-exec-07 has the same issue. The remaining tools-exec servers (-01 to -06 and -08 to -13) are not currently affected.
[13:21:26] YuviPanda: I don't think I've got rights to do it now unless I use my staff account..
[13:21:33] Reedy: oh, I see
[13:21:44] and I guess you don't want to use your staff account :)
[13:22:00] It's not exactly a staff action... ;)
[13:22:15] what, why not?
[13:22:30] lol
[13:22:32] :P
[13:22:35] Is it?
[13:22:38] it is!
[13:22:39] I've really nfi
[13:22:43] jdi
[13:32:34] YuviPanda: so, what's first?
[13:32:40] hey andrewbogott
[13:33:02] andrewbogott: https://gerrit.wikimedia.org/r/#/c/162244/
[13:34:38] andrewbogott: want to just follow the chains from there?
[13:34:46] yep
[13:34:59] andrewbogott: cool.
[13:35:33] This all looks a lot better, btw
[13:35:53] andrewbogott: :D
[13:36:06] andrewbogott: yeah, and I've started moving things into an icinga module as well, refactoring as I go :)
[13:40:43] andrewbogott: can you also do find /etc/nagios-plugins and find /usr/lib/nagios/ and paste output on neon?
[13:41:02] Hm… YuviPanda is it easy for you to rebase that patch chain? I've hit a bump
[13:41:14] andrewbogott: yes, it is easy for me to rebase the chain
[13:41:28] andrewbogott: want me to?
[13:41:38] I think you should rebase and resubmit. I don't know why gerrit let me merge the first few and is now refusing
[13:41:59] andrewbogott: ah, right
[13:42:12] I hope it doesn't need a rebase after each one is merged :)
[14:23:24] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (22.22%)
[14:35:01] 'tools.tools'? webproxy again?
[14:35:17] Coren: yeah
[14:35:26] Coren: df and du don't agree on /var
[14:35:28] which is confusing
[14:36:09] That's expected. du sums up the number of blocks allocated, df tells you how many are in the free chain. There are tons of reasons why those two can diverge - df is the one you want to monitor
[14:36:39] Coren: let me check again, I remember them being quite a bit out of whack
[14:37:30] Coren: https://dpaste.de/0jgP
[14:38:24] Coren: out of whack in the sense, /var reports a 1.5G ./log, but /var/log reports it as being 1.7G, and also, /var/log is on a separate mount, but seems to be counted by df for /var
[14:39:35] df cannot possibly count other things than the real filesystem.
[14:39:41] Coren: aha! I see what's happening
[14:40:18] du, otoh, is not a good source of metrics. What it does is recursively add the reported 'number of blocks used by file'; it can't count most kinds of metadata, and can cross filesystem boundaries.
[14:40:49] Coren: right, but the discrepancy here was caused by an older /var/log that didn't have its contents fully cleaned out
[14:40:53] It's nice when you want to get an estimate or 'where is my space going?' but doesn't have value as a metric
[14:41:06] Coren: I checked that with a mount with bind, but I previously was checking it by mounting / rather than /var
[14:41:28] There thou goest. :-)
[14:42:01] Coren: cleaned up now
[14:42:46] Coren: that should prevent the flapping
[14:43:14] !log tools cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
[14:43:16] Logged the message, Master
[14:43:57] Coren: less flapping now :)
[14:44:23] Coren: there's also a Great Merging going on in -operations, and hopefully in the coming weeks I can have a generated hosts.cfg for tools/labs, so we can have better formatted error messages
[14:50:21] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[14:50:31] Coren: ^ yay.
[14:58:11] just wondering, is Labs vulnerable to CVE-2014-7169?
[14:58:32] Coren: ^
[14:59:10] Revi: is that the bash env leak?
[14:59:23] yeah
[14:59:33] https://bugs.launchpad.net/bugs/cve/2014-7169
[14:59:55] Revi: Not in general; instances which run bash shell scripts in unauthenticated environments where user input can end up in env variables might be.
[14:59:56] Revi: that should be handled -- most labs instances apply security patches automatically, and I also forced a package upgrade yesterday
[15:00:28] Revi: Also what andrew said (we probably duplicated work there because I also forced package upgrades) :-)
[15:00:29] ok, thanks!
[15:01:06] hmm, looks like there's two CVE for bash? :o
[15:01:33] 2014-6271 and 2014-7169
[15:01:42] Wikimedia Labs / deployment-prep (beta): Convert puppetmaster sync cronjob to Jenkins job - https://bugzilla.wikimedia.org/71305 (Bryan Davis) NEW p:Unprio s:enhanc a:None The beta cluster uses a cron job introduced in to fetch the latest operati...
[15:02:24] There are; and they are intrinsically related. It's fundamentally the same bug, only the first fix was incomplete in some patches.
[15:02:44] ok
[15:02:47] really thanks!
[15:03:17] But again, the vulnerability was fairly bad but requires a fairly stringent set of conditions to be usable, which did not apply to anything we do (but might have been an issue in some user-configured instances)
[15:03:45] or people running bash scripts over the web with CGI
[15:03:51] (like the recent labs-l question)
[15:04:13] "user-configured instances" :-)
[15:04:36] The better response to that question isn't "patch batch" but "don't do that - it's dangerous". :-)
[15:04:42] bash*
[15:43:09] Wikimedia Labs / deployment-prep (beta): Convert puppetmaster sync cronjob to Jenkins job - https://bugzilla.wikimedia.org/71305#c1 (Greg Grossmeier) p:Unprio>Normal (+1 to alerting both #-qa and #-operations)
[15:45:40] Wikimedia Labs / tools: Upgrade awk to 4.1.1 - https://bugzilla.wikimedia.org/71273#c2 (nejuje6tpztluvolq) The distribution of 4.0.1 to 4.1.1 is almost twice the size (4MB gzip) so there was a lot of changes between 2012 (4.0.1) and 2014 (4.1.1), largely C extensions. The only specific reason I can pro...
[15:47:27] Coren: so, I propose to install a second cert on neptunium (ldap-codfw as well as ldap-eqiad) and then you can fuss with neptunium until you're confident that you can switch back and forth between them. At the moment no one is using that server so if you break it we can just start over.
[15:48:28] That sounds good. I've been looking at the current setup and I gots me an hypothesis or two on how to proceed.
[15:54:20] Coren: also we need a test to answer the question 'which cert is neptunium serving?'
[15:54:23] do you know how to do that?
[15:55:08] openssl s_client works fine for that
[15:57:02] It even does SNI at need.
[15:58:08] hm… openssl s_client -connect ldap-eqiad.wikimedia.org:4444 is not returning what I expected...
[15:59:05] Add -showcerts to see the whole chain
[16:00:31] does it look right to you? All I see is a self-signed cert
[16:01:35] Yep.
[16:02:03] (Yep I see the same thing you do, not yep it looks right)
[16:03:09] So I guess 4444 isn't necessarily right since that's the admin interface. 389 doesn't work at all though
[16:03:19] I'm trying on virt1000 which we know to work -- seems the same
[16:03:34] So I suspect our test is invalid, maybe ldap:// does something different
[16:03:50] The admin interface may get a self-signed cert by default and that doesn't harm anything.
[16:04:22] Yeah, I don't think it's broken… just that this test isn't useful
[16:05:14] ... no. That works fine with known working SSL ports; tools.wmflabs.org for instance.
[16:05:58] Sorry, I'm not following. Probably don't know enough to know what questions to ask.
[16:06:17] We need a test that answers the question "What cert is the ldap server on host XXX providing" right?
[16:06:43] Wait, it works fine on virt1000. What are you seeing?
[16:07:08] Acceptable client certificate CA names
[16:07:09] /serialNumber=SwZniI9OuBaPfOVfL2HbNXkqTTN5FDSP/OU=GT85361712/OU=See www.rapidssl.com/resources/cps (c)14/OU=Domain Control Validated - RapidSSL(R)/CN=virt1000.wikimedia.org
[16:07:09] /C=US/ST=California/L=San Francisco/O=Wikimedia Foundation/CN=Wikimedia CA
[16:07:13] Is what I'm getting.
[16:08:09] Oh! "389 doesn't work at all though". Normal, since ldapS is 636. :-)
[16:08:33] ah, starttls redirects to a different port?
[16:08:38] Anyway, you're right, I see the correct behavior on virt1000
[16:08:53] So neptunium just isn't fully set up yet I guess… despite having seemed to totally work last night :/
[16:09:15] No, starttls renegotiates, but that requires protocol-specific knowledge. If you just want to test the certs, you want to connect to SSL directly. :-)
[16:10:38] ok! So, using 636 everything looks correct on all three servers: virt1000, ldap-eqiad, ldap-codfw
[16:10:52] So, lemme get this other cert installed
[16:11:17] * Coren needs lunch.
[16:11:20] BBIAB
[16:17:33] Coren: ok, both certs should be installed on neptunium now. Break away!
[16:25:39] Wikimedia Labs / tools: Upgrade awk to 4.1.1 - https://bugzilla.wikimedia.org/71273#c3 (nejuje6tpztluvolq) In case anyone is reading this in the future, the executable is here: /data/project/ext-lnk-discover/gawk-4.1.1/gawk (on Tools)
[16:53:53] Tool Labs tools / [other]: hikebikemap utf8 miscoding - https://bugzilla.wikimedia.org/71173#c7 (kakrueger) I am not exactly sure what happened to cause this issue in the first place, but I am hoping it should be fixed now. It might still take a couple of days to rerender all of the tiles that broke,...
[17:07:30] !log integration Disabled Jenkins slave integration-slave1006.eqiad.wmflabs to see if it is causing false failures {{bug|71314}}
[17:07:33] Logged the message, Master
[17:16:00] !log integration Added BryanDavis (self) as project member and admin
[17:16:02] Logged the message, Master
[17:19:34] !log integration Added BryanDavis and Ori.livneh to default sudo policy
[17:19:36] Logged the message, Master
[17:20:01] bd808: oh, nice
[17:20:18] bd808: ah, got made cloudadmin? :)
[17:20:21] ssh without sudo is useless
[17:20:37] YuviPanda: Yeah I guess I'm a cool kid now
[17:20:52] :D
[17:22:22] !log integration Forced puppet run on integration-slave1006. No changes applied which doesn't bode well for fixing the Jenkins failures.
[17:22:24] Logged the message, Master
[17:25:30] !log integration Restarted nslcd on integration-slave1006. Lots of "error writing to client: Broken pipe" in syslog
[17:25:31] Logged the message, Master
[17:36:26] !log integration Disk usage for / on integration-slave1006 at 90% vs 54% on integration-slave1001. Not sure where the difference is.
[17:36:29] Logged the message, Master
[17:44:03] !log integration Deleted 1G of /tmp/mw-ocg-latexer*/ files on integration-slave1006
[17:44:05] Logged the message, Master
[18:36:01] I'll look at the virtX NTP issues in a bit, once prod is sorted. their clocks can't drift that far that fast :)
[19:05:54] Coren: shall we work on ldap certs now? Or do you want to point me in the right direction and leave me to it?
[19:10:04] I'm feeling a bit under the weather atm and was thinking of picking that up later today possibly after a nap. I've been reading documentation with eyes glazing over for the past hour and am pretty sure I am no less stupider now than before I started.
[19:10:52] ok
[19:11:43] This task is vaguely coupled with the tampa shutdown, hence my being a bit jumpy about it. But there's definitely time for a nap :)
[19:12:07] andrewbogott: if you want to do something else, we can either merge more patches, or talk about host generation on labs (I've a fairly good, new idea). Or you can continue with ldap stuff :)
[19:13:35] When you say 'host generation'...
[19:14:29] andrewbogott: generating hosts.cfg file
[19:14:30] andrewbogott: for shinken
[19:14:35] andrewbogott: based on projects/instance names
[19:14:55] andrewbogott: I realized there's a generate() function in puppet, which runs a specific piece of code on the *puppetmaster*, rather than the host running it.
[19:15:14] andrewbogott: so if we used that, and then queried OS api from a script, we can use that to generate the hosts.cfg file pretty easily...
[19:16:11] yooo
[19:16:20] anybody know if there is a puppet var in labs for the current labs project?
[19:16:25] ottomata: yes!
[19:16:40] ottomata: $::instanceproject
[19:16:50] danke!
[20:21:39] Wikimedia Labs / tools: Upgrade awk to 4.1.1 - https://bugzilla.wikimedia.org/71273 (Andre Klapper)
[20:23:50] My labs-vagrant instance is giving me a "No wiki found" page, and I've no idea how to proceed. http://education.wmflabs.org/
[20:24:56] ragesoss: Probably this file permissions issue -- https://lists.wikimedia.org/pipermail/labs-l/2014-September/002949.html
[20:25:09] ragesoss: Fix with -- sudo chmod -R o+rX /srv/vagrant
[20:25:42] so, log into that box and then run that.
[20:25:45] got it, thanks!
[20:28:10] bd808: that worked, thanks much!
[20:28:36] Least I could do since I broke it in the first place ;)
[20:28:42] :D
[21:53:36] Coren: well, I've implemented the steps I /thought/ were necessary, and listed them here: https://wikitech.wikimedia.org/wiki/Ldap_rename doesn't work though :(
[22:08:11] Wikimedia Labs / Infrastructure: CatScan doesn't load (times out) - https://bugzilla.wikimedia.org/71336 (Nemo) NEW p:Unprio s:major a:None Since at least yesterday, I'm unable to load * http://tools.wmflabs.org/catscan2/ * https://tools.wmflabs.org/catscan2/catscan2.php or whatever variant...
[22:08:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=71336
[22:31:38] Does anyone know how git deploy / trebuchet is supposed to be used in deployment-prep?
[22:32:40] RoanKattouw: https://wikitech.wikimedia.org/wiki/Trebuchet#Using_Trebuchet_in_Labs
[22:33:17] well, deployment-prep already has a deployment server -- deployment-bastion.eqiad.wmflabs
[22:33:21] so just use it as you would tin
[22:39:53] Well that's what I thought
[22:39:59] But I get all sorts of permissions errors
[22:40:02] * RoanKattouw reads docs
[22:40:31] catrope@deployment-bastion:/srv/deployment/mathoid/mathoid$ git deploy start
[22:40:32] Failed to write lock file
[22:40:34] Failed to start deployment
[22:45:24] RoanKattouw: coming from: https://github.com/trebuchet-deploy/trigger/blob/master/trigger/drivers/trebuchet/local.py
[22:45:47] see lines 158-160
[22:46:01] so .git/{deploy,lock} i think
[22:48:19] !log deployment-prep Fixed permissions of deployment-bastion:/srv/deployment/mathoid/mathoid/.git/deploy (needed g+w)
[22:48:22] Logged the message, Mr. Obvious
[22:49:14] RoanKattouw: That's some bug we have in labs. Once the perms are fixed for a new repo they seem to stay
[22:50:02] salt-call -G deployment_target:*myrepo* test.ping
[23:05:23] Anyone awake? I am in Tasmania, Australia. Is there a hardening guide for an instance running apache on fulltext.eqiad.wmflabs.org ?
[23:05:40] Is this the correct place to ask?
[23:06:08] for: https://wikitech.wikimedia.org/wiki/Nova_Resource:Full-text-reference-tool/Documentation
[23:08:35] RoanKattouw: https://gist.github.com/atdt/7948709d47f0f9557f73 <-- the conclusion of repeated frustrations
[23:08:42] a little evil but.. :)
[23:12:11] fsainsbu: This is the right place, but, can you explain what you mean by 'hardening guide'?
[23:14:17] well default apache on ubuntu is a bit security wise loose. different groups have different standards, checklists, have wiki got one?
[23:18:42] fsainsbu: not really -- since the labs instances are based on an image that we built and apply our production puppet classes (mostly) they should be generally pretty secure.
[23:18:50] Unless you actively break that by installing other things :)
[23:19:03] ok, thanks
[23:19:41] fsainsbu: if you encounter specific security issues, please let us know!
[23:20:18] Other issue is a puppet script to install from a git repository a piece of js, have you an example for a beginner with puppet?
[23:22:42] I googled http://livecipher.blogspot.com.au/2013/01/deploy-code-from-git-using-puppet.html need to check which puppet modules exist
[23:22:58] self answered I hope
[23:29:03] fsainsbu: there's a git class in our puppet code, git::clone that should probably work for you
[23:37:55] "there is a job named 'cron-tools.deltaquad-bots-2' already active"
[23:38:03] anyone know how to fix that? ^^
[23:38:27] my inexperience with jsub isn't helping