[00:00:07] Betacommand: The outfit owning tweetmeme. Ominous name.
[00:00:14] Ah
[00:00:19] thanks
[00:01:32] @Coren, Should it look like this? "cp may31-jun14-revisions.txt /data/project/tantan-www/public_html"
[00:08:44] tantan: Yes, that should work, and looking at the directory, it did :-).
[00:13:01] @scfc_de it does not seem so to me. My shell id is sayantan-13 and I cannot see anything here "http://tools.wmflabs.org/sayantan-13-www/public_html". The cp is failing too :'(
[00:17:11] tantan: "and they should appear ."
[00:19:11] there's some trick for project-specific SAL logging, right?
[00:19:26] @scfc_de @Coren, i was making the most stupid mistake possible.
[00:19:29] !log deployment-prep
[00:19:29] is it !log message ?
[00:19:34] cool, thanks
[00:19:38] thanks for being patient.
[00:20:54] !log integration Killed stuck beta-update-databases-eqiad job
[00:20:56] Logged the message, Master
[00:21:02] * ori is still crafting a log message
[00:21:54] !log deployment-prep beta broke due to I433826423. app servers load prod apache confs from /etc/apache2/wikimedia. temp fix: locally hack apache2.conf to load /usr/local/apache2/conf/all.conf; disable puppet.
[00:21:56] Logged the message, Master
[00:22:17] !log deployment-prep Killed stuck beta-update-databases-eqiad job (stuck for over 60m waiting for executor; deadlock?)
[00:22:18] Logged the message, Master
[00:26:36] !log integration Manually triggered beta-update-databases-eqiad and watched it succeed
[00:26:40] Logged the message, Master
[00:30:57] !log deployment-prep removed local l10nupdate user from deployment-jobrunner01 and deployment-videoscaler01
[00:30:59] Logged the message, Master
[00:50:04] 3Wikimedia Labs / 3deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - 10https://bugzilla.wikimedia.org/65591#c5 (10Bryan Davis) $ ldapsearch -x uid=mwdeploy \* + # extended LDIF # # LDAPv3 # base (default) with scope subtree # fi...
[00:54:56] no one complained that the shared pywikibot is broken on tools?
[00:56:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=67488
[01:04:35] 3Wikimedia Labs / 3deployment-prep (beta): mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - 10https://bugzilla.wikimedia.org/65591#c6 (10Bryan Davis) From deployment-salt: $ getent passwd|cut -d: -f7|sort|uniq -c 4519 /bin/bash 11 /bin/false 18 /bin/sh...
[03:02:10] !ping
[03:02:10] !pong
[03:02:42] !ping
[03:02:42] !pong
[03:02:44] hmm
[03:02:46] !python
[03:02:46] There are multiple keys, refine your input: pythonguy, pythonwalkthrough,
[06:32:22] 3Wikimedia Labs / 3tools: Install rake for Tools Labs - 10https://bugzilla.wikimedia.org/68208 (10OverlordQ) 3NEW p:3Unprio s:3enhanc a:3Marc A. Pelletier Intention: Use rake to manage software tasks Reproducible: Always
[08:38:21] 3Wikimedia Labs: In wikibooks - Miscellaneous - 10https://bugzilla.wikimedia.org/68210 (10Ara Housepian) 3UNCO p:3Unprio s:3minor a:3None Intention: just Reading Steps to Reproduce: 1. http://en.wikibooks.org/wiki/Main_Page in Chrome browser 2. click on Miscellaneous it will redirect you to the correc...
[09:48:17] Your webservice is scheduled: np_load_avg=95.830000 (= 95.830000 + 0.50 * 0.000000 with nproc=2) >= 2.00
[10:22:19] 3Wikimedia Labs: In wikibooks - Miscellaneous - 10https://bugzilla.wikimedia.org/68210#c1 (10Tim Landscheidt) 5UNCO>3RESO/INV Thanks for your report! Unfortunately, this is the bug tracker for the Wikimedia Labs project, which is unrelated to the issues you are seeing. Please report them at https://en.wik...
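The mwdeploy shell mismatch in bug 65591 above is diagnosed with standard directory tools; a minimal sketch of the comparison, assuming anonymous LDAP binds work as they do in Bryan's quoted session:

```bash
# Compare what LDAP stores for the account with what NSS actually resolves:
ldapsearch -x uid=mwdeploy loginShell     # the directory's view of the login shell
getent passwd mwdeploy | cut -d: -f7      # the system's view via NSS

# Shell distribution across all accounts, as quoted from deployment-salt:
getent passwd | cut -d: -f7 | sort | uniq -c
```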
[10:43:05] 3Wikimedia Labs / 3tools: Install rake for Tools Labs - 10https://bugzilla.wikimedia.org/68208 (10Tim Landscheidt) 5NEW>3ASSI a:5Marc A. Pelletier>3Tim Landscheidt
[10:44:34] 3Wikimedia Labs / 3tools: Install rake for Tools Labs - 10https://bugzilla.wikimedia.org/68208#c1 (10Tim Landscheidt) Do you need rake only for interactive use (tools-login/tools-dev), or do you want to use it in grid jobs/web applications as well?
[12:16:46] (03CR) 10Tim Landscheidt: become: Make more user-friendly (031 comment) [labs/toollabs] - 10https://gerrit.wikimedia.org/r/147096 (https://bugzilla.wikimedia.org/68156) (owner: 10Tim Landscheidt)
[12:21:22] (03PS4) 10Tim Landscheidt: become: Make more user-friendly [labs/toollabs] - 10https://gerrit.wikimedia.org/r/147096 (https://bugzilla.wikimedia.org/68156)
[12:32:08] 3Wikimedia Labs / 3tools: Shared version of pywikibot on tools.wmflabs.org is broken - 10https://bugzilla.wikimedia.org/68215 (10Yann Forget) 3NEW p:3Unprio s:3critic a:3Marc A. Pelletier The shared version of pywikibot on tools.wmflabs.org is broken. https://bugzilla.wikimedia.org/show_bug.cgi?id=...
[12:42:55] Coren: has something recently changed with webservers and job submissions?
[12:50:31] Betacommand: Nothing should have. What's up?
[12:53:33] Coren, someone's been asking in -tech about https://bugzilla.wikimedia.org/show_bug.cgi?id=67488
[12:53:49] Any idea who would be the right person to revert this on tools?
[12:54:20] oh, right, he just made https://bugzilla.wikimedia.org/68215 as well
[12:54:20] Coren: I had a CGI script that submitted a job if the last run was more than 10 minutes old. Now it's still showing the age but not submitting the job. Wanted to check server settings before I started debugging
[12:56:01] Krenair: It appears valhallasw is away; technically, probably any root can do that, the question is which version to revert to for good :-). As the repo seems to be auto-updated, fixing this in pywikibot seems much more prudent.
[12:56:06] Betacommand: Nothing changed that I can see, and as far as I can tell the grid is healthy.
[12:56:26] Coren: np
[12:56:59] Krenair: Valhallasw is the one who maintains that, I think, but it pulls automagically from the repo.
[12:57:12] At least at intervals.
[12:57:43] Betacommand: It looks as if tools-webgrid-{03,04} are not submit hosts. Let me fix that.
[12:58:27] (Don't know where I took tools-webgrid-04 from :-).)
[12:58:51] !log tools Made tools-webgrid-03 a grid submit host
[12:58:51] scfc_de: grr, knew it was something on the server end :P
[12:58:54] Logged the message, Master
[12:59:13] Coren: I always discover the tools bugs
[13:00:36] Betacommand: Ah; hm. When those were added, that was probably overlooked. You never saw the change because your webservice probably ran on -01 or -02 until recently.
[13:01:21] Betacommand: Now if you had provided a meaningful error message instead of "not submitting the job" ... I'm pretty sure it said something about not being allowed to submit a job or so.
[13:01:26] Coren: Just my luck to have yet another bug discovered
[13:01:51] scfc_de: I wasn't sure what was happening
[13:02:18] Betacommand: Did you check out bigbrother yet?
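For reference, the gridengine side of scfc_de's submit-host fix above is a short admin sequence; a sketch, assuming admin rights on the tools grid master:

```bash
qconf -ss                           # list the hosts currently allowed to submit jobs
qconf -as tools-webgrid-03          # add the web node so jsub/qsub work from it
qconf -ss | grep tools-webgrid-03   # confirm the addition took
```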
[13:03:33] Coren: I haven't had a chance but it looks very useful, won't really have time to play with it until Sunday at the earliest
[13:06:50] scfc_de: http://pastebin.com/9dCGG6LQ was the expected output, I was only getting the first line, and was checking for any tools environment/policy changes before I started the debug process to find out why it wasn't submitting correctly
[13:09:17] that auto-magical upgrade of pywikibot is a bad idea, from time to time it'll break, but the recommended setup is to use the shared version
[13:15:43] Betacommand: Tested just now: If run on a non-submit host, jsub complains: "Unable to run job: denied: host "tools-dev.eqiad.wmflabs" is no submit host." in *.err. If you want it on stderr, you can use the option "-stderr". What I meant is: This isn't called debugging, it's reading the output :-).
[13:16:36] phe: I would rather say that pywikibot needs to review its review process if that's broken :-).
[13:18:46] scfc_de: looks like tools-webgrid-04 does exist and does have the submit issue
[13:23:30] Betacommand: Indeed?! Some days ...
[13:23:59] scfc_de: I checked the .err log :P
[13:24:09] Betacommand: See! :-)
[13:24:15] !log tools Made tools-webgrid-04 a grid submit host
[13:24:17] Logged the message, Master
[13:26:57] Coren: just noted an access spike from the 360Spider UA. Some quick research suggests you might want to block it at the server level
[13:38:18] scfc_de: shouldn't the submit host config be in puppet?
[13:38:27] or is this also one of those weird SGE things?
[13:40:30] YuviPanda: My plan is to rename the submithost_* bits in puppet to bastionhost_*, make a new submithost_* = bastion + webnode, then make a Puppet rule on tools-master: qconf -as $SUBMITHOST if $SUBMITHOST not in qconf -ss.
[13:40:47] ah
[13:40:48] nice
[14:26:03] gifti: Your host tools-exec-gift is again overloaded (> 180). Could you fix that, please?
[14:29:45] !log tools admin: Set up .bigbrotherrc for toolhistory
[14:29:48] Logged the message, Master
[15:11:00] scfc_de: um, ok, what's so bad about it at all? puppet not run? any impact on others?
[15:19:48] gifti, http://tools.wmflabs.org/?status shows "tools-exec-gift Load: 9313%" and the only task seems to have been unable to start for two days
[15:20:10] hubby
[15:22:18] well, that's only poor display, it actually runs
[15:23:49] some monitoring update broken ;(
[15:24:20] it doesn't handle array jobs
[15:31:34] gifti: No, it doesn't impact others (AFAICS), but it's clearly "not right". What do you mean by "it doesn't handle array jobs"?
[15:32:49] oh, the status page shows no memory/cpu for dwl3 on exec-gift and none of the 200 sub-jobs (which also on some level is not quite right)
[15:35:44] hm, the status page takes very long to load for me …
[15:44:21] to fix the load issue, i would need a better domain throttling algorithm: atm i set a redis key with the expiry of the throttle time (1 second) for the domain if unset and check the url; if the key is "in use" i reappend the url to the list. when it comes to the end of the list and the domains get fewer (1 atm), the rotation is very fast and the load explodes
[15:46:01] i could just wait for the throttle time, but then i cannot check other urls in the meantime for that job
[15:52:33] gifti: you could use a two-queue system, or a retry counter
[15:54:06] If the counter is > 3, add to a slow queue that only checks every 3 seconds or something
[15:58:15] Betalabs down? Oh, DNS problems.
[15:58:17] Christ
[16:05:24] marktraceur: still? Can you give me an example?
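A compact sketch of the per-domain throttle gifti describes above; the key and list names are hypothetical stand-ins, and SET with EX and NX performs the claim-and-expire in one atomic step:

```bash
# Claim the domain for one second; on failure, requeue the URL and move on.
domain="example.org"             # hypothetical domain
url="http://example.org/page"    # hypothetical queue entry
if [ "$(redis-cli SET "throttle:$domain" 1 EX 1 NX)" = "OK" ]; then
    echo "checking $url"         # stand-in for the real per-URL work
else
    redis-cli RPUSH urls "$url"  # domain still throttled: back of the queue
fi
```

Betacommand's retry-counter refinement would INCR a per-URL counter on each requeue and divert the URL to a slower queue once the counter passes a threshold.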
[16:05:47] andrewbogott: http://en.wikipedia.beta.wmflabs.org/
[16:06:26] marktraceur: http://en.wikipedia.beta.wmflabs.org/wiki/Special:Version works for me
[16:06:32] Werd.
[16:06:38] I think there may be some apache issues
[16:06:51] yeah, seems like this must be something internal to beta
[16:06:57] Ori temp-hacked the apache config last night
[16:07:09] https is broken, and the redirect from /
[16:07:12] http://en.wikipedia.beta.wmflabs.org/ sure is a graceful failure message :)
[16:07:48] ok, i think i might not be faster at all with my approach, so i'll switch to simple throttling, beginning with the next run, scfc_de
[16:08:26] marktraceur, andrewbogott: Giuseppe has been refactoring the apache config in operations/puppet and yesterday it got to a point where it nuked our custom apache configs in beta.
[16:09:22] puppet became sentient yesterday?
[16:09:31] gifti: Thanks!
[16:09:56] marktraceur: I'm playing tic tac toe with it now :)
[16:10:33] Next up, thermonuclear war
[19:53:18] andrewbogott: Can you raise the quota for the integration labs project so I can build 2 more m1.large instances?
[19:53:35] Krinkle: Yep, just a second...
[19:53:49] (the error message "Failed to create instance." was cryptic, I'm only guessing the quota is the problem)
[19:54:18] going to migrate integration-slaves for Jenkins production (which we host in labs) to Trusty. Currently 3, will add 2 with trusty, then rotate the others away and eventually delete those instances.
[19:54:25] But need the resources while migrating/testing.
[19:56:34] Krinkle: try now?
[19:57:27] andrewbogott: works :)
[19:57:50] Thx
[19:59:36] !log integration Setting up integration-slave1004 to be the first Trusty-based (w/ nodejs 0.10) Jenkins slave
[19:59:38] Logged the message, Master
[20:10:47] hi
[20:10:48] https://bugzilla.wikimedia.org/show_bug.cgi?id=68215
[20:10:59] this is critical, broken since 13-07, and needs an urgent fix
[20:11:55] ^ who takes care of the shared pywikibot source? legoktm?
[20:14:02] Krinkle: be aware that trusty has a very different version of PHP (5.5.9-1ubuntu4.3) vs precise (5.3.10-1ubuntu3.10+wmf1)
[20:14:25] 5.5?
[20:14:26] I don't know if that will break anything or not, but it might
[20:14:31] Hm..
[20:14:39] it won't break stuff, but it'll make things work that shouldn't.
[20:14:45] :)
[20:15:24] people will start sneaking in traits
[20:22:18] yannf: is there a fix in the upstream version of pywikibot? It would be better to move forward to a fixed version than back...
[20:22:34] andrewbogott, no idea
[20:23:18] I believe a fix might take days
[20:23:44] but some important tools are broken now
[20:45:07] yannf: I can't tell where pywikibot comes from… it sounds like John Vandenberg is on top of it though.
[20:45:13] Is the bug that you just created different from the one it links to?
[20:45:44] andrewbogott, I was told to create a new one
[20:45:52] why, by whom?
[20:46:04] one relates to the code, one to the version installed
[20:47:00] https://bugzilla.wikimedia.org/show_bug.cgi?id=68215 Product: WM labs
[20:47:23] https://bugzilla.wikimedia.org/show_bug.cgi?id=67488 Product: pywikibot
[20:47:26] Ah, I see
[20:57:06] bd808: should I be running labs-vagrant on precise or on trusty these days? I can't get a clean run on either atm
[20:58:16] andrewbogott: It *should* work on either, but on precise you have to manually check out the precise-compat branch
[20:58:27] ok...
[20:58:41] bd808: happen to have the syntax for that handy?
[20:58:55] Let me figure it out...
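bd808 digs up the exact command just below; the conventional form, assuming the labs-vagrant clone lives in /vagrant and the branch exists on the origin remote, would be:

```bash
cd /vagrant
git fetch origin
git checkout --track origin/precise-compat   # creates a local branch tracking the remote one
```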
[20:59:02] (also, couldn't the puppet class that manages labs-vagrant select the proper branch?)
[20:59:34] I have to go
[21:00:02] andrewbogott: YuviPanda|zzz started to work on making the role check out the right branch but ran into some sort of difficulty
[21:00:13] ok
[21:00:16] andrewbogott, actually I asked on behalf of phe
[21:00:32] who runs the bot
[21:01:46] andrewbogott: cd /vagrant && git checkout -b precise-compat precise-compat # not tested
[21:03:13] bd808: that seems to be working better… so far
[21:03:16] thanks
[21:19:49] 3Wikimedia Labs: proxy'd labs MediaWiki instance times out contacting itself to runjobs - 10https://bugzilla.wikimedia.org/63338#c2 (10Andrew Bogott) So, I set up an instance with mediawiki named proxytest-singlenode with a proxy at proxytest-singlenode.wmflabs.org. When logged into that instance, I can do th...
[21:20:11] spagewmf: ^ ?
[21:38:04] 3Wikimedia Labs: WMFLabs: Delete instance failed to "remove its DNS entry" - 10https://bugzilla.wikimedia.org/62770#c1 (10Andrew Bogott) This is a race that causes occasional failures in the ldap deletion. I've worked on it a fair bit but have made no progress.
[21:39:43] petan: Do you still want a wm-bot project as per https://bugzilla.wikimedia.org/show_bug.cgi?id=55691? Or may I close that bug?
[21:40:04] if it can stay in the bots project, I don't mind :P
[21:40:13] I created instance "wm-bot" there
[21:40:28] maybe one day when we have a "rename project" feature we can just rename bots to wm-bot
[21:41:34] 3Wikimedia Labs: Labs proxy seems to be running horribly slowly - 10https://bugzilla.wikimedia.org/62483#c6 (10Andrew Bogott) 5NEW>3RESO/FIX This seems fixed to me. Closing pending a new report.
[22:35:04] 3Wikimedia Labs: Initial instance creation leaves a non-Puppet-controlled, dysfunctional gmond process behind - 10https://bugzilla.wikimedia.org/64216#c1 (10Andrew Bogott) Looks to me like this only happens on Precise instances. Since I'm resolved not to build new Precise images, I'm inclined to mark this as...
[22:38:07] 3Wikimedia Labs: Initial instance creation leaves a non-Puppet-controlled, dysfunctional gmond process behind - 10https://bugzilla.wikimedia.org/64216#c2 (10Tim Landscheidt) 5NEW>3RESO/WON a:3Andrew Bogott Makes sense (having to trust you for the situation with Trusty instances, though :-)).
[22:38:50] andrewbogott: Hm.. It seems hashar didn't document the self-hosted puppetmaster setup for integration (which is used in production). Maybe you have an idea of how it can be updated? I'm running 'sudo puppet agent -t' on integration-slave1004 but it keeps throwing an error that was fixed in operations/puppet a few minutes ago.
[22:39:11] I updated the git repo of operations/puppet on integration-puppetmaster to the latest upstream and ran the puppet agent there, which went without errors.
[22:39:26] What do you mean by 'which is used in production'?
[22:40:03] The instances of the integration project are used by our production Jenkins setup to execute jobs.
[22:40:21] The steps you describe sound correct to me.
[22:40:22] But, I will look
[22:41:16] On integration-slave1004.eqiad.wmflabs 'sudo puppet agent -t' throws "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[imagemagick] ..." This was a regression fixed moments ago.
[22:41:35] On integration-puppetmaster.eqiad.wmflabs, I've pulled his fix into /var/lib/git/operations/puppet.
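A sketch of the self-hosted puppetmaster update Krinkle describes; the path comes from the log, while the remote name, branch name, and rebase step are assumptions about keeping local cherry-picks on top of upstream:

```bash
# On integration-puppetmaster (the role::puppet::self master):
cd /var/lib/git/operations/puppet
sudo git fetch origin
sudo git rebase origin/production   # replay local cherry-picks onto the new upstream

# Then re-run the agent on the slave to pull the fixed catalog:
sudo puppet agent -t
```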
[22:42:00] andrewbogott: https://gist.githubusercontent.com/Krinkle/0d3eca90e2b79fd6034d/raw/
[22:42:09] | * | f678ce2 - contint: fix duplicate definition of imagemagick (20 minutes ago)
[22:42:43] Krinkle: The catalog is compiled for each instance separately, so a Puppet run on the puppetmaster could succeed while failing on slave1004.
[22:42:53] Yes
[22:42:58] (Because only slave1004 triggers the duplicate definition.)
[22:43:02] Yep
[22:43:03] …which of those patches is the fix?
[22:43:11] The one I mentioned
[22:43:35] scfc_de: Yep, but slave1004 should pull its definitions from the master on each run, right?
[22:43:47] Or do I need to purge a cache somewhere?
[22:44:11] Krinkle: No, AFAIK there's no caching, especially with -t.
[22:44:17] the error didn't happen on the puppetmaster because it didn't include the class there
[22:44:51] But I did verify that the latest puppet run on the master applied this patch and others (because of other changes that it included)
[22:45:20] k
[22:45:48] e.g. to rule out that the directory I updated the git repo in is unrelated or not used by puppet
[22:46:11] Of course there's the possibility that the fix doesn't fix it :-).
[22:46:13] Krinkle: It appears that the class that causes that error was applied before role::puppet::self
[22:46:17] so the latter was never applied there
[22:46:19] is that possible?
[22:47:03] I presume that the imagemagick patch isn't actually merged in gerrit...
[22:47:10] It is
[22:47:46] andrewbogott: the instance is brand new; after I created it, I reconfigured it to enable puppet-self and role-ci-labs
[22:47:58] then I logged on to force a puppet run and encountered the dupe package, then I had ops fix it
[22:48:01] Right -- but since there's an error in role-ci-labs, it can't compile…
[22:48:07] hence never applied role::puppet::self
[22:48:21] So if you turn off role-ci-labs it should be able to switch puppet masters
[22:48:23] and rerunning puppet again produced the same error. So I figured it's probably because it has its own puppet master, so I updated the operations git repo on our own puppet master to include the latest patches
[22:48:34] aha
[22:49:08] andrewbogott: The patches we have don't affect this class though, so applying it straight from production/labs is fine too. Is there something keeping those from applying to the instance?
[22:49:20] the plain role-ci-labs class should be fine now in production.
[22:50:22] I guess the default puppet master labs instances use doesn't auto-update from git?
[22:51:08] It does
[22:51:21] but, it's not clear to me that Daniel's patch will work -- it presumes a particular order of operations.
[22:51:55] andrewbogott: Sure? I think the auto-update is still something bd808 (?) is working on, but not deployed.
[22:52:14] what do you mean by 'default puppet master labs instance'?
[22:52:45] scfc_de: I have something that "mostly works" for beta. Still not sure if it's generically useful
[22:53:36] andrewbogott: I understood Krinkle to mean standard role::puppet::self; and that does not auto-update AFAIK.
[22:53:49] ...
[22:54:08] The integration puppetmaster I updated myself
[22:54:20] I mean the puppetmaster labs uses by default
[22:54:27] Yep, that's virt1000
[22:54:40] It is always in sync with the gerrit upstream
[22:54:48] does that have a fixed git clone that someone needs to update or does it receive updates from production?
[22:54:54] Ah, okay. That makes sense.
[22:55:32] I've disabled the ci labs role, re-ran puppet (no errors), re-enabling now.
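Two quick sanity checks for this kind of "fixed upstream, agent still failing" situation; the commit id is the one from Krinkle's gist, and the paths assume stock Puppet and role::puppet::self locations:

```bash
# On the slave: which master is the agent actually talking to?
sudo grep 'server' /etc/puppet/puppet.conf

# On the master: does its working tree contain the fix?
cd /var/lib/git/operations/puppet
git branch --contains f678ce2   # lists branches that include the commit
```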
[22:56:19] Krinkle: My advice would be to create the instance, wait for the first puppet run, apply role::puppet::self, force a puppet run, sign the key on the puppetmaster, force a puppet run, and then continue to apply additional roles.
[22:56:39] "sign key on puppetmaster" ?
[22:57:06] * bd808 finds command
[22:57:06] I've done all except that. And of course the extra step of first updating the operations/puppet repo on our puppetmaster
[22:57:24] Exiting; no certificate found and waitforcert is disabled
[22:57:42] (after I re-enabled the ci-labs class)
[22:58:04] Yeah. Now log in to the puppet master and run: sudo puppet ca list
[22:58:15] you should see a new cert waiting to be signed
[22:58:25] sign it with: sudo puppet ca sign INSTANCE_NAME.eqiad.wmflabs
[22:59:13] It only lists one ( i-000004b1.eqiad.wmflabs (SHA256) ...:..:..)
[22:59:33] Ah, I see.
[22:59:45] It's listing the ones not yet signed, not all the ones signed.
[22:59:49] yes
[22:59:51] (else there'd be three)
[22:59:55] (or four, including the new one)
[22:59:56] OK
[23:00:51] * bd808 thinks that role::puppet::self should set up auto-signing
[23:01:15] forcing a new puppet run on slave1004 did all kinds of chmod changes
[23:01:18] Notice: /File[/var/lib/puppet/lib/puppet/provider/database_user]/mode: mode changed '0755' to '0775'
[23:01:18] Notice: /File[/var/lib/puppet/lib/puppet/provider/database_user/mysql.rb]/mode: mode changed '0644' to '0664'
[23:01:19] etc.
[23:01:27] and then the same duplicate resource error
[23:01:52] Ha, yeah
[23:02:05] andrewbogott: bd808: Not only did the patch assume a certain load order,
[23:02:12] it assumed the wrong one
[23:02:29] the error clearly says mediawiki/packages.pp is trying to redefine it
[23:02:42] * Krinkle facepalm
[23:07:20] Krinkle: if you want to write the equivalent patch for packages.pp I'll review
[23:10:26] Done
[23:11:22] Krinkle: ok, you will need to re-update the integration puppet master of course :/
[23:11:36] yeah
[23:12:41] Ugh..
[23:12:42] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[php-apc] is already declared in file /etc/puppet/modules/contint/manifests/packages.pp:170; cannot redeclare at /etc/puppet/modules/mediawiki/manifests/packages.pp:12 on node i-000004b1.eqiad.wmflabs
[23:12:45] Why can't it tell me that ahead of time?
[23:15:35] :(
[23:39:21] Krinkle: did you test that in integration?
[23:39:28] You can cherry-pick onto the puppet master there
[23:39:34] (It looks good to me...)
[23:39:43] andrewbogott: I'll test it there, thanks
[23:40:00] This is the first time I'm using the integration puppetmaster (I didn't know hashar had put one in place)
[23:40:08] It's nice having a place to test puppet stuff
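Krinkle's "Why can't it tell me that ahead of time?" has a workaround: the catalog can be compiled on the master without involving the agent, which surfaces duplicate-declaration errors before any run; a sketch using the node name from the error above, assuming Puppet 3-era tooling:

```bash
# On the puppetmaster: compile the node's catalog and discard the output,
# keeping only compilation errors such as duplicate declarations.
sudo puppet master --compile i-000004b1.eqiad.wmflabs > /dev/null
```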