[00:01:47] nm, found it anyway.
[00:10:20] huh, this time puppet ran cleanly three times in a row before producing that same failure
[00:14:48] Well, Coren, I have to go, but I am much less stuck than I was before. Thank you!
[00:45:26] !log services Joined project to help debug Trebuchet setup
[00:45:30] Logged the message, Master
[00:52:51] !log services - raised instance quota to 15
[00:52:52] Logged the message, Master
[01:51:06] Coren: Just watched your talk with Ryan at Wikimania. very good :)
[01:58:02] Wait, Hong Kong's?
[03:44:20] YuviPanda|zzz: congrats!
[06:50:18] fwilson: :) ty
[07:40:59] Hello, the recovering mentioned in the topic is still running ? Jsubing a process, the job stops after some times, without anything in my .err file, should I dig or just wait (the same process worked fine yesterday) ?
[08:05:57] ok, it seems to be ok, my job is running
[10:09:24] Tool Labs tools / Other: Shared version of pywikibot on tools.wmflabs.org is broken - https://bugzilla.wikimedia.org/68215#c7 (John Mark Vandenberg) NEW>RESO/FIX This should have been fixed a while ago. Reopen if there is a problem.
[11:52:22] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (22.22%)
[11:53:03] bah
[11:53:05] THAT THING
[11:53:33] * YuviPanda goes to find fixes
[11:57:03] Coren: andrewbogott_afk something on tools-webproxy is fucked up
[11:57:22] /dev/vda2 1.9G 1.8G 0 100% /var
[11:57:43] but du tells me
[11:57:43] https://dpaste.de/yjYb
[11:57:54] it also has /dev/mapper/vd-logfile--disk 7.9G 551M 7.0G 8% /var/log
[11:58:06] I checked out old /var before the mount with a mount --bind, and the log there is empty
[11:58:31] and there du tells me it's only 335M
[11:58:35] so I'm not sure wtf is going on
[11:59:53] BAM
[11:59:54] found it
[12:00:01] nginx had kept open file handles to the log files
[12:00:05] and those were showing up in du
[12:00:06] err
[12:00:06] df
[12:00:07] and not du
[12:00:22] an nginx restart fixed things.
[12:02:44] either way, icinga should be happy soon
[12:31:49] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[13:57:56] Wikimedia Labs / tools: Dumps not updating again. - https://bugzilla.wikimedia.org/72154 (Andre Klapper)
[15:23:28] Coren: I just started my webservice again. However i wasnt getting a "No web service" error message, just a blank page
[15:38:57] Your webserver was probably up but broken because of yesterday's NFS outage. :-(
[15:50:34] Coren: I did a webservice stop
[15:50:44] Webservice not running.
[15:52:54] Hm. Annoying. It means that because of the outage the mechanism to notice didn't work right. Hm.
[15:54:20] It's an edge case, but it's annoying because it's not clear if there are many others stuck that way.
[17:31:15] bd808: you've a malfunctioning cronjob on deployment-prep
[17:31:17] mv: cannot stat ‘/tmp/hhvm.*.core’: No such file or directory
[17:31:18] mv: cannot stat ‘/var/log/hhvm/stacktrace*’: No such file or directory
[17:31:21] over and over again... :)
[17:32:40] YuviPanda: fun
[17:32:55] like, once every two months
[17:32:59] and coming from your account :)
[17:33:01] That sounds like the old job I put in to sweep hhvm stack traces
[17:33:10] bd808: just got added to the root alias, now I've a billion emails
[17:33:15] heh
[17:33:47] bd808: so instead of ignoring I'm tryihng to do a level of cleanup
[17:34:01] YuviPanda: Good for you. Which hosts, do you know?
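(The df/du mismatch chased down above is the classic case of a daemon holding deleted files open: the space only comes back once the file handles are closed. A minimal, generic sketch of how to confirm and clear it; these are stock commands, not necessarily the exact ones run on tools-webproxy.)

    # Compare what the filesystem reports with what du can still find on disk.
    df -h /var
    sudo du -sh /var

    # List open-but-deleted files (link count 0); nginx workers still holding
    # rotated or removed logs under /var show up here.
    sudo lsof +L1 | grep '/var'

    # Make nginx reopen its log files (or restart it, as was done above),
    # which drops the stale handles and frees the space.
    sudo nginx -s reopen        # equivalent effect: sudo service nginx restart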
[17:34:19] bd808: looks like deployment-mediawiki01 and 02, at least
[17:34:28] yeah that seems likely
[17:34:33] I'll check it out
[17:34:39] cool
[17:34:47] */2 * * * * /home/bd808/cleanup-hhvm-cores
[17:35:19] ah, heh :)
[17:36:39] YuviPanda: I added "&>/dev/null" to swallow errors when there aren't any cores to clean up
[17:36:54] yay, cool!
[17:36:55] thanks bd808
[17:40:45] bd808: hmm, did you add something to labs-vagrant that sets up git update via cron?
[17:41:13] * bd808 can't remember...
[17:42:34] YuviPanda: Not in the role that I can see -- https://github.com/wikimedia/operations-puppet/blob/production/modules/labs_vagrant/manifests/init.pp#L42-L48
[17:42:38] hmm
[17:42:38] ok
[17:43:08] There are ways to do that but they are tricky
[17:43:13] yeah
[17:43:20] but I think spagewmf has a cronjob on hosts he likes
[17:43:24] because folks may have local changes.
[17:43:41] I have a cron job for beta's puppetmaster that could be adapted
[17:44:21] YuviPanda: This thing is pretty cool and could be made more general -- https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/files/git-sync-upstream
[17:44:43] bd808: indeed, one of spagewmf's crons is erroring out because it's not on a branch and git pull has no idea what to do
[17:45:12] I use that in beta and some other self-hosted puppetmaster to track operations/puppet
[17:45:17] oooh, nice
[17:45:28] indeed, having a 'auto rebase' option for self hosted puppetmasters would be nice
[17:45:58] https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/manifests/autoupdater.pp
[17:46:20] wrong link
[17:46:36] YuviPanda: here it is -- https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/manifests/puppetmaster/sync.pp
[17:47:24] bd808: created a phab task
[17:47:28] The script expects a certain workflow, namely that you commit local hacks and then rebase them on the production branch
[17:48:01] bd808: yeah, should probably do something slightly different, like abort if there are uncommited local changes
[17:48:12] It stashes!
[17:48:16] And the restores them
[17:48:20] *then
[17:48:39] yes, but if it conflicts on a stash pop, then we abort everything, put things back to the way they were
[17:48:55] yeah. What's missing is reporting of that
[17:49:06] it writes a log but that is all
[17:49:43] yeah
[17:49:50] I'm sure something better can be built but it works 99% of time in beta
[17:49:50] bd808: I created a phab task
[17:49:53] so I won't forget
[17:50:02] * bd808 nods
[17:55:13] dr0ptp4kt: heya!
[17:55:18] dr0ptp4kt: zero errors on betalabs!
[17:55:24] Cron /usr/share/varnish/zerofetch.py -s "http://zero.wikimedia.beta.wmflabs.org" -a /etc/zerofetcher/zerofetcher.auth -d /var/netmapper
[17:55:29] Exception: Bad response code 401 from API request for zeroportal
[17:55:31] YuviPanda: oh great
[17:55:39] dr0ptp4kt: I can forward you the entire traceback if you want
[17:56:21] dr0ptp4kt: running about every 5 minutes :)
[17:56:23] YuviPanda: mind sending that to bblack and yurik as well? i think that might be brandon's script, actually
[17:57:04] YuviPanda: thanks for spotting and reporting
[17:57:10] dr0ptp4kt: :D
[17:57:37] dr0ptp4kt: yw! :)
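(For reference, the stash/rebase/restore flow bd808 describes for git-sync-upstream above could look roughly like the sketch below. This is a hedged approximation, not the actual script: the checkout path, branch name, and log file are assumptions, and the reporting gap noted in the conversation is left as a bare log line.)

    #!/bin/bash
    # Rough sketch of an auto-rebase updater for a self-hosted puppetmaster:
    # local hacks live as commits that get rebased onto origin/production,
    # and uncommitted edits are stashed first and restored afterwards.
    REPO=/var/lib/git/operations/puppet     # assumed checkout location
    LOG=/var/log/puppet-sync.log            # assumed log destination

    cd "$REPO" || exit 1
    git fetch origin || exit 1

    # Park uncommitted local changes so the rebase runs on a clean tree.
    stashed=no
    if ! git diff --quiet || ! git diff --cached --quiet; then
        git stash || exit 1
        stashed=yes
    fi

    # Rebase the local hack commits onto production; back everything out on conflict.
    if ! git rebase origin/production; then
        git rebase --abort
        if [ "$stashed" = yes ]; then git stash pop; fi
        echo "$(date): rebase conflicted, tree restored" >> "$LOG"
        exit 1
    fi

    # Restore the uncommitted changes; if the pop conflicts, reset the tree and
    # leave them parked in the stash rather than half-applied.
    if [ "$stashed" = yes ] && ! git stash pop; then
        git reset --hard HEAD
        echo "$(date): stash pop conflicted, changes left in stash" >> "$LOG"
        exit 1
    fi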
[17:59:50] dr0ptp4kt: I've emailed 'em both, included you as well
[18:00:03] dr0ptp4kt: I'm clearing out cronspam that most people seem to filter, surprising how many errors we have there :)
[18:06:21] petan: huggle instance has a cronjob that's erroring
[18:06:25] petan: Cron cd /home/wl/wl && git pull
[18:06:32] error: Your local changes to the following files would be overwritten by merge:
[18:06:33] src/whitelist.php
[18:06:33] Please, commit your changes or stash them before you can merge.
[18:10:14] aw
[18:12:53] ROFLXD
[18:12:54] * Copyright (c) 2014, Huggle Corporation
[18:12:56] * All rights reserved.
[18:14:33] lolwut
[18:14:53] it's some kind of a joke
[18:15:04] it's right under GNU GPL license :P
[18:16:53] :)
[18:19:59] !log mediawiki-core-team Updated sul-test with labs-vagrant provision; labs-vagrant git-update. Deleted prior hhbc cache that had filled up /var/run
[18:20:01] Logged the message, Master
[18:20:24] bd808: It's surprising how many crons fail and nobody seems to notice :)
[18:20:44] Because they all go to an alias that nobody reads :)
[18:20:54] indeed :)
[18:21:06] About 60% of cronspam is actionable, someone was happy I told them
[18:21:39] I only knew about that php one because some host I setup in labs was sending root email to me.
[18:21:56] bd808: heh :)
[18:22:32] It was apparently ieg-dev. Something I did there when messing with exim rules
[18:23:12] hehe
[18:32:24] bd808: using your labs powers, can you restart android-build instance in the mobile project for me?
[18:32:37] bd808: wikitech is in the 'logout/login' stage again, and my phone isn't behaving.
[18:32:48] YuviPanda: I can try
[18:33:36] YuviPanda: Just need a reboot?
[18:34:00] bd808: yeah
[18:35:39] gawd I hate the wikitech interface for this stuff
[18:35:45] andrewbogott: is there a way to see 'instances i created'?
[18:35:46] for aude
[18:35:48] horrible error messages
[18:35:52] bd808: yeah, HORIZON, HORIZON!
[18:36:19] YuviPanda: you can, from the commandline...
[18:36:31] or looking at https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000004b6.eqiad.wmflabs
[18:36:36] …maybe
[18:36:38] i am 99% sure i made it
[18:36:48] aude: if you're just wondering about that one instance, I can check
[18:36:54] yeah
[18:37:06] it was probably to have labs vagrant and hhvm
[18:37:14] * aude just uses vagrant now
[18:37:28] and can easily setup a new labs vagrant if/when we need
[18:37:30] yep, you created it
[18:37:32] ok
[18:37:35] * aude deletes
[18:37:44] thanks
[18:38:16] YuviPanda: wikitech hates me. I added myself as an admin in the mobile project, logged out, logged back in, but it keeps saying "You cannot complete the action requested as your user account is not in the project mobile."
[18:39:20] bd808: :(
[18:39:39] Strangely Special:NovaResources says I'm in the admin group for that project
[18:39:49] * bd808 blames Obama
[18:42:50] bd808: are you 'BryanDavis' on wikitech?
[18:43:01] andrewbogott: Yeah
[18:43:18] I just removed myself from there
[18:43:39] try now?
[18:44:39] andrewbogott: Nope. logout, login and still "You cannot complete the action requested as your user account is not in the project mobile."
[18:44:59] andrewbogott: Can you try https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=reboot&project=mobile&instanceid=f630a15b-3420-43f3-9ba7-8730ffcda445&region=eqiad for YuviPanda?
[18:45:27] Can you tell me a bit more about what you're trying to accomplish?
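(A hedged aside on the failing huggle cron above: an unattended "cd /home/wl/wl && git pull" will keep erroring, and keep mailing root, whenever the checkout has local edits such as src/whitelist.php. One minimal way to make that kind of cron quiet and safe is a small wrapper that skips the pull when the tree is dirty; the path below is simply the one from the failing job, and the wrapper itself is an illustrative suggestion, not what petan ended up doing.)

    #!/bin/bash
    # Wrapper a crontab entry could call instead of a bare "git pull":
    # only pull when there are no local modifications, and stay silent otherwise.
    cd /home/wl/wl || exit 0
    if git diff --quiet && git diff --cached --quiet; then
        git pull --ff-only >/dev/null 2>&1
    fi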
[18:46:47] YuviPanda wanted that host rebooted and couldn't log into wikitech because of 2fa issues
[18:46:55] yeah...
[18:47:03] well, 'issues' as in my phone is dead...
[18:47:08] What project and instance?
[18:47:25] android-build in the mobile project
[18:47:26] andrewbogott: 'mobile' and 'android-build'
[18:47:32] andrewbogott: also, how do you find out who created an instance?
[18:47:44] view history?
[18:47:53] hmm?
[18:47:58] there's history for instances? :)
[18:48:38] YuviPanda: via the commandline. https://wikitech.wikimedia.org/wiki/OpenStack#novaenv.sh and 'nova show '
[18:48:47] aaah, cool!
[18:48:48] thanks!
[18:48:55] I rebooted that instance btw
[18:49:14] Yeah, history seems to track the puppet config changes but not instance creation
[18:49:14] andrewbogott: thanks! :)
[18:49:33] * bd808 slinks back to mw-v puppet hacking
[18:49:41] :D
[19:00:41] petan: can you make that cronjob redirect output to /dev/null?
[19:14:19] andrewbogott: hmm, cronspam from virt1000
[19:14:21] Cron [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete
[19:14:25] PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/fss.ini on line 1 in Unknown on line 0
[19:14:56] andrewbogott: hmm, fss.ini etc seems to be a common thing
[19:15:06] Yeah, that sounds like something that every wiki must be throwing constantly
[19:15:09] Reedy: ^ do you know what fss is? Is that fast string search? I remember something about us building it...
[19:15:17] it is!
[19:15:18] and we do
[19:15:39] This is a PHP extension for fast string search and replace. It is used by
[19:15:39] LanguageConverter.php. It supports multiple search terms. We use it as a
[19:15:39] replacement for PHP's strtr, which is extremely slow in certain cases.
[19:15:39] Chinese script conversion is one of those cases. This extension uses a
[19:15:39] Commentz-Walter style algorithm for multiple search terms, or a Boyer-Moore
[19:15:40] algorithm for single search terms.
[19:15:44] I see
[19:15:52] Reedy: do you know what's up with that error?
[19:15:57] and also what's up with that cron?
[19:16:28] Which error?
[19:16:31] <^d> Error is obvious.
[19:16:35] The comment?
[19:16:36] <^d> Can't use # as comments in php.ini files
[19:16:40] <^d> Have to use ;
[19:16:54] It's already fixed in git it seems
[19:17:08] https://github.com/wikimedia/mediawiki-php-FastStringSearch/blob/master/debian/fss.ini
[19:17:15] ^d: yeah, but it's using ;
[19:17:28] <^d> Obvs. not ;-)
[19:17:35] ^d: Reedy and last touched on May
[19:17:49] Which means it's not been rebuilt I guess
[19:17:59] Hmm
[19:17:59] -- root Mon, 28 Jul 2014 14:55:39 +0000
[19:18:06] YuviPanda: What version is on virt1000?
[19:18:08] <^d> Nope. I've seen it a couple of places.
[19:18:16] <^d> I think it was on the integration slaves too
[19:18:17] 1.1-3
[19:18:24] I'm not sure, want me to check?
[19:21:55] Well, I can't ;)
[20:56:49] (PS1) Yuvipanda: Just use system python [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/171900
[20:57:07] (CR) Yuvipanda: [C: 2 V: 2] Just use system python [labs/tools/wikipedia-android-builds] - https://gerrit.wikimedia.org/r/171900 (owner: Yuvipanda)
[20:58:14] Coren: so… now puppet is installing that package on every other run, and removing all of its files in between. It's pretty great.
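(On the fss.ini warnings discussed above: the "Comments starting with '#' are deprecated" message is PHP complaining about the comment marker in the packaged ini file, and the fix already in the FastStringSearch repo is simply using ';'. Until a rebuilt php5-fss package ships, a local stopgap on an affected host could look like this sketch; commands are generic, not what was actually run.)

    # Check which php5-fss is installed (1.1-3 was reported on virt1000 above)
    # and whether its ini file still uses '#' comments.
    dpkg -l php5-fss | tail -n 1
    grep -n '^#' /etc/php5/cli/conf.d/fss.ini

    # Stopgap until the package is rebuilt: switch the comment marker to ';',
    # which silences the "PHP Deprecated" cronspam from the maxlifetime cron.
    sudo sed -i 's/^#/;/' /etc/php5/cli/conf.d/fss.ini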
[20:58:23] I'm going to just switch to Trusty and see if the package is any less broken there.
[20:58:43] Maybe it is. I thought of a possible (ugly) workaround if that doesn't help.
[20:59:48] I got it to settle down and tolerate changes to the config files. But now I'm trying to modify the branding .png (as indicated in the documentation) and apparently when I change that it panics, removes everything from its 'static' directory, and then bails out.
[21:00:13] gj, package
[21:00:26] Yeah, that package is done all wrong.
[21:00:37] It'd probably make _joe_ cry. :-)
[21:00:38] yep!
[21:13:53] !ping
[21:13:53] !pong
[21:44:41] !log mathoid ran vagrant on mathoid2 to reduce cronspam
[21:44:42] mathoid is not a valid project.
[21:49:00] Hello, I can't figure how to kill a jsub process, who can tell me the command ?
[22:00:53] (PS1) Yuvipanda: Add my labs root key [labs/private] - https://gerrit.wikimedia.org/r/171965
[22:01:01] mutante: ^ wanna +1/+2?
[22:01:07] I guess this doesn't need puppet merge
[22:02:00] hmm, I guess this is trivial enough
[22:02:04] (CR) Yuvipanda: [C: 2 V: 2] Add my labs root key [labs/private] - https://gerrit.wikimedia.org/r/171965 (owner: Yuvipanda)
[22:02:05] where do i check the key?
[22:02:11] how do i know it's your labs key
[22:02:27] it's not the prod key, right
[22:02:51] YuviPanda: This SMW query against wikitech will tell you all the hosts that are using the labs-vagrant role -- http://tinyurl.com/labs-vagrant
[22:03:19] bd808: heh, I've also all the cron stuff coming in
[22:03:25] YuviPanda: no, it doesn't need puppet merge in labs/private
[22:03:32] mutante: indeed, only way to know is that I uploaded it to gerrit via ssh via my other key :)
[22:03:53] It might be possible to fix many/most of them with a well crafted salt command.
[22:04:11] right. I haven't touched salt at all...
[22:04:54] The fix is to copy /vagrant/puppet/modules/php/files/sessionclean to /usr/lib/php5/sessionclean
[22:05:47] salt '*' cmd.run 'cp /vagrant/puppet/modules/php/files/sessionclean /usr/lib/php5/sessionclean' would be the caveman command
[22:06:15] It will work or fail everywhere and failures don't matter
[22:06:50] hmm, we *should* have ways of running salt commands based on roles applied
[22:08:47] salt '*' grains.ls -- will tell you what grains exist
[22:09:25] Coren: ok, you better tell me what your idea for a workaround is… same oscillating behavior on Trusty
[22:10:08] bd808: hmm, I keep getting 'salt not found'.
[22:10:16] bd808: I've a feeling we've no salt setup on labs in general...
[22:10:19] outside of deployment-prep
[22:10:37] It's there somewhere. There is a default labs wide salt server
[22:10:56] dunno where though because I don't have the secret decoder ring :)
[22:11:03] hmm
[22:11:25] But there don't seem to be grains for puppet classes on the beta hosts at least
[22:12:00] yeah, but we can write some!
[22:12:02] I think, at least...
[22:12:29] /etc/salt/minion seems to point to virt1000.wikimedia.org as the salt master
[22:12:48] and labcontrol2001.wikimedia.org
[22:14:32] YuviPanda: there's salt labs-wide, the master is on virt1000
[22:14:44] apergos has been tidying it up lately, should be in OK shape
[22:14:50] oooh, cool
[22:14:53] except it uses ids rather than instance names, which is kind of a drag
[22:15:05] it sucks. :)
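(Pulling together the salt thread above: bd808's "caveman" command runs everywhere from the labs-wide master, and grain-based targeting is one hedged way to get the "only on hosts with role X" behaviour wished for here. The grain name and host list below are illustrative assumptions, not an existing convention on the labs master.)

    # See which grains minions already expose (as suggested above).
    salt '*' grains.ls

    # Tag the labs-vagrant hosts with a custom grain; the host list could come
    # from the SMW query linked above. Hostnames here are made up for illustration.
    salt -L 'sul-test.eqiad.wmflabs,some-other-host.eqiad.wmflabs' grains.setval roles labs-vagrant

    # Run the sessionclean fix only where that grain is set, instead of on '*'.
    salt -G 'roles:labs-vagrant' cmd.run \
        'cp /vagrant/puppet/modules/php/files/sessionclean /usr/lib/php5/sessionclean'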
[22:15:11] also that
[22:15:14] SOmebody should fix that
[22:15:19] But I'm sure Yuvi can fix it
[22:15:42] :D
[22:16:08] I'm going to file a bug
[22:17:12] When you change it you'll have to sign new keys for all hosts and remove the old ones
[22:17:16] bd808: andrewbogott https://phabricator.wikimedia.org/T1154
[22:17:34] I'm not touching salt today, I think :) am cleaning up all the other non labs-vagrant things first...
[22:17:41] Ideally we'll have cron we can actually unfilter
[22:17:47] woot
[22:18:09] useful error notifications are ... useful
[22:18:22] yeah
[22:29:27] YuviPanda: there's a script over on virt1000 which can tell you: which nova instances are not deleted but not responsive to test.ping, which instances are not deleted but there's no salt key; which salt keys are still lying around for deleted instances and should be tossed (it can also toss them for you)
[22:30:00] I have just checked it, deleted a pile of keys, and there are 5 non respnsive instances right now that are 'active', 4 of which have issues, the last one I don'tkknow about, it's new since the last time I checked
[22:30:02] the end
[22:30:43] remember that deployment-prep has its own salt master
[22:32:10] ok! :)
[22:32:30] dataset fixes have been updated, btw
[22:32:31] https://gerrit.wikimedia.org/r/#/c/170492/
[22:32:33] details, well at least a mention of the script is on the salt installation page on wikitech
[22:32:36] because i see both you :)
[22:32:39] thanks mutante
[22:33:25] did we ever find out of the linter really really has to have that horrible wrapping?
[22:33:29] *if the
[22:34:38] do you mean the one from line 59 in https://gerrit.wikimedia.org/r/#/c/170492/2/modules/dataset/manifests/cron/pagecountsraw.pp ?
[22:35:04] i'm not sure, i agree it's not super nice to look at
[22:35:05] and the one in nfs.pp too
[22:35:28] we should fix the linter...
[22:35:37] it should probably be fixed a different way, true
[22:35:39] should be 2 lines
[22:36:26] I can comment on the patchset again about the linter but since I already said my bit, mybe we can just have
[22:36:42] john l look at it or something
[22:36:58] ok
[22:37:10] what is he on irc?
[22:40:23] JohnFLewis
[22:48:32] !log Wikidata-build upgraded php5-fss on -jenkins1,2,3 to prevent cronspam
[22:48:33] is not a valid project.
[22:48:38] !log Wikidata-build upgraded php5-fss on -jenkins1,2,3 to prevent cronspam
[22:48:39] Wikidata-build is not a valid project.
[22:48:44] !log Wikidata-builds upgraded php5-fss on -jenkins1,2,3 to prevent cronspam
[22:48:44] Wikidata-builds is not a valid project.
[22:48:59] !log wikidata-build upgraded php5-fss on -jenkins1,2,3 to prevent cronspam
[22:49:02] Logged the message, Master
[23:03:17] !log huggle silenced cron for git pull on huggle
[23:03:21] Logged the message, Master
[23:03:24] petan: ^
[23:07:55] Wikimedia Labs / wikistats: WikiStats update cronjob failing - https://bugzilla.wikimedia.org/73146 (Daniel Zahn)
[23:08:55] Wikimedia Labs / wikistats: WikiStats update cronjob failing - https://bugzilla.wikimedia.org/73146#c1 (Daniel Zahn) @wikistats-live should not be the actual live instance anymore. it should now be @wikistats-petcow. i'll check what's up with the old instance. (also moved out of Analytics component, in...
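(Finally, a hedged sketch of the kind of reconciliation apergos describes the virt1000 script doing. The real script is not shown in this log, so the commands below are generic approximations run from a host with salt-master and nova admin credentials; the key name at the end is a placeholder.)

    # Accepted minion keys the salt master currently knows about.
    salt-key -L

    # Minions that actually answer right now; 'active' instances missing from
    # this output are the unresponsive ones mentioned above.
    salt '*' test.ping -t 10

    # Instances nova still considers active (needs admin credentials, e.g.
    # via the novaenv.sh page linked earlier).
    nova list --all-tenants | awk '$6 == "ACTIVE" {print $4}'

    # Keys left over for instances that no longer exist can then be removed.
    salt-key -d <stale-key-id>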