[00:41:09] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-01.diskspace._var.byte_avail.value (33.33%)
[03:55:32] PROBLEM - ToolLabs: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-01.puppetagent.failed_events.value (33.33%) tools.tools-exec-14.puppetagent.failed_events.value (33.33%) tools.tools-exec-07.puppetagent.failed_events.value (22.22%) tools.tools-exec-12.puppetagent.failed_events.value (33.33%) tools.tools-dev.puppetagent.failed_events.value (22.22%) tools.tools-exec-08.puppetagent.failed_events.value (22.22%)
[07:00:24] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[07:02:54] RECOVERY - ToolLabs: Puppet failure events on labmon1001 is OK: OK: All targets OK
[09:01:25] YuviPanda: webserver needs restart? tools.wmflabs.org/imagemapedit/ime.js
[09:25:35] Hello, the webservice of my tool "fengtools" was shut down abruptly earlier: 2014-11-12 09:19:55: (server.c.1512) server stopped by UID = 0 PID =
[09:25:53] How can I know who did it, and why?
[09:32:04] Coren: ping?
[09:32:54] It hosts the Reflinks rewrite, which many people rely on.
[10:38:07] Zhaofeng_Li: is it the same for http://tools.wmflabs.org/imagemapedit/ime.js ? please report a bug and mail labs-l
[10:43:26] Nemo_bis: Not the same error, though I've also experienced it recently.
[10:44:36] Probably the server has run out of sockets.
[10:45:30] Will file a bug later.
[10:50:04] Hello,
[10:50:26] has something changed on toollabs ? I can't access /shared/pywikibot/core anymore
[10:59:55] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no longer existing wikis - https://bugzilla.wikimedia.org/63396#c9 (Antoine "hashar" Musso (WMF)) From a mail Bryan Davis sent me a while ago: When folks reported not being able to log in to deployment.beta and Chad a...
[13:06:46] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-04.diskspace._var.byte_avail.value (11.11%) WARN: tools.tools-webgrid-01.diskspace._var.byte_avail.value (100.00%)
[14:14:43] Wikimedia Labs / deployment-prep (beta): Wrong ownership/permissions for /data/project/upload7/wikipedia/commons and many subfolders - https://bugzilla.wikimedia.org/73309 (Gilles Dubuc) NEW p:Unprio s:normal a:None /data/project/upload7/wikipedia/commons and many of its subfolders are owne...
[14:14:57] Wikimedia Labs / deployment-prep (beta): Wrong ownership/permissions for /data/project/upload7/wikipedia/commons and many subfolders - https://bugzilla.wikimedia.org/73309 (Gilles Dubuc)
[14:19:40] Wikimedia Labs / deployment-prep (beta): Wrong ownership/permissions for /data/project/upload7/wikipedia/commons and many subfolders - https://bugzilla.wikimedia.org/73309#c1 (Antoine "hashar" Musso (WMF)) NEW>RESO/DUP The bug has been reported as bug 73206. We need an apache:apache user:group to...
[14:19:41] Wikimedia Labs / deployment-prep (beta): File upload area resorts to 0777 permissions to for uploaded conent - https://bugzilla.wikimedia.org/73206#c3 (Antoine "hashar" Musso (WMF)) *** Bug 73309 has been marked as a duplicate of this bug. ***
[14:20:40] Wikimedia Labs / deployment-prep (beta): File upload area resorts to 0777 permissions to for uploaded conent - https://bugzilla.wikimedia.org/73206#c4 (Antoine "hashar" Musso (WMF)) Per duplicate bug 73309, this blocks Bug 73229 - beta labs: "error while storing the file in the stash."
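For anyone following up on the upload7 ownership bug relayed above: a minimal sketch of how the ownership hashar mentions could be inspected (and, if appropriate, corrected) on the beta NFS share. The paths come from bug 73309; treating apache:apache as the desired owner follows hashar's comment, and the fix that actually landed is not shown in this log.

```bash
# Inspect current ownership and mode of the upload tree (path from bug 73309).
stat -c '%U:%G %a %n' /data/project/upload7/wikipedia/commons

# List subdirectories not owned by the apache user.
find /data/project/upload7/wikipedia/commons -type d ! -user apache -printf '%u:%g %p\n'

# One possible correction, per hashar's comment (run as root; assumes apache:apache
# really is the intended owner -- that decision is not confirmed in this log).
chown -R apache:apache /data/project/upload7/wikipedia/commons
```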
[14:35:58] Zhaofeng_Li: Almost certainly killed by the grid for running out of memory. You may have a leak.
[15:21:39] https://tools.wmflabs.org/tools-info/optimizer.php seems to be non-functional. :(
[18:00:31] Nemo_bis: hey! still need that restart?
[18:02:31] Nemo_bis: restarted anyway
[18:04:23] thanks
[18:04:57] looks better
[18:14:10] !ping
[18:14:10] !pong
[18:16:44] !log deployment-prep cherry picking https://gerrit.wikimedia.org/r/#/c/172776/ on labs puppetmaster to see if it fixes issues in the cache machines
[18:16:47] Logged the message, Master
[18:45:16] tonythomas: Are you around today? We could work on your puppet woes if you are
[18:45:36] andrewbogott: I am :)
[18:45:38] let me log in
[18:47:36] tonythomas: When I tried to update your puppet checkout I ran into conflicts in mariadb. Does that instance have custom changes in the mariadb module?
[18:48:02] I remember adding the bounce_records table
[18:48:08] but nothing more than that
[18:48:13] we can stash ?
[18:48:32] yes -- I stashed and rebased and now I cannot apply the stash
[18:48:35] due to conflicts.
[18:48:38] Maybe you want to have a go?
[18:49:08] andrewbogott: yup
[18:49:10] checking
[18:50:18] I gave
[18:50:19] git checkout production .
[18:50:23] git reset --hard origin/production
[18:50:29] git pull --rebase
[18:50:45] now try 'git stash apply'
[18:50:46] !ping
[18:50:46] !pong
[18:50:46] goddamit
[18:50:51] because your changes are already in the stash
[18:51:02] oh no
[18:51:07] I ran sudo puppet agent -tv
[18:51:24] ah. same error - let me try git stash apply
[18:51:36] CONFLICT (submodule): Merge conflict in modules/mariadb
[18:52:37] yep, that's where I got as well.
[18:52:59] git status in mariadb/ gives HEAD detached at 02cb5f6
[18:53:04] is it a submodule ?
[18:53:09] I think it is, yes
[18:53:19] I'm not sure how 'stash apply' handles that...
[18:53:26] you might try a submodule update first
[18:53:32] although I think that won't matter
[18:55:29] I still get a detached head :\
[18:55:50] If I could only work out which branch this mariadb/ tracks
[18:55:52] I'm not sure that's necessarily wrong...
[18:56:00] it's master
[18:56:16] well, let me try...
[18:56:30] On branch master
[18:56:34] :)
[18:57:01] ok, is that any better?
[18:57:07] let me run again
[18:57:28] Oh, I meant, stash-wise
[18:57:53] I don't think we're to the point of looking at the actual puppet failure yet… want to get the git checkout in a stable, up-to-date state first
[18:58:15] true. let me checkout master in mariadb/
[18:59:11] modules/mariadb: needs merge
[19:00:15] probably the thing to do is print out the contents of the stash and then reapply by hand.
[19:00:23] Lemme see if I can figure out how to print...
[19:00:29] or reflog
[19:00:31] okey
[19:00:32] use the reflog
[19:00:48] will be simpler...
[19:00:54] want me to give it a shot?
[19:01:02] YuviPanda: sure...
[19:01:05] YuviPanda: of course :)
[19:01:07] which host?
[19:01:12] mediawiki-verp
[19:06:01] btw, tonythomas, the main lesson here is that you should always commit local changes into a local patch rather than just leaving files edited.
[19:06:20] Git is happier dealing with patches than local changes, and it also requires you to document what you're doing for future visitors.
[19:06:53] andrewbogott: tonythomas ok, I give up too... The stash itself has the changes to the submodules, and that's just complicated....
[19:06:55] andrewbogott: true.
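A minimal sketch of the workflow andrewbogott recommends above: commit local hacks to the instance's puppet checkout as a local patch instead of leaving files edited. The checkout path appears later in this log; the file name and commit message are illustrative.

```bash
cd /var/lib/git/operations/puppet

# Record the local hack as a commit, so it is documented and so later updates either
# carry it forward or conflict on it explicitly, rather than relying on a stash.
git add templates/mail/exim4.minimal.erb
git commit -m "Local hack: point wikimail smarthost at the test instance (do not push)"

# Subsequent updates then become an ordinary rebase of local commits on top of upstream.
git fetch origin
git rebase origin/production
```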
I will do that
[19:07:09] another lesson is always run 'git submodule update' if 'git status' shows any submodules there
[19:07:13] YuviPanda: do you know how to get the stash to enumerate the changes?
[19:07:13] otherwise you're in for a bad time...
[19:07:30] andrewbogott: git stash show
[19:07:33] YuviPanda: actually, 'git submodule update' silently destroys any local changes to a submodule
[19:07:37] including if there are commits.
[19:07:40] It's pretty dangerous
[19:07:47] 'show' only shows what files are changed, not what the changes are
[19:08:16] there's a param,
[19:08:17] let me find
[19:08:21] thx
[19:09:16] andrewbogott: --patch
[19:09:26] andrewbogott: git stash show stash@{1} --patch
[19:09:54] ah, so the stash is very tiny
[19:10:04] and doesn't actually show any changes /in/ mariadb… hope I didn't clobber them.
[19:10:05] yeah
[19:10:07] just one line diff
[19:10:14] Anyway, tonythomas, want to apply that change by hand and commit it?
[19:10:15] andrewbogott: git stash list also lists all the stashes.
[19:10:21] Then we can look at the actual problem :)
[19:10:40] andrewbogott: yeah. let me take a look
[19:11:00] * tonythomas never remembers opening modules/* though !!
[19:11:30] andrewbogott: also, git stash --untracked is both super awesome and super painful at the same time :)
[19:14:59] I don't know what's happening - but suddenly SSH has become too slow :\
[19:36:17] !ping
[19:36:17] !pong
[19:44:33] tonythomas: was that a passing thing or is the instance still hard to access?
[19:45:10] andrewbogott: It's alright now - but the repo is still muddled up
[19:45:22] ok… what's left to do, repowise?
[19:45:26] the git stash show --patch output: https://dpaste.de/icei/raw
[19:45:57] !ping
[19:45:57] !pong
[19:46:03] Yeah, looks like just that one line in templates/mail/exim4.minimal.erb is the only thing we need to care about.
[19:46:12] So, make that change (if you think you still need it) and commit...
[19:46:16] and then we'll figure out why puppet doesn't run
[19:46:48] andrewbogott: making the change
[19:48:03] andrewbogott: strange - but the value in exim4.minimal.erb is route_list = * <%= @wikimail_smarthost.join(':') %>
[19:48:11] that's what we want, right ?
[19:48:27] I don't really know anything about the context of that instance.
[19:48:34] Only that you (or someone) changed it at some point
[19:54:29] andrewbogott: I changed it to 10.68.17.78 earlier - but right now it shows route_list = * <%= @wikimail_smarthost.join(':') %> and git still complains
[19:54:49] 'git still complains'?
[19:55:28] andrewbogott: yeah - according to the original operations/puppet config - route_list = * <%= @wikimail_smarthost.join(':') %>
[19:55:32] right ?
[19:55:45] right...
[19:56:07] and git stash still shows that
[19:56:13] oh, yeah
[19:56:28] well, we haven't done anything with the stash. It just contains that patch, and will continue to.
[19:56:43] So don't worry about the stash, really -- just go ahead and get that puppet repo in the state you want it.
[19:57:21] okey :)
[19:58:09] root@mediawiki-verp:/var/lib/git/operations/puppet# git pull --rebase
[19:58:09] Current branch production is up to date.
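For reference, a sketch of the stash-inspection commands used above, plus one way to reapply a single file's change by hand when `git stash apply` trips over a submodule conflict. The stash index stash@{0} is illustrative; on the instance the relevant entry may differ.

```bash
cd /var/lib/git/operations/puppet

# Enumerate stashes and show the full diff of one of them (the commands used above).
git stash list
git stash show --patch stash@{0}

# If 'git stash apply' conflicts on a submodule, one workaround is to take just the
# file you care about from the stash and commit that change by hand:
git diff stash@{0} -- templates/mail/exim4.minimal.erb        # view the stashed change
git checkout stash@{0} -- templates/mail/exim4.minimal.erb    # restore that one file from the stash
git commit -am "Reapply local exim smarthost change from stash"
```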
[19:59:06] still - I get this error :\ andrewbogott : it looks like I'm getting a bit late -
[19:59:18] I will try a bit more tomorrow - when I'm more awake
[19:59:30] or in the worst case - kill this machine and make a new one
[19:59:43] Wikimedia Labs: Warning: session_write_close(): write failed: No space left on device (28) in /data/project/magnustools/public_html/php/oauth.php - https://bugzilla.wikimedia.org/73324 (Quiddity) NEW p:Unprio s:normal a:None Created attachment 17102 --> https://bugzilla.wikimedia.org/attachm...
[19:59:44] "Current branch production is up to date." that's not an error, that just means that things are up to date
[20:00:12] andrewbogott: ha :) I was pointing out exactly that one
[20:00:26] strange - as sudo puppet agent -tv still fails
[20:00:55] That probably relates to the node definition
[20:00:59] which is set on wikitech...
[20:01:12] oh true.
[20:01:36] might be that role::mediawiki-install::labs is broken. Let me check on another instance...
[20:01:48] mediawiki-install is just singlenode, isn't it?
[20:04:20] yep
[20:09:45] Um, tonythomas, including webserver::php5-mysql as well as role::mediawiki-install::labs seems wrong. The latter already installs a web server &c.
[20:09:58] It's not obvious to me why that would produce the error you're seeing, but I'd expect it to produce /some/ errors
[20:10:24] andrewbogott: my mistake ! let me remove that and run
[20:10:47] it would need manual removal right ?
[20:10:53] as I am on the puppetmaster !
[20:11:07] just uncheck the box on wikitech
[20:12:10] removed. running again
[20:13:03] same :( looks like it requires a bit more work. will ping ya tomorrow if I hit something
[20:13:10] got to run ! thanks andrewbogott YuviPanda !
[20:13:17] cool
[20:13:37] :)
[20:21:50] andrewbogott: hmm, nuria__ is in the admin group for deployment-prep, but can't sudo. do you have time to check?
[20:21:53] * YuviPanda still has 2FA issues.
[20:22:07] Yep, I'll look.
[20:22:14] YuviPanda: 2fa issues = your phone is broken?
[20:22:20] andrewbogott: yup.
[20:22:21] 'cause I can reset your creds if that helps
[20:22:24] I've backup codes
[20:22:39] so it's fine, but considering the amount of log out / in we have to do...
[20:22:46] andrewbogott: I don't have a new phone yet, should get it tomorrow
[20:23:18] YuviPanda: what's an example of a host that nuria__ can't sudo on?
[20:23:30] deployment-eventlogging02
[20:23:37] 'k
[20:26:53] thanks andrewbogott
[20:27:20] nuria__: you can log in, right? Just not sudo?
[20:27:27] andrewbogott: correct
[20:29:15] hmm, I need to fuck up a host now to make it generate an alert...
[20:29:18] * YuviPanda cackles
[20:29:46] nuria__: what is your wikitech login?
[20:29:53] andrewbogott: nuria
[20:30:44] nuria__: try now?
[20:31:06] !log tools disabling puppet on tools-exec-07 to test shinken
[20:31:09] Logged the message, Master
[20:31:21] andrewbogott: btw, puppet hasn't run on that host for a month due to https://bugzilla.wikimedia.org/show_bug.cgi?id=73263
[20:31:25] andrewbogott, YuviPanda \o/ sudo working now
[20:31:29] thank you
[20:32:01] YuviPanda: so, the problem with deployment-prep is that for some reason sudo is governed by the 'under_NDA' sudo group. I added nuria to that group.
[20:32:08] aaah
[20:32:12] yes, I suspected something like that...
[20:32:19] it was discussed a while ago, I think.
[20:32:22] privacy policy, etc...
[20:32:34] (with lots of disagreements on current policy, but no outcomes)
[20:33:10] Wikimedia Labs / deployment-prep (beta): Puppet failure on deployment-cache-bits01 - https://bugzilla.wikimedia.org/73263#c5 (Yuvi Panda) Whelp, that patch only addressed the tip of the iceberg. Our varnish code deeply entangles with itself our ganglia code, and labs has no gmond on each instance. This...
[20:53:57] Wikimedia Labs / deployment-prep (beta): Puppet failure on deployment-cache-bits01 - https://bugzilla.wikimedia.org/73263 (Greg Grossmeier) p:Unprio>High
[21:36:16] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-04.diskspace._var.byte_avail.value (100.00%) WARN: tools.tools-webgrid-01.diskspace._var.byte_avail.value (100.00%)
[21:42:02] bah
[21:42:06] df and du disagreeing again...
[21:42:06] wtf
[21:42:54] often that's because a file has been deleted from the pathname tree so du can't see it, but the inode is still allocated because an app still has an open filehandle on the deleted file (which will be deallocated-on-close)
[21:43:20] yup
[21:43:23] (that and also, some filesystems are slow to update free-space accounting after a large delete. apparently there's no real guarantee on that)
[21:43:24] but I can't figure out what that is...
[21:43:31] that's the cause of random alerts on tools-webproxy
[21:43:46] but on tools-webgrid-01, I deleted archived diamond logs and pacct files
[21:44:05] restarting diamond shouldn't make a difference (only archived logs were deleted) and it didn't...
[21:44:12] if it's NFS who knows. if it's XFS, I've actually seen this bug before with free-space accounting being very slow.
[21:44:23] oh
[21:44:24] ignore me
[21:44:37] /var/tmp has a 600MB footprint
[21:44:41] aha
[21:44:43] a coredump
[21:45:04] I was running du on /var/log while the partition is /var
[21:45:42] !log tools removed coredump from tools-webgrid-01 to reclaim space
[21:45:44] Logged the message, Master
[21:46:46] oh yeah didn't ori merge up a coredump thing today?
[21:47:06] he did?
[21:47:09] for lighty?
[21:47:36] in general, some kernel settings to drop cores in specific places with specific filenames and such, instead of random/who-knows
[21:47:46] !log tools removed coredumps from tools-webgrid-04 to reclaim space
[21:47:48] Logged the message, Master
[21:47:51] hmm, they should probably go somewhere not /var
[21:47:57] considering labs instances already have stupidly small /vars
[21:48:03] https://gerrit.wikimedia.org/r/#/c/171206/
[21:48:20] ^ this puts them in /var/tmp/core now, whereas before it was wherever it happened to be
[21:48:22] aaagh, bah, yes, that's what's causing these now.
[21:48:29] labs has tiny /var partitions
[21:57:36] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[21:57:44] that's better
[22:03:30] StupidPanda: so where were cores commonly going on labs before? / ?
[22:03:37] bblack: I'm not sure.
[22:03:50] I think they went to /tmp
[22:04:07] which, while bad, isn't *as* bad as /var, since it is at least 8G
[22:04:38] well maybe we can put some labs conditionals in ori's change, to use a different directory and maybe sweep them faster as well
[22:04:42] yeah
[22:04:54] maybe we can put them on NFS
[22:05:05] and then keep the sweep the same...
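For context on the core dump change being discussed (gerrit change 171206): where the kernel writes cores is controlled by the kernel.core_pattern sysctl. The pattern below is illustrative only; the exact pattern and sweep policy in ori's patch are not reproduced here.

```bash
# Show where the kernel currently writes core dumps.
sysctl kernel.core_pattern

# Illustrative setting that drops cores into a dedicated directory with descriptive
# names (%h hostname, %e executable, %p pid, %t unix time). The directory must exist,
# or cores are silently dropped.
sudo mkdir -p /var/tmp/core
sudo sysctl -w kernel.core_pattern='/var/tmp/core/core.%h.%e.%p.%t'

# On labs, with its small /var partitions, the same knob could point somewhere roomier
# (e.g. /tmp or NFS), as discussed above -- shown here only as an idea, not the agreed fix.
```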
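And for the df-versus-du disagreement bblack explains earlier in the hour (space held by files that were deleted but are still open), a quick way to confirm that theory on a host, assuming /var is its own mount point as it is on these instances:

```bash
# Compare what the filesystem reports with what du can actually see.
df -h /var
sudo du -xsh /var

# Unlinked-but-open files show up in lsof with a link count of 0 (often flagged
# '(deleted)'); their space is only returned once the owning process closes them.
sudo lsof +L1 /var
# or, more broadly:
sudo lsof -nP | grep '(deleted)'
```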
[22:07:49] !log tools enabled puppet on tools-exec-07
[22:07:51] Logged the message, Master
[22:09:30] bblack: created a phab task https://phabricator.wikimedia.org/T1259
[22:46:27] Wikimedia Labs / deployment-prep (beta): Puppet failure on deployment-cache-bits01 - https://bugzilla.wikimedia.org/73263#c6 (Greg Grossmeier) Yuvi: Mukunda can help out with this now that his time is starting to be less PHABRICATORPHABRICATORPHABRICATOR
[22:50:11] !Ping
[22:50:14] !ping
[22:50:14] !pong
[22:54:40] Wikimedia Labs / deployment-prep (beta): Puppet failure on deployment-cache-bits01 - https://bugzilla.wikimedia.org/73263#c7 (Yuvi Panda) Ah, indeed :) All help is welcome! :) _joe_ and alex have also offered to help, since this involves ganglia a fair bit and alex is working on fixing our ganglia cod...
[22:56:25] Wikimedia Labs / deployment-prep (beta): Puppet failure on deployment-cache-bits01 - https://bugzilla.wikimedia.org/73263#c8 (Mukunda Modell) I would think the ideal solution is to decouple the monitoring from the varnish module in puppet. But if it's heavily tangled then that might not be an easy task.