[00:05:21] thanks bd808 [00:05:33] andrewbogott: np. thanks for working on it [00:10:09] andrewbogott: bd808 so I need to create new instances to get this, right? [00:10:21] I'll need to delete and recreate the k8s fleet once this goes out [00:10:32] YuviPanda: or apply and reboot [00:10:38] andrewbogott: oh, just puppet? [00:10:48] yeah, it’s a puppet run and a reboot [00:10:55] a followup patch will add the setting to new images without a reboot [00:10:59] ah nice [00:11:12] andrewbogott: if we're going to build debian images again, I want to upgrade kernel too to 3.19 [00:11:42] ok [00:12:20] andrewbogott: want me to file a bug? [00:12:48] YuviPanda: sure [00:12:58] ok [01:24:36] 6Labs, 10Wikimedia-Labs-General, 6Developer-Relations: Community-maintained projects on Labs are hard to track - https://phabricator.wikimedia.org/T64837#1926643 (10Qgil) High priority but assigned to nobody? [01:40:54] 10Wikibugs: wikibugs is confused by closing tasks by email - https://phabricator.wikimedia.org/T123344#1926718 (10matmarex) 3NEW [01:44:19] 10PAWS: user's user-fixes.py not taken into account - https://phabricator.wikimedia.org/T121160#1926730 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This is done now. [04:45:06] 10Tool-Labs-tools-Erwin's-tools: Unknown MySQL error when opening Erwin's tools - https://phabricator.wikimedia.org/T123355#1926970 (10Natuur12) 3NEW [08:55:24] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Enable memory cgroups for default Jessie image - https://phabricator.wikimedia.org/T122734#1927161 (10faidon) Are you sure `cgroup_enable=memory` is needed nowadays? An `lxc-checkconfig` on an unmodified jessie system says "Cgroup memory controller: enabled".... [09:06:24] 10Tool-Labs-tools-Erwin's-tools: Unknown MySQL error when opening Erwin's tools - https://phabricator.wikimedia.org/T123355#1927166 (10Nemo_bis) Yesterday there were problems with SGE and the webservice was probably stopped incorrectly (cf. T109591). I did: `webservice stop; webservice start --release precise`... [09:06:32] 10Tool-Labs-tools-Erwin's-tools: Unknown MySQL error when opening Erwin's tools - https://phabricator.wikimedia.org/T123355#1927168 (10Nemo_bis) 5Open>3Resolved a:3Nemo_bis Yesterday there were problems with SGE and the webservice was probably stopped incorrectly (cf. T109591). I did: `webservice stop; we... [09:13:54] 6Labs, 10Tool-Labs: Remove overly-large log files - https://phabricator.wikimedia.org/T122508#1927172 (10Nemo_bis) You could just run `xz -9` on all of them so that nothing gets lost. LZMA reduces them by 3 orders of magnitude, see example with default 7za: ``` $ for file in *7z; do 7z l $file; done | grep err... [09:26:45] 6Labs, 10Tool-Labs: Prevent overly-large log files - https://phabricator.wikimedia.org/T122508#1927177 (10valhallasw) [09:31:51] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1927181 (10scfc) https://mwui.wmflabs.org/ is indeed served by the instance [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwui.editor-engagement.eqiad.wmflabs|mwui]], so you should contact one of the project admins listed... [09:32:52] 6Labs, 10Tool-Labs: Prevent overly-large log files - https://phabricator.wikimedia.org/T122508#1927182 (10valhallasw) Thanks, that's good to know. I misformulated this task -- the issue is not so much the space, but rather the NFS usage that is necessary to store so much (useless) data. If I remember correctly...
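A minimal sketch of the `xz -9`-style compression suggested above, using Python's stdlib `lzma` module instead of the `xz` binary; the directory path and size threshold are illustrative assumptions, and it deliberately ignores the reopen-after-rotation problem raised in the next comment.

```python
import lzma
import os
import shutil

LOG_DIR = '/data/project/sometool/logs'   # hypothetical tool log directory
THRESHOLD = 100 * 1024 * 1024             # only touch files bigger than ~100 MB

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if not name.endswith('.log') or os.path.getsize(path) < THRESHOLD:
        continue
    # preset=9 roughly matches `xz -9`: nothing is lost, only recompressed
    with open(path, 'rb') as src, lzma.open(path + '.xz', 'wb', preset=9) as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)  # the job writing this log must reopen it afterwards
```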
[09:40:16] 6Labs, 10Tool-Labs: Prevent overly-large log files - https://phabricator.wikimedia.org/T122508#1927186 (10valhallasw) [09:43:30] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1927190 (10Volker_E) [09:45:14] 6Labs, 10Tool-Labs: Prevent overly-large log files - https://phabricator.wikimedia.org/T122508#1927194 (10scfc) The problem with compressing those log files is that the jobs writing to them need to reliably reopen them after they are moved elsewhere, and that is not an easy task (otherwise we would be rotating... [09:46:13] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1927196 (10Volker_E) @kaldari @Mattflaschen @Halfak, could one of you put me on the admin list for the reasons outlined above. Or probably better, would you mind if this instance goes under Design? (In my naive thinking that sho... [10:30:48] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Alexrk2 was created, changed by Alexrk2 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Alexrk2 edit summary: Created page with "{{Tools Access Request |Justification=I want to bring this web site to wmflabs.org: http://alexrk4.appspot.com/oldmapsofberlin/index.html#map=17/52.517778/13.396944/466 Th..." [11:00:41] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1927295 (10Beetstra) @valhallasw - thank you for the lengthy explanation. This bot has now been running on labs for a long time (and sometimes for long uptimes without problems - it has at least on... [11:12:29] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1927301 (10Beetstra) @valhallasw: taking the number of parsers down from 10 to 8 resulted in formation of a backlog within 10 minutes. Trying 9 .. (the parsers are the processor intensive processes... [11:14:43] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1927303 (10scfc) I know that it isn't ready yet and it probably shouldn't be beta-tested with such a complex tool, but isn't the goal of the Kubernetes setup to provide better isolation/scheduling f... [11:15:09] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Alexrk2 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=254147 edit summary: [11:29:12] (03PS1) 10Giuseppe Lavagetto: hieradata: add conftool and etcd fake credentials [labs/private] - 10https://gerrit.wikimedia.org/r/263597 [11:29:31] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hieradata: add conftool and etcd fake credentials [labs/private] - 10https://gerrit.wikimedia.org/r/263597 (owner: 10Giuseppe Lavagetto) [13:53:41] 6Labs, 10Tool-Labs: tools.taxonbot cronjob not firing - https://phabricator.wikimedia.org/T123186#1927397 (10doctaxon) Here the next problem (tools.giftbot): TZ=Europe/Berlin 0 22,23 * * 0 [ $(date +\%-H) = 0 ] && jsub -once -j y -quiet -v LC_ALL=$LANG -mem 1g ausrufer.tcl Combined out and err file without... [13:54:23] 6Labs, 10Tool-Labs: tools.taxonbot and tools.giftbot cronjobs not firing - https://phabricator.wikimedia.org/T123186#1927398 (10doctaxon) [14:00:13] n aan.out [14:39:55] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1927448 (10Halfak) Moving an instance is non-trivial. 
You could always start up a new instance within the `design` project. Long term, it looks the editor-engagement project is a bit of a monster and should probably be brok... [14:46:53] I have an instance here that is working fine but locked everyone out of SSH (on its own?). puppet seems to have stopped running as well. is there a way to recover ssh access? [14:59:43] YuviPanda: ^ any idea? :) [15:15:42] again a webgrid node frozen tools-webgrid-lighttpd-1202 no ssh possible, webinterface freeze [15:20:32] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1927496 (10Phe) same trouble but on ssh tools-webgrid-lighttpd-1202, ssh and my tools running on it freeze [15:22:31] it does seems totally hung [15:22:45] but maybe not in an nfs kind of way, considering teh state I can reboot [15:24:25] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1927497 (10Jakob_WMDE) 3NEW [15:24:31] I take that back, totally seems like nfs but makes no sense and doesn't come back so yeah [15:27:29] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1927507 (10Tobi_WMDE_SW) [15:38:56] thanks chasemp [15:48:06] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1927544 (10Phe) working now, someone restarted it as far I can see. [16:52:57] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Enable memory cgroups for default Jessie image - https://phabricator.wikimedia.org/T122734#1927686 (10bd808) >>! In T122734#1927161, @faidon wrote: > Are you sure `cgroup_enable=memory` is needed nowadays? An `lxc-checkconfig` on an unmodified jessie system sa... [16:55:26] 6Labs, 10Phragile, 6TCB-Team: Unable to access Phragile WMFLabs instance - https://phabricator.wikimedia.org/T123369#1927693 (10Addshore) p:5Triage>3High [17:09:31] 6Labs, 10Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1927721 (10Luke081515) Thanks! [17:31:44] 6Labs: nfs-exports.service is failing on labstore1001 often - https://phabricator.wikimedia.org/T122250#1927781 (10chasemp) 5Open>3Resolved I don't see any of this in january at all now [18:17:26] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1927995 (10Krenair) [18:34:51] valhallasw`cloud: is the ‘tools beta’ project something you work on/know about? [18:36:20] (03PS55) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [18:38:50] andrewbogott: yes, tesing env for toollabs manifests [18:39:06] Dinner now, can explain more later [18:39:36] (03CR) 10Ricordisamoa: "PS55 adds a COPYING file and the boilerplate notice to all source files" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [18:39:51] (03CR) 10Ricordisamoa: "Apache 2.0" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [18:40:48] chasemp: ^ [18:41:17] interesting, will take a look [18:43:16] chasemp: oh, actually I was pointing you to valhalla’s response about tools beta. 
But yeah, that too :) [18:43:29] ha yeah I figured [18:44:19] (03PS56) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [18:46:47] (03CR) 10Ricordisamoa: "PS56 uses parentheses instead of backslashes to wrap long Python lines" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [18:57:09] chasemp: so the basic idea is to push changes to toolsbeta-puppetmaster3, run puppet agent -tv on the relevant hosts, fix the puppet change, retry [18:57:22] chasemp: it's a bit slow, though, so I only use it for larger changes [18:57:36] understood, I was wanting to poke at the master in a possibly invasive way so I thought [18:57:41] maybe a good way to do that [18:57:53] yes, toolsbeta would be the perfect place :-) [18:58:13] (I'm writing up incident docs on yesterdays SGE outage, so I'm afraid to touch anything SGE related at the moment ;-)) [18:58:36] ha gotcha [19:06:41] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1928246 (10ArielGlenn) So now the bikeshedding starts: if these pages go in a separate namespace, what do we call it? If there are only likely to... [19:11:28] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1928282 (10Dzahn) I agree that subpages and categories might be sufficient. How about https://wikitech.wikimedia.org/wiki/OIT or OfficeIT as an o... [19:12:49] YuviPanda: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160112-20160111-toollabs-SGE [19:12:55] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1928297 (10bd808) One nice feature of using actual namespaces is that it allows advanced search to explicitly include or exclude the content of a... [19:12:58] anything I forgot? (except for actually adding actionables) [19:13:12] (although I'm not sure of actionables...) [19:13:34] andrewbogott: is there any way to log which client is accessing a specific (set of) file(s)? [19:14:08] more specifically: I'm wondering whether tools-shadow is writing to the SGE database, inadvertently corrupting it [19:14:24] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1928300 (10Krenair) yes, although you can probably use intitle:OIT to mostly get the right results if it were set up using subpages [19:15:44] I suppose I could actually do that on that host [19:16:10] valhallasw`cloud: that sounds easier. I’m not sure if it’s possible from the NFS side [19:16:21] * valhallasw`cloud reads up on inotify [19:18:38] inotify and nfs don't really work together across hosts (it's a kernel thing) [19:18:42] valhallasw`cloud: <3 seems complete [19:18:58] valhallasw`cloud: that's possible. 
I think we should shut down tools-shadow and keep it shut [19:22:28] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1928330 (10Dzahn) {meme, src=bikeshed, above="not a strong", below="opinion :)"} [19:22:37] there is some lock file type stuff for NFS, but it's very opt-in for processes and not at all normal FS stuff [19:22:46] I have no clue if SGE master is doing things sanely in that way [19:22:49] if I had to guess it would be a no [19:23:10] valhallasw`cloud: can do tcpdump on the labstore host maybe, but that's dark magic [19:23:25] yeah, lemme first try inotify on tools-grid-shadow [19:23:31] ok [19:23:38] valhallasw`cloud: you can also use fatrace [19:23:46] valhallasw`cloud: that lists all file operations per mount [19:25:35] huh. I'm not even sure if the shadow is running at all? [19:25:46] the newest log file is from dec 30 [19:27:23] YuviPanda: I think the host was rebooted at that time and the shadow master has to be started manually? [19:30:55] oh [19:30:58] right [19:30:59] the grid master starts [19:31:13] ok, so shadow is not a likely suspect [19:39:09] chasemp: did you tinker with the NFS setup in testlabs? Not a problem, just want to make sure that it’s not my fault there’s no shared $home on a new instance [19:39:53] I did not much w/ that, all my testing is local to the chase-nfs-testing-1 instance [19:39:56] that being said [19:40:01] hmm. SGE is keeping open ~1100 .nfs* files. Which is odd, because SGE should be the only one removing files to begin with :| [19:40:03] it's possible I did it inadvertently [19:40:55] valhallasw`cloud: ugh, which files? [19:41:58] 6Labs, 10Labs-Infrastructure, 10Analytics: Report page views for labs instances - https://phabricator.wikimedia.org/T103726#1928678 (10ggellerman) [19:42:17] YuviPanda: ls -a /data/project/.system/gridengine/spool/spooldb | less [19:43:00] YuviPanda: lsof shows master is keeping the file open, but maybe it's some scheduling thing where the exec node deletes it? [19:43:36] valhallasw`cloud: I thought we don't do that [19:43:52] YuviPanda: we do. The nodes have their own spool dir [19:43:59] http://gridscheduler.sourceforge.net/howto/nfsreduce.html [19:44:35] is the idea that exec nodes use a local spoolfile [19:44:46] YuviPanda: we have blue and green but not yellow [19:45:07] ...but is that spool file still on nfs? [19:46:54] the spooling database (/data/project/.system/gridengine/spool/spooldb/__db.*) is on NFS [19:47:46] should we do yellow? [19:48:16] yes almost certainly [19:48:19] I would think [19:48:46] I'm confused about the local spool directories being on NFS [19:48:58] what exactly do the spool directories do? [19:49:01] chasemp: the local one is not, the master one is [19:49:05] ok [19:49:06] store the script files, among others [19:49:11] if we do yellow we lose shadow but that's already the case [19:49:17] and we can backup the bdb files [19:49:25] does any other process besides master use the master spool? [19:49:31] other than shadow [19:49:31] qacct I think [19:49:56] qacct just reads the accounting file [19:50:10] but isn't that on NFS too?
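As a concrete version of the "watch who touches the spool" idea above: a minimal sketch using the third-party `pyinotify` package. As noted in the discussion, inotify only sees events generated on the host it runs on, so this would have to run on each suspect host (e.g. the shadow master); the watch path is taken from the log, the event mask is an assumption.

```python
import pyinotify

SPOOLDB = '/data/project/.system/gridengine/spool/spooldb'

class Reporter(pyinotify.ProcessEvent):
    def process_default(self, event):
        # report which spooldb file was touched and how
        print(event.maskname, event.pathname)

wm = pyinotify.WatchManager()
mask = (pyinotify.IN_OPEN | pyinotify.IN_MODIFY |
        pyinotify.IN_CLOSE_WRITE | pyinotify.IN_DELETE)
wm.add_watch(SPOOLDB, mask)
pyinotify.Notifier(wm, Reporter()).loop()  # blocks; Ctrl-C to stop
```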
[19:50:14] yes [19:50:15] so ironically this is what I wanted to poke at the master on beta for [19:50:23] so one needs to be on master to read [19:50:24] it [19:50:25] to look at the conceivability of no nfs for the master node [19:51:28] here is an interesting side note for you guys [19:51:35] we do all nfs hard mount atm (even for ro things) [19:51:39] ok, so we're actually not doing green [19:51:59] and we have a mount option for intr which says "hey if this locks up still allow signals that are not -9" [19:52:10] which ideally means if nfs locked up on an exec node specifically [19:52:15] the grid master could cull the jobs [19:52:21] and redistribute [19:52:32] turns out intr has been deprecated since 2.6 [19:52:33] the working directory for jobs is also on NFS [19:52:56] including the stdout/stderr log files [19:53:05] sure, some of this only gets squishy when particular exec nodes are affected [19:53:13] if all of nfs is dead it's just hopeless [19:53:54] but we have definitely seen particular exec nodes crap out on nfs and the failure modes have been weird [19:53:59] i.e. only qdel -f having any effect [19:54:02] so, as far as NFS goes: [19:54:02] all hosts have: /data/project/.system/gridengine on /var/lib/gridengine type none (rw,bind) [19:54:03] master has that plus: /data/project/.system/gridengine/spool on /var/spool/gridengine type none (rw,bind) [19:54:09] chasemp: that's not really related to NFS [19:54:14] chasemp: that's just gridengine being gridengine [19:54:28] well, with hard mount the OS ignores all signals except -9 kill [19:54:34] except with initr [19:54:35] YuviPanda: eh, no? that was the grid hanging for some unclear reason [19:54:36] intr [19:54:45] eh, the host* [19:54:57] except intr is deprecated i.e. anytime nfs hangs gridengine couldn't control the job if it wanted to atm [19:55:00] the host hanging for some unclear reason. Maybe NFS, but probably not, given that root login also didn't work [19:55:03] assuming it tries to be graceful [19:55:20] yeah the root login thing seems inconsistent [19:55:53] also, the exec host should be able to handle stopping the job, because that doesn't hit NFS (other than $HOME, maybe? hm.) [19:56:50] when you say all hosts have /var/lib/gridengine [19:56:59] what is all as tools-exec-1210 for example doesn't? [19:57:35] chasemp: it does, as far as I can see? [19:57:41] valhallasw`cloud: yeah, that's what I meant (gridengine hanging for some unclear reason), I don't think that's directly just NFS [19:57:45] mount | grep project, last line [19:57:51] YuviPanda: the *host* was hanging, not just SGE [19:58:20] YuviPanda: and hosts have been hanging more often in the last weeks [19:58:23] hmm [19:58:24] that's true [19:58:28] valhallasw`cloud: ah I understand I thought you meant it was the primary mount point, it's under /data/projects got it [19:58:29] also non-grid hosts [19:58:37] chasemp: right, it's a bind mount [19:58:41] valhallasw`cloud: we actually haven't had a hung host this and last week [19:58:43] I think? [19:58:45] oh no [19:58:48] tools-checker hung didn't it? [19:58:50] when I was sleeping? [19:58:50] I rebooted one today [19:58:52] YuviPanda: yes we have.
Today, tools-exec-1201 [19:58:54] ouch [19:58:57] ok [19:58:59] I'm clearly out of touch :| [19:59:04] and 1202 also died recently, then tools-checker-01 a few days ago [19:59:10] so I don't think all of tools issues is NFS at all [19:59:13] nothing in kernel logs I guess [19:59:19] but I do know that as NFS is currently setup things will behave in insane ways [20:00:36] but I have a problem w/ making the failure modes more sane [20:00:50] in that it shuffles the burden upwards [20:01:05] which is almost certainly a good thing but I just don't know enough to gauge outcome atm [20:02:23] brb lunch [20:03:18] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1928991 (10valhallasw) Inbetween note: the first occurrence of 'Corrupted' in the messages file is ``` 12/30/2015 02:47:55| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare... [20:06:01] valhallasw`cloud: so what do the exec nodes use /var/lib/gridengine for? [20:06:17] pardon the ignorance i think I know but... [20:06:33] chasemp: I think for the SGE binaries [20:06:37] but I'm not really sure [20:06:49] but it's def not used to dispatch jobs or anything [20:07:06] no, I don't think it's used as communication [20:07:16] well that gives me hope :) [20:07:21] let me lsof it [20:07:39] sgeadmin 6307 0.0 0.0 22548 616 ? S 2015 0:00 /usr/lib/gridengine/sge_shepherd -bg [20:07:45] that's *usr* lib gridengine [20:09:18] heh [20:09:19] bash 12985 valhallasw cwd DIR 0,21 4096 87687507 /var/lib/gridengine [20:09:20] :) [20:09:36] chasemp: I'll fatrace it for a while, but it might not be used at all? Very strange. [20:09:44] yeah I agree [20:09:48] I do not get why it's there [20:10:53] ok valhallasw`cloud so a bit of food for thought, if we were to change from the hard mount for NFS because it's really not viable without intr [20:11:12] the behavior would be basically that after a timeout the write operations would return an i/o failure to the process [20:11:20] probably crashing it in 99% of our cases [20:11:21] but [20:11:32] the proc would at least be controllable by sge and would be rescheduled [20:11:40] in one case on another exec node that is viable [20:11:45] in the other where all of nfs is borked [20:11:53] then when it's not borked assuming the master stands [20:12:19] as is any process in the middle of an IO action will basically hang forever and respect no signals except for kill -9 [20:12:31] and the more of that the more consistent bad VM behavior seems to increase [20:12:33] from what I can tell [20:13:34] hard mounts and sge w/o intr are a really non-deterministic state of affairs afaict [20:14:25] my fear with non-hard mounts is data corruption [20:14:34] I'm not sure how well bdb handles write failures, for example [20:15:10] and I am sure tools will not handle write failures well [20:16:09] finally, as far as I know, the sge shepherd sends kill -9 to the process if it doesn't quit after a more subtle nudge [20:16:33] as far as I understand, the shepherd typically just doesn't respond to the qdel request [20:16:41] yeah totally fair concern on data integrity [20:16:51] the bdb thing though should only affect master though? [20:16:56] I would intend to not have master on NFS [20:16:57] yes, bdb is just master [20:16:58] in this scenario [20:16:59] *nod* [20:17:43] thing is when nodes are freezing up afaict they are either hanging forever or we are forcing them to expire w/ a reboot et [20:17:45] c [20:17:53] bah.
fatrace crashes after a while with 'read: Not a directory' [20:17:56] I will dig more maybe a kill -9 is happening at some point [20:18:14] but either way w/ hard mounts and kill -9 or soft mounts and proc errors [20:18:20] data integrity has to be handled at a higher level [20:18:42] this is totally ideological but the make nfs transparent to all things for all users is clearly a bad model [20:18:52] even if nfs is reasonable, trying to hide that there are network operations happening [20:19:00] is just a very tenuous state of affairs [20:19:09] I remember now. The issue with proc errors is that the user code doesn't have to handle it, and can continue in some half-finished state [20:19:17] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1929077 (10Luke081515) [20:19:32] yeah so the zombie state is the real fear [20:19:45] on the other hand, every organisation everywhere uses some sort of networking file system for their employees [20:20:04] honestly I have never seen NFS used before in real life [20:20:08] not since like the 90's [20:20:51] but that's an aside, it should work ....much better than now regardless of anything [20:20:54] the enterprise environment probably mostly uses SMB ;-) [20:21:17] heh I've been in mostly production-y hosting env for a long time now [20:21:25] for instance, nfs is basically persona non grata in prod :) [20:21:35] but being what it is [20:21:46] I'm honestly not sure what the best of all bad choices failure model is [20:22:08] my instinct is to preserve the power grid even if someone's house goes dark however that may happen [20:24:56] Right. [20:25:10] I agree, killing the job should always be possible, even if NFS is dead [20:25:38] I'm not sure if we can easily test what happens if NFS is dead, though ;-) maybe iptables filter out everything to labstore [20:25:56] I can test it per VM [20:26:03] I have a test host for this now and I'm using iptables for it [20:26:07] to test failure modes [20:26:09] chasemp: so as far as the bdb corruption issue goes -- the plan would be: 1) move database off NFS, 2 [20:26:12] I could do the same for an existing exec [20:26:14] 2) reset database again? [20:26:32] as in: start with a clean slate? [20:26:54] I think that's sane but...what happens now when we wipe the slate clean to any continuous jobs? [20:27:03] they are gone forever until someone shows back up and remembers what they were? [20:27:20] I think we lost them in december [20:27:41] is it viable to stop all things, move the db, and import jobs? [20:27:49] it might be [20:28:27] I'm mostly afraid that the DB cleanup doesn't work well enough, and the DB stays in some sort of corrupted state [20:28:38] right [20:28:42] but we can figure that out if moving off nfs doesn't solve it [20:28:44] also [20:28:48] even a dump of running jobs to a file that we can hard load via jsub? [20:28:54] are you aware of any infra/nfs changes in late december? [20:29:04] hm, that's an interesting idea [20:29:27] the last big nfs crash and recovery is the last big nfs change [20:29:39] but ldap changed somewhere in there and NFS uses it [20:30:18] that was the week before christmas iirc?
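To make the hard/soft question above easy to check, here is a small sketch (an illustration, nothing that was deployed) that reports, for each NFS mount on a host, whether it is mounted hard or soft and which timeout options it carries; run on an exec node it shows the mounts the discussion is about.

```python
# List NFS mounts with the options relevant to the hard/soft/intr discussion.
with open('/proc/mounts') as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if not fstype.startswith('nfs'):
            continue
        opts = options.split(',')
        mode = 'soft' if 'soft' in opts else 'hard'  # hard is the default
        timeouts = [o for o in opts if o.startswith(('timeo=', 'retrans='))]
        print('%-40s %-4s %s' % (mountpoint, mode, ','.join(timeouts) or '(defaults)'))
```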
[20:31:26] mid-dec https://wikitech.wikimedia.org/wiki/Incident_documentation/20151216-Labs-NFS [20:31:56] there have been various NFS lockups since then of short duration [20:32:03] or at least the canary checks have failed ocassionally [20:32:24] the part I was not expecting is...in theory our NFS load is very modest most of the time [20:32:34] but then again there is no qos of sanity enforcement [20:32:38] and lots of abusers so who knows [20:32:53] so the canary checks might have been tools-checker-02 having resource issues [20:34:23] could be, I'm really not sure [20:35:54] fwiw this kind of thing is there now (nfsd stats) http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1452285919.889&target=servers.labstore1001.nfsd.input_output.bytes-read&target=servers.labstore1001.nfsd.input_output.bytes-written&from=-72h [20:36:00] assuming the proc stuff is all sane [20:36:29] it's the failure modes being catastrophic that is the core thing [20:39:18] valhallasw`cloud: I don't know how adventurous you are feeling but I'm up for a controlled exec node nfs failure test if you are :) maybe sometimes this week even [20:39:39] curious to see if gridengine can in fact manage jobs (I don't think so?) and how hard the VM spirals etc [20:40:01] chasemp: if we can do that on toolsbeta: yes, sure!, if we can't, I think we can use one of the one-off exec nodes after checking with their user [20:40:20] we can do either, atm I'm banning via iptables on the nfs server [20:40:34] that was the most "real life" scenario from testing [20:40:41] (testing various scenarios that is) [20:43:54] 6Labs, 10Tool-Labs, 7Documentation: Document disabling scheduler (#jobs/time) overload protection temporarily - https://phabricator.wikimedia.org/T123411#1929173 (10Krenair) [20:47:53] Krenair: Doh. Thanks! [20:48:02] computers are hard, yo. [20:55:44] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1929217 (10valhallasw) Actually, there //was// something happening before that. Initially, all messages are from tools-grid-shadow, but suddenly tools-grid-master shows up: ``` 12/29/2015 20:17:54|worker|tools-gr... [21:01:00] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1929241 (10Volker_E) @Halfak Thanks for clarifying. Isn't the LDAP user needed here? [21:01:47] Hello, labsadmin here? [21:05:42] valhallasw`cloud: hrm, do you know why we don't have any new wikibugs.log files? the `wikibugs.log` file ends with 2015-12-29 00:21:02,838 - wikibugs.wb2-phab - INFO - Shutting down [21:05:59] legoktm: they are in wb2-irc.log et al [21:06:21] wb2-irc.out ? [21:06:27] legoktm: sorry, redis2irc.log and wikibugs.log should be the ones [21:06:28] hm. [21:06:42] redis2irc.log is being used [21:06:46] the other one not for some reason [21:07:35] legoktm: wb2-phab.out [21:07:39] wrong invocation when it was started [21:07:58] 'wb2-irc': '{python} {code_dir}/redis2irc.py --logfile {home_dir}/redis2irc.log' [21:08:08] ^ with --logfile it logs to redis2irc, and logrotates [21:08:22] without it logs to stdout, which is then logged in wb2-phab by sge [21:08:33] ugh [21:08:38] it should probably just error out without --logfile :/ [21:16:35] 10Wikibugs: wikibugs is confused by closing tasks by email - https://phabricator.wikimedia.org/T123344#1929297 (10Legoktm) Hrm... 
``` 2016-01-12 01:38:29,643 - wikibugs.wb2-phab - DEBUG - get_transaction_info(123322,OrderedDict([('PHID-XACT-TASK-7m5ji6p6z3lzzgk', 'PHID-XACT-TASK-7m5ji6p6z3lzzgk'), ('PHID-XACT-T... [21:19:41] 10Wikibugs: wikibugs is confused by closing tasks by email - https://phabricator.wikimedia.org/T123344#1929324 (10Legoktm) Oh, duh. ```lang=python if 'core:comment' in transactions: useful_event_metadata['comment'] = transactions['core:comment'].get('comments', 'Removed.') ``` Added in {2faf... [21:23:36] (03PS1) 10Legoktm: Don't say "Removed." if we think a comment has been deleted [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/263672 (https://phabricator.wikimedia.org/T123344) [21:30:51] (03CR) 10Merlijn van Deen: [C: 032] Don't say "Removed." if we think a comment has been deleted [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/263672 (https://phabricator.wikimedia.org/T123344) (owner: 10Legoktm) [21:31:57] (03Merged) 10jenkins-bot: Don't say "Removed." if we think a comment has been deleted [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/263672 (https://phabricator.wikimedia.org/T123344) (owner: 10Legoktm) [21:32:10] valhallasw`cloud: wanna deploy that too? :D [21:32:18] and fix the logging while you're at it? ;) [21:32:25] legoktm: eh sure [21:32:39] !log wikibugs kill jobs and restart w/ phab to fix logging & eplot Removed. patch [21:32:45] !log tools.wikibugs kill jobs and restart w/ phab to fix logging & eplot Removed. patch [21:33:59] dum dum dum [21:34:17] \o/ [21:35:02] except something si broken [21:35:25] python 3 and still encoding issues [21:35:26] fun fun fun [21:36:25] well, without log file then [21:38:09] Change on 12www.mediawiki.org a page Wikimedia Labs was modified, changed by Andrewbogott link https://www.mediawiki.org/w/index.php?diff=2014739 edit summary: [-18] [21:38:33] 6Labs, 10DBA, 10MediaWiki-extensions-ContentTranslation, 7WorkType-NewFunctionality: Replicate ContentTranslation databases on Labs - https://phabricator.wikimedia.org/T119847#1929422 (10Ricordisamoa) [21:50:19] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1929464 (10valhallasw) So, to continue, the other jumps to high ids happened at: ``` Fri, 19 Jun 2015 05:20:50 GMT Tue, 01 Sep 2015 06:18:02 GMT [sge message log starts here] Thu, 24 Sep 2015 13:56:02 GMT Sat, 2... [21:51:44] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1929466 (10Halfak) Nope. Wikitech works just fine. We administer these things via the wiki. You've been added and should be able to connect after the next puppet run... I think :) [21:53:31] ok, time for bed [21:54:26] YuviPanda: is there a way to get uwsgi to redirect http to https always? [21:58:58] madhuvishy: mm, I don't think so [21:59:20] *maybe* the proxy sets a header indicating the original protocol [21:59:26] valhallasw`cloud: aah [21:59:44] madhuvishy: but uwsgi will always see an http connection (from the proxy) [22:00:09] valhallasw`cloud: what is the proxy? 
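A sketch of the guard described in the "Removed." fix above: only attach a comment to the IRC line when Phabricator actually returned comment text. The dict shape follows the snippet quoted from the task; this is an illustration of the idea, not the merged change.

```python
def extract_comment(transactions):
    """Return comment text for the IRC message, or None if there is none.

    Skipping the old 'Removed.' fallback means tasks closed by email (or
    with a deleted comment) no longer claim that a comment was removed.
    """
    info = transactions.get('core:comment')
    if not info:
        return None
    return info.get('comments') or None

# toy usage with hypothetical transaction data
print(extract_comment({'core:comment': {'comments': 'Looks good, merging.'}}))
print(extract_comment({'core:comment': {}}))  # would have printed "Removed." before
print(extract_comment({}))                    # no comment transaction at all
```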
[22:00:19] madhuvishy: the X-Original-URI header might include the protocol [22:00:43] madhuvishy: assuming we're talking about tool labs, tools.wmflabs.org [22:00:57] 10Wikibugs, 5Patch-For-Review: wikibugs is confused by closing tasks by email - https://phabricator.wikimedia.org/T123344#1929491 (10Legoktm) 5Open>3Resolved a:3Legoktm [22:01:08] valhallasw`cloud: no - i'm writing a puppet module for wikimetrics [22:01:22] there's a uwsgi web server that i set up [22:01:28] right, so then you're using Special:NovaProxy, I assume? [22:01:34] ah yes [22:02:10] madhuvishy: ok, so then X-Forwarded-Proto is available I think [22:03:36] valhallasw`cloud: hmmm - atm, if i request https, it serves https [22:03:56] i was wondering if i could make it serve https if i request http [22:04:22] 6Labs, 10wikitech.wikimedia.org, 7Epic: [EPIC] Make wikitech more friendly for the multiple audiences it supports - https://phabricator.wikimedia.org/T123425#1929496 (10bd808) 3NEW [22:04:25] madhuvishy: yes. Check for X-Forwarded-Proto = 'http', and in that case serve a redirect to the corresponding https url [22:04:41] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1929507 (10bd808) [22:04:42] 6Labs, 10wikitech.wikimedia.org, 7Epic: [EPIC] Make wikitech more friendly for the multiple audiences it supports - https://phabricator.wikimedia.org/T123425#1929506 (10bd808) [22:04:48] valhallasw`cloud: is there an email alias or mailing list to email all the labs admins at once? [22:05:01] valhallasw`cloud: mm hmmm, cool, i'll do that thanks [22:05:31] or maybe a wiki page that they all monitor for discussions [22:05:43] kaldari: not for labs, but for tool labs root@tools.wmflabs.org should work [22:05:55] thanks! [22:06:06] * valhallasw`cloud is really off to bed now :-) [22:06:12] * legoktm huggles valhallasw`cloud [22:10:31] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1929518 (10Dzahn) @b808 oh, cool, so definitely, before we create multiple wikis, please let's do namespaces :) i take back any concerns, heh [22:13:12] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929520 (10bd808) 3NEW [22:16:09] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929535 (10Krenair) +1 [22:18:03] 6Labs, 10Tool-Labs, 10wikitech.wikimedia.org: Create a Tool namespace on wikitech for documentation of Tool Labs projects - https://phabricator.wikimedia.org/T123429#1929557 (10bd808) 3NEW [22:20:58] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929570 (10bd808) Potential initial content: https://wikitech.wikimedia.org/wiki/User:BryanDavis/NewPortals This is a strawman design I've put to... 
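A minimal sketch of the redirect suggested above for an app (like wikimetrics) running under uwsgi behind the labs proxy, assuming a Flask-style application; the `X-Forwarded-Proto` check is the approach mentioned in the discussion, and the app and route names are placeholders.

```python
from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def force_https():
    # The proxy terminates TLS, so the app itself only ever sees plain HTTP;
    # X-Forwarded-Proto tells us which protocol the client originally used.
    if request.headers.get('X-Forwarded-Proto') == 'http':
        return redirect(request.url.replace('http://', 'https://', 1), code=301)

@app.route('/')
def index():
    return 'served over https'
```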
[22:25:09] 6Labs, 10wikitech.wikimedia.org: Exclude nova resource pages from *default* wikitech search - https://phabricator.wikimedia.org/T122993#1929578 (10bd808) [22:25:11] 6Labs, 10wikitech.wikimedia.org, 7Epic: [EPIC] Make wikitech more friendly for the multiple audiences it supports - https://phabricator.wikimedia.org/T123425#1929577 (10bd808) [22:29:56] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 15User-bd808: Create an 'OfficeIT' namespace on wikitech - https://phabricator.wikimedia.org/T123383#1929585 (10Dzahn) https://www.mediawiki.org/wiki/Manual:Using_custom_namespaces#Creating_a_custom_namespace [22:30:23] bd808, are you planning to implement https://phabricator.wikimedia.org/T123429 ? [22:30:39] generally these should be going through wikimedia-site-requests [22:31:31] just a proposal? [22:31:34] Krenair: I was going to send an email to get some discussion before adding site-requests to the taks [22:31:38] *task [22:31:41] ok [22:31:43] cool [22:31:59] I don't plan on trying to "just sneak it in" (as tempting as that is) [22:32:14] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929598 (10Krenair) I'd like to emphasise tool labs being part of labs in the design [22:47:10] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929636 (10kaldari) @bd808: Are you also considering creating a Tools namespace specifically for Tool Labs content? This would solve the problem o... [22:53:48] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929661 (10kaldari) Found the task for the Tool namespace, so nevermind :) T123429 [22:54:10] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929667 (10Krenair) >>! In T123427#1929636, @kaldari wrote: > @bd808: Are you also considering creating a Tools namespace specifically for Tool La... [22:56:39] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929671 (10bd808) >>! In T123427#1929636, @kaldari wrote: > @bd808: Are you also considering creating a Tools namespace specifically for Tool Labs... [23:02:49] 6Labs, 10wikitech.wikimedia.org: Create Portal namespace on wikitech to give a place for audience specific landing pages - https://phabricator.wikimedia.org/T123427#1929683 (10bd808) >>! In T123427#1929667, @Krenair wrote: >>>! In T123427#1929636, @kaldari wrote: >> @Krenair: I think emphasizing that Tool Labs... [23:16:06] 6Labs, 7Monitoring, 5Patch-For-Review, 7Shinken, 7Upstream: shinken.wmflabs.org redirects on https-login to http - https://phabricator.wikimedia.org/T85326#1929753 (10Krenair) a:5Krenair>3None [23:17:52] 6Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-OpenStackManager, 5Patch-For-Review, 5WMF-deploy-2016-01-12_(1.27.0-wmf.10): Cannot remove all Puppet classes from a Labs instance - https://phabricator.wikimedia.org/T122733#1929762 (10Krenair) This should go live on wikitech tomorrow with the MW depl... 
[23:18:36] 6Labs, 10Labs-Infrastructure, 10MediaWiki-extensions-OpenStackManager, 5WMF-deploy-2016-01-12_(1.27.0-wmf.10): Cannot remove all Puppet classes from a Labs instance - https://phabricator.wikimedia.org/T122733#1929763 (10Krenair) [23:36:08] 6Labs, 10Tool-Labs, 10wikitech.wikimedia.org: Create a Tool namespace on wikitech for documentation of Tool Labs projects - https://phabricator.wikimedia.org/T123429#1929843 (10Ricordisamoa) [23:36:10] 6Labs, 10Tool-Labs, 7Documentation: Create a wiki documentation page for each tool - https://phabricator.wikimedia.org/T122865#1929844 (10Ricordisamoa) [23:37:00] 6Labs, 10Tool-Labs, 10wikitech.wikimedia.org, 7Documentation: Create a wiki documentation page for each tool - https://phabricator.wikimedia.org/T122865#1915546 (10Ricordisamoa)