[00:00:14] how about the actual deployment. should we also remove that with it? [00:00:38] just thinking that you may want deployment-prep along with deployment [00:06:15] you're talking about the production server group mutante? [00:06:39] yes, production deployment [00:06:52] just looked at that "restricted" group too [00:07:05] 10Wikimedia-Labs-General: Labs infrastructure work - https://phabricator.wikimedia.org/T41784#2450936 (10Danny_B) [00:07:26] robla: Removed RobLa from deployment-prep. [00:09:40] 10Wikimedia-Labs-General: Labs infrastructure work - https://phabricator.wikimedia.org/T41784#434871 (10Dzahn) I think we should probably not tickets years after they have been resolved. [00:10:03] mutante: I'll think about the production one; may need to ask me again later. [00:10:52] robla: *nod* [00:32:09] 06Labs, 10Tool-Labs, 07Regression: uWSGI webservice terminating unexpectedly - https://phabricator.wikimedia.org/T139020#2450974 (10D3r1ck01) @zhuyifei1999, I did all that but still have the same results (web service terminates unexpectedly). [00:47:55] matanya, zhuyifei1999_: I'm going to cycle power on labvirt1012 in a minute, which means encoding02 will be off for a bit. Should be quick. [02:00:49] 06Labs, 10Horizon, 05Continuous-Integration-Scaling: Labs project admin can not delete per project image on Horizon - https://phabricator.wikimedia.org/T110936#1590423 (10AlexMonk-WMF) ```modules/openstack/files/kilo/glance/policy.json: "delete_image": "rule:admin_or_glanceadmin", modules/openstack/files/... [05:53:22] 06Labs, 10Tool-Labs: No permission after creating a new tool - https://phabricator.wikimedia.org/T140004#2451306 (10Dalba) [06:07:31] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2451311 (10zhuyifei1999) >>! In T139802#2450156, @Matanya wrote: > cause the stuff is not puppetized, cause puppet on labs suckssss. I'll try to build some manual (simple) puppet... 
[06:09:24] chasemp: yes [06:35:46] (03CR) 10Legoktm: [C: 04-2] "No, we're not going to add more channels like that." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/292554 (owner: 10Paladox) [06:58:51] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2451508 (10zhuyifei1999) @Matanya @Andrew can you apply puppet role `role::labs::lvm::srv` to the instances? Apparently I cannot call puppet modules from `operations/puppet` with c... [07:15:13] 10Wikibugs: Wikibugs links sometimes to the creation event, not to the mentioned comment - https://phabricator.wikimedia.org/T129246#2099604 (10Nikerabbit) Another example which is not linking to the creation event [10:08:43] wikibugs> MediaWiki-User-login-and-signup, MediaWiki-extensions-CentralAuth, Wikimedi... [11:18:14] !log ores aad92ac goes to staging [11:18:17] 10Labs-project-wikistats, 10Analytics, 10Analytics-Wikistats: Design new UI for Wikistats 2.0 - https://phabricator.wikimedia.org/T140000#2452139 (10Danny_B) [11:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [11:18:41] tom29739 how's uwsgi on k8s so far? [11:18:50] I'm writing a patch that'll make webservice restarts much faster... [11:19:43] yuvipanda, it's working well. [11:20:04] pip should be much faster as well than with gridengine [11:25:22] !log ores deploying aad92ac to web and worker nodes [11:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [11:25:27] musikanimal Matthew_ around? 
I want to see if we can move xtools-ec to k8s [11:26:55] Hi everyone, I have an old and easy question, originally for Coren [11:27:32] The http://tools.wmflabs.org/?tool=xxx information uses .description but should use toolinfo.json [11:27:44] (In order to avoid duplicating the same info) :) [11:28:31] there's a bug for it somewhere, but I think it's unlikely to get fixed anytime soon - nobody has the bandwidth to touch the homepage just now.... [11:28:59] yuvipanda, it also appears to be working faster too (I just did a quick apachebench test): Time per request: 13.112 [ms] (mean) [11:29:11] yuvipanda: was that for me? [11:29:16] jem yup! [11:29:23] Ah, thanks :) [11:29:31] jem: If you're interested in picking it up, https://phabricator.wikimedia.org/diffusion/LTOL/ is the source of the front page [11:29:34] But I'm surprised about the "bandwith" problem [11:29:54] human bandwidth [11:29:59] i.e. time/energy [11:30:03] https://phabricator.wikimedia.org/T115650 is also related [11:30:12] Thanks, valhallasw`vecto [11:30:14] ah yes, what valhallasw said. not network bandwidth [11:30:25] That sounds more like it [11:31:19] Well, I wouldn't mind to help if it's possible [11:31:36] That particular point seems easy to fix [11:33:26] jem, it looks like it's this file: https://phabricator.wikimedia.org/diffusion/LTOL/browse/master/www/content/tool.php [11:33:43] And this line: if ( is_readable( "{$home}/.description" ) ) { [11:33:54] Yes [11:39:35] jem (IRC): https://phabricator.wikimedia.org/rLTOLbde15df2a379c33edfb8350afd2f0c7186705a93 [11:39:56] so I think it /is/ used, but read from the database and synced every now and then? 
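[Editor's note: the front-page logic jem and valhallasw`vecto are discussing lives in PHP (www/content/tool.php); the sketch below is Python, purely to illustrate the proposed precedence — prefer toolinfo.json, fall back to the legacy .description file. The function name and exact behavior are assumptions, not the actual tool.php code.]

```python
import json
from pathlib import Path

def tool_description(home: str) -> str:
    """Sketch only: prefer the description in toolinfo.json, fall back
    to the legacy .description file, return '' if neither is usable."""
    toolinfo = Path(home) / "toolinfo.json"
    if toolinfo.is_file():
        try:
            return json.loads(toolinfo.read_text()).get("description", "")
        except ValueError:
            # malformed JSON: fall through to the legacy file
            pass
    legacy = Path(home) / ".description"
    if legacy.is_file():
        return legacy.read_text().strip()
    return ""
```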
[11:41:30] Hmmm [11:46:28] Ah, yes, it's working :) [11:47:18] Great, removing it from my to-do list [11:47:32] Thanks everyone [11:51:40] (I have another pending task related to OAuth, but let's bother just once a day) :) [12:10:19] 06Labs, 10Labs-project-Phabricator, 13Patch-For-Review, 07Puppet: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#2452335 (10Danny_B) [12:10:43] 06Labs, 10Labs-project-Phabricator: Login to phab-0[124].phabricator.eqiad.wmflabs is broken, even as root - https://phabricator.wikimedia.org/T130693#2452338 (10Danny_B) [12:11:01] 06Labs, 10Labs-project-Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2452340 (10Danny_B) [12:11:21] 06Labs, 10Labs-project-Phabricator: https://phab-01.wmflabs.org returns a core exception - https://phabricator.wikimedia.org/T137270#2452344 (10Danny_B) [12:11:43] 06Labs, 10Labs-project-Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2452347 (10Danny_B) [12:12:10] 10Labs-project-Phabricator: have a phabricator test instance in labs that uses a working puppet role - https://phabricator.wikimedia.org/T139475#2452350 (10Danny_B) [12:12:54] 06Labs, 10Labs-Infrastructure, 10Labs-project-Phabricator: can't log in to phab-01.eqiad.wmflabs - https://phabricator.wikimedia.org/T125666#2452352 (10Danny_B) [12:16:43] 10Labs-project-Phabricator: Phab-02 sending old stylesheet copies - https://phabricator.wikimedia.org/T94413#2452365 (10Danny_B) [12:26:54] 10Labs-project-Phabricator: phab-01.wmflabs.org test instance's statuses are out of date - https://phabricator.wikimedia.org/T76943#2452400 (10Danny_B) [12:27:12] 10Labs-project-Phabricator: Email not working on phab-01.wmflabs.org - https://phabricator.wikimedia.org/T76427#2452401 (10Danny_B) [12:27:47] 10Labs-project-Phabricator: phab-01.wmflabs.org triggers HeraldManiphestTaskAdapter error when commenting - 
https://phabricator.wikimedia.org/T98586#2452404 (10Danny_B) [12:27:58] goddamit danny_b [12:28:02] 10Labs-project-Phabricator: Upgrade phab-01 to use the same version as production Phabricator - https://phabricator.wikimedia.org/T78168#2452405 (10Danny_B) [13:13:36] 10Labs-project-Phabricator: Phabricator on labs has failed cronjob - https://phabricator.wikimedia.org/T1151#2452566 (10Danny_B) [13:17:08] 10Labs-project-Phabricator: Phabricator-Labs project? - https://phabricator.wikimedia.org/T1168#2452589 (10Danny_B) [13:17:18] !log git deleting instance git-redirects-01.git.eqiad.wmflabs (I forgot to do that the other day) [13:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master [13:24:08] 10Labs-project-Phabricator: Change phab-03 to 2015 redesign - https://phabricator.wikimedia.org/T103918#2452626 (10Danny_B) [13:28:10] zhuyifei1999_ around? [13:28:23] yeah [13:28:29] I want to walk you through migrating uwsgi to k8s and then use that for writing docs [13:28:34] now a good time? [13:28:49] ok [13:29:39] ok [13:29:43] * zhuyifei1999_ ssh-ing in [13:29:44] so the thing to do is to [13:29:56] 1. webservice --backend=kubernetes python2 shell [13:30:15] ok [13:30:22] 2. create a new venv in say, ~/www/python/venv.new [13:30:29] 3. Install the things you need in here [13:30:47] it says to stop first though [13:31:07] zhuyifei1999_ oh, hmm. is it ok if we stop video2commons for a while while doing this? [13:31:12] if not I can patch webservice to not need that [13:31:48] the tool is being flooded right now [13:31:55] wait [13:33:53] yuvipanda: is there a way to show "this tool in maintenance" while the webservice is down? [13:34:24] unfortunately not really... [13:34:37] but I've just made a patch that removes the requirement to take the gridengine job down [13:34:43] gimme a moment I'll deploy a temp version [13:34:52] ok [13:35:44] zhuyifei1999_ ok, use /tmp/tools/bin/webservice --backend=kubernetes python2 shell? 
[13:36:25] andrewbogott matanya: can you apply the puppet role for /srv to 02 & 03? 01 is severely overloaded [13:36:48] yuvipanda: which bastion? [13:36:59] zhuyifei1999_ tools-login [13:37:53] $ /tmp/tools/bin/webservice --backend=kubernetes python2 shell [13:37:53] Traceback (most recent call last): [13:37:53] File "/tmp/tools/bin/webservice", line 73, in <module> [13:37:53] if 'backend' in tool.manifest and tool.manifest['backend'] != args.backend & args.action != 'shell': [13:37:54] TypeError: unsupported operand type(s) for &: 'str' and 'str' [13:38:21] it should be "and" right? [13:39:21] hmm [13:39:22] yes [13:39:25] I'm an idiot [13:39:32] zhuyifei1999_ try now [13:39:55] 10Tool-Labs-tools-wikibugs-IRC-bot, 10Wikibugs, 06Project-Admins: Merge wikibugs projects - https://phabricator.wikimedia.org/T75765#2452772 (10Danny_B) [13:40:11] ok [13:41:11] the shell is very weird, every time I press up and down the prompt goes up one line o.O [13:41:29] (I mean its location) [13:41:58] right [13:42:09] zhuyifei1999_ type 'stty rows 50 cols 150' [13:42:13] that should fix that [13:42:18] zhuyifei1999_: done [13:42:19] (this is an upstream bug I'm tracking and will deploy a fix once they do) [13:42:28] zhuyifei1999_: (the /srv thing I mean) [13:42:34] yuvipanda: ok [13:42:38] andrewbogott: thx [13:42:50] I'll pool them soon [13:43:38] https://www.irccloud.com/pastebin/tOc88vOM/ [13:43:47] yuvipanda: ^ [13:44:03] should I deactivate virtualenv before doing so? [13:44:41] zhuyifei1999_ you should just ignore the venv that currently exists [13:44:42] yeah [13:44:46] just create a new one [13:44:52] we can move it into place once we verify it works [13:44:57] ok [13:44:58] and yeah, venvs created in trusty don't work on jessie... [13:45:39] ok there's $ deactivate which works pretty well [13:46:14] * yuvipanda nods [13:47:53] wow this pip is fast [13:48:23] yeah [13:50:13] so everything installed, what's the next step? [13:51:36] zhuyifei1999_ make sure app.py loads?
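[Editor's note: the TypeError above is an operator-precedence bug — in Python, `&` binds tighter than `!=` and is not defined for strings, so the check needs the boolean `and` (which is the fix yuvipanda deploys). A minimal reproduction with stand-in values; the variable names are stand-ins for `tool.manifest['backend']`, `args.backend`, and `args.action`, not the real webservice code.]

```python
# Stand-ins for tool.manifest['backend'], args.backend and args.action.
manifest_backend, args_backend, args_action = "gridengine", "kubernetes", "shell"

# Buggy form: '&' binds tighter than '!=', so Python evaluates
# args_backend & args_action first -- bitwise AND of two strings -> TypeError.
try:
    broken = manifest_backend != args_backend & args_action != "shell"
except TypeError as e:
    print("TypeError:", e)

# Fixed form, matching the corrected check: boolean 'and' instead of '&'.
mismatch = manifest_backend != args_backend and args_action != "shell"
print(mismatch)  # False here, because the action is 'shell'
```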
[13:51:40] python app.py [13:51:46] from the python in your new venv [13:51:51] ok [13:52:45] yep [13:52:52] ok [13:52:53] it loads [13:52:57] cool [13:53:03] then just deactivate venv again [13:53:04] and move it [13:53:07] mv venv venv.old [13:53:10] mv venv.new venv [13:53:21] so now we know our venv works we're just moving the old one away... [13:53:29] but keeping it around just in case we need to revert back to gridengine [13:53:36] when you're done with that, exit the shell [13:53:38] and do [13:53:45] 'webservice --backend=gridengine stop' [13:53:52] 'webservice --backend=kubernetes python2 start' [13:54:04] um should I stop webservice before moving? [13:54:09] nope [13:54:11] shouldn't matter [13:54:13] ok [13:55:57] 10Labs-project-Phabricator, 13Patch-For-Review: Stabilize vcs-user owned files and directories in Phab-02 - https://phabricator.wikimedia.org/T95982#2452896 (10Danny_B) [13:56:18] yuvipanda: it's up https://tools.wmflabs.org/video2commons/ [13:56:26] \o/ [13:56:27] test? [13:56:36] 10Labs-project-Phabricator: Admin access to phab-01.wmflabs.org for RobLa-WMF - https://phabricator.wikimedia.org/T85498#2452900 (10Danny_B) [13:57:40] everything looks okay [13:58:02] \o/ cool [13:58:10] 10Labs-project-Phabricator: phab-01 is broken: HTTPFutureCURLResponseStatus - https://phabricator.wikimedia.org/T88272#2452903 (10Danny_B) [13:58:19] zhuyifei1999_ try a restart with the webservice code in /tmp? it should be much faster than gridengine based restarts [13:58:23] 10Labs-project-Phabricator: Registration for phab-01.wmflabs.org broken: AphrontDuplicateKeyQueryException - https://phabricator.wikimedia.org/T88346#2452904 (10Danny_B) [13:58:48] using the /tmp code or the code in path? [13:58:59] the /tmp code [13:59:05] which I'll hopefully deploy later today [13:59:24] ok [13:59:47] /tmp/tools/bin/webservice --backend=kubernetes python2 restart ? 
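[Editor's note: the full uwsgi-to-Kubernetes migration walked through above, gathered in one place for the docs yuvipanda mentions writing. Paths are per-tool; `requirements.txt` stands in for "the things you need"; depending on the deployed version you may need the temporary `/tmp/tools/bin/webservice` build instead of plain `webservice`.]

```
$ webservice --backend=kubernetes python2 shell    # opens a shell in the jessie container
$ stty rows 50 cols 150                            # work around the prompt glitch
$ virtualenv ~/www/python/venv.new                 # venvs created on trusty don't work on jessie
$ ~/www/python/venv.new/bin/pip install -r requirements.txt
$ ~/www/python/venv.new/bin/python app.py          # smoke-test that the app loads
$ exit
$ mv ~/www/python/venv ~/www/python/venv.old       # keep the old venv in case of rollback
$ mv ~/www/python/venv.new ~/www/python/venv
$ webservice --backend=gridengine stop
$ webservice --backend=kubernetes python2 start
```

To revert to gridengine, swap the venv directories back and run the stop/start pair with the backends exchanged.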
[13:59:52] yeah [14:00:54] yep up [14:01:03] \o/ [14:01:16] I definitely like this solution to the uwsgi problems... [14:02:00] :) [14:03:46] 10Labs-project-Phabricator: New tasks on phab-01.wmflabs.org created with conduit aren't visible to others - https://phabricator.wikimedia.org/T91995#2452944 (10Danny_B) [14:04:03] zhuyifei1999_ thanks for being a guinea pig! [14:04:10] lol [14:05:01] yuvipanda hi, creating a instance using precise does not work. Im only trying to create an instance to test zuul running in precise [14:05:12] to find out why a newer version wont work on precise on production [14:05:27] But im getting error like these [14:05:28] Jul 12 14:03:20 gerrit-test-4 nslcd[1069]: [8b4567] ldap_start_tls_s() failed: Connect error: No such file or directory (uri="ldap://ldap-eqiad.wikimedia.org:389") [14:05:28] Jul 12 14:03:20 gerrit-test-4 nslcd[1069]: [8b4567] failed to bind to LDAP server ldap://ldap-eqiad.wikimedia.org:389: Connect error: No such file or directory [14:05:36] paladox you should ping andrewbogott or file a bug, I think. [14:05:40] Oh ok [14:06:41] yuvipanda: btw jessie is systemd right? [14:07:01] yeah but we don't run systemd in the containers [14:07:13] all of these are running in docker containers orchestrated by kubernetes [14:07:37] oh I'm just having trouble repooling the backends [14:07:46] ah unrelated, I see [14:07:46] yes, jessie is systemd [14:07:48] not k8s [14:07:51] right [14:08:10] finding the right one to use from https://github.com/celery/celery/tree/master/extra [14:08:40] ah [14:08:44] yes you need systemd [14:09:07] 06Labs: Creating a instance with precise fails - https://phabricator.wikimedia.org/T140099#2452963 (10Paladox) [14:09:10] yuvipanda ^^ [14:09:52] paladox: creation of precise hosts is almost entirely unsupported — I know how to fix that, but how important is it? 
[14:10:03] Oh not important [14:10:29] Not that important since gerrit is being updated this week including it moving to a new host [14:10:40] Just was wondering why. [14:10:57] 06Labs, 10Labs-project-Phabricator: Phab-02 not serving web pages from hostname linked to IP - https://phabricator.wikimedia.org/T96484#2452980 (10Danny_B) [14:12:48] paladox: probably the precise base image doesn't know about the new ldap servers [14:12:57] Oh [14:14:38] 06Labs: Switch existing and new trusty instances to GRUB 2 - https://phabricator.wikimedia.org/T140100#2452995 (10faidon) [14:15:43] 06Labs, 10Labs-project-Phabricator: Get rid of NFS in the phabricator Labs project - https://phabricator.wikimedia.org/T102703#2452988 (10Danny_B) [14:17:12] legoktm I moved fab-proxy to k8s [14:17:46] bd808 I'm going to move hatjitsu to k8s now [14:21:07] bd808 hatjitsu moved over! [14:24:52] yuvipanda: http://docs.celeryproject.org/en/latest/tutorials/daemonizing.html#usage-systemd <= there's no /etc/conf.d on jessie, what's the jessie equivalent of the dir ? [14:25:52] zhuyifei1999_ it's just convention - that path is explicitly referred in https://github.com/celery/celery/blob/master/extra/systemd/celery.service#L9 for example. I think /etc/defaults is usually where people put it [14:26:01] ok [14:28:53] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453053 (10Dalba) [14:30:13] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453053 (10yuvipanda) Do you *really* need 3.5? I think it'll be much easier for everyone if you could just use 3.4 - I don't think we can provide support for 3.5 yet unfortunatel... [14:46:51] andrewbogott: can you rebuild the two instances? 
I think I've created quite a lot of junk experimenting with puppet in the last hour [14:47:03] zhuyifei1999_: sure [14:47:11] right now, or do you want to mess with them more first? [14:47:30] hmm [14:47:38] * zhuyifei1999_ checks [14:49:44] yeah I think everything is okay [14:49:51] ok, will rebuild now [14:49:59] thx [14:50:33] zhuyifei1999_: you will need to fix the proxies again, since IPs will change [14:50:44] oh [14:50:56] well, I'm not projectadmin :/ [14:51:34] ah, ok, I'll do it then [14:58:14] zhuyifei1999_: ok, all set [14:58:30] k [14:58:39] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2453218 (10yuvipanda) [15:02:51] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Install libmysqlclient-dev in tools python2 kubernetes containers - https://phabricator.wikimedia.org/T140112#2453261 (10yuvipanda) [15:04:46] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2453218 (10yuvipanda) [15:05:53] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2453294 (10yuvipanda) [15:08:48] andrewbogott: both instances are up, but they aren't receiving [15:09:12] (I mean old task that 01 is supposed to handle) [15:09:35] zhuyifei1999_: that's because of a pool someplace, right? [15:10:00] well, idk how celery does this exactly [15:10:36] so I can't risk depooling 01 right now [15:10:54] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453318 (10Dalba) Not *really*... I'll try python3.4. Just a dumb question: how do you create a virtual environment for python 3.4? Should I compile another python3.4 from the so... 
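[Editor's note: on the celery daemonizing question earlier — the upstream `extra/systemd/celery.service` unit reads its settings from an `EnvironmentFile`, and on Debian (including jessie) the conventional location is `/etc/default`, not Red Hat's `/etc/conf.d`. A hedged sketch; the paths, user, and variable names below follow the upstream example but are assumptions, not the video project's actual config.]

```ini
# /etc/systemd/system/celery.service (fragment, adapted from upstream's
# extra/systemd/celery.service; values here are illustrative)
[Service]
Type=forking
User=celery
Group=celery
# Upstream ships EnvironmentFile=-/etc/conf.d/celery; on Debian the
# conventional spot is /etc/default (any readable path works).
EnvironmentFile=-/etc/default/celery
WorkingDirectory=/opt/celery
ExecStart=/bin/sh -c '${CELERY_BIN} multi start ${CELERYD_NODES} \
  -A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE}'
```

After editing the unit, `systemctl daemon-reload` is needed before `systemctl start celery`.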
[15:11:21] zhuyifei1999_: sounds like we need matanya's help to figure out about pooling — I don't know anything about the internals of that project of course :) [15:12:21] oh well I'm usually the person managing those (unless he's doing a lot of stuffs that idk) [15:14:25] iirc all pending tasks gets rescheduled hourly [15:15:07] 06Labs, 10Incident-20151216-Labs-NFS, 06Operations: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2453330 (10fgiunchedi) adding labs too, ATM this is the situation kernel-wise: ``` $ ssh labstore1001.eqiad.wmnet uname -a Linux labstore1001 3.1... [15:18:31] andrewbogott: the real thing I'm worried about is 01 might run out of disk space http://tools.wmflabs.org/nagf/?project=video, with so much stuffs running [15:21:09] zhuyifei1999_: let me know if I can do anything to help [15:21:16] ok [15:28:55] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453053 (10tom29739) @Dalba ```tools.piagetbot@tools-bastion-03:~$ virtualenv ~/venv -p /usr/bin/python3 Running virtualenv with interpreter /usr/bin/python3 Using base prefix '/u... [15:30:59] andrewbogott: 03 is receiving :) [15:31:06] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453424 (10yuvipanda) You can do so with: ``` virtualenv -p python3 venv ``` [15:31:06] cool [15:31:17] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453425 (10yuvipanda) https://phabricator.wikimedia.org/T104374#1911373 has info too [15:32:18] I'll depool 01 after all pending tasks are being handled [15:37:44] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Install dependencies for python-lxml in python container - https://phabricator.wikimedia.org/T140117#2453448 (10yuvipanda) [15:39:16] Amir1 around? 
[15:39:33] yuvipanda: yup [15:39:59] Amir1 I see you're involved in the checkdictation-fa tool - I want to move it to k8s. thoughts? [15:40:32] yuvipanda: I'm involved a little bit, but I'm just doing some maintenance and robustness [15:40:48] if I move it to k8s do you think you can verify it works? [15:40:50] the whole thing is being done by Yamaha5 (Reza) [15:40:57] yeah [15:41:09] k [15:41:09] ok [15:43:28] Amir1 actually no, I guess I'm not doing that just yet. I'll do so later and ping you etc :) [15:43:31] thanks tho [15:43:54] yuvipanda: okay, thanks :) [15:59:25] 06Labs, 10Labs-Infrastructure: Review disk overcommit ratio for Nova - https://phabricator.wikimedia.org/T140122#2453627 (10Andrew) [16:00:07] 06Labs, 10Labs-Infrastructure: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2453645 (10Andrew) [16:01:03] 10Wikibugs, 06Project-Admins: Rename "phawikibugs" project to just "wikibugs" - https://phabricator.wikimedia.org/T1123#2453648 (10Danny_B) p:05Triage>03Low [16:13:49] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453747 (10Dalba) 05Open>03Resolved a:03Dalba tom29739 and yuvipanda, thank you so much! I could get to work with `plain-uwsgi`. [16:16:43] 06Labs, 10Tool-Labs: Needing help running the webservice for a simple flask application - https://phabricator.wikimedia.org/T140103#2453789 (10Dalba) [16:38:45] yuvipanda: Ping? [16:39:24] hi Matthew_ [16:39:51] Hello! Did you ping me this morning about moving xtools-ec to something? [16:40:30] Matthew_ yes! I want to move it to kubernetes (nothing changes for you!) and want someone who knows it to test if it is ok after I move it [16:42:52] yuvipanda: oh fyi I see someone transcoding a file named "Wikimania 2016, Hackathon- Running bots and executive code on labs with just a web terminal (PAWS)" on my tool. 
I'll watch this one :) [16:43:07] \o/ [16:43:08] niiiice [16:43:18] let me know when a link is available? [16:43:26] although I feel quite embarassed by my voice... [16:43:40] ok lol [16:44:10] yuvipanda: I can do that. Feel free to make the change. [16:44:25] ok! [16:45:20] 06Labs: Make ladsgroup admin on the labs 'fa-wp' project - https://phabricator.wikimedia.org/T138372#2454032 (10Andrew) Huji, any objection? [16:45:27] Matthew_ try now? http://tools.wmflabs.org/xtools-ec/ [16:46:01] Appears to work for me. I'll keep an eye on Phabricator and GitHub and let you know if there are any issues. [16:46:17] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: mariadb doesn't come up properly on silver after reboot - https://phabricator.wikimedia.org/T125987#2454039 (10Andrew) 05Open>03Resolved [16:46:21] Matthew_ \o/ ok! [16:46:33] Matthew_ any other php tools I could move? [16:46:33] :) [16:46:42] Hum... [16:47:24] 06Labs, 06Operations, 10ops-codfw: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2454045 (10Andrew) p:05Normal>03Lowest [16:48:18] zhuyifei1999_: encoding01 drained yet? [16:48:52] yuvipanda: I assume you want big tools right now? [16:49:02] Matthew_ any kind really [16:49:17] not yet, haven't depooled yet. there's still 3 tasks in pending state [16:49:26] You could probably move the rest of the xtools stack. The other ones I'm maintainer for aren't active... [16:49:36] Except for peachy-docs. If you want to :) [16:49:48] and 2 aborting while in pending (which outcome is untested) [16:49:55] Matthew_ looking... 
[16:50:09] Matthew_ moving xtools-articleinfo now [16:50:30] zhuyifei1999_: ok, no problem [16:50:35] and it'll take a few more hours until the running tasks on 01 to finish after I graceful shutdown the service [16:50:42] k [16:50:52] 06Labs: Make ladsgroup admin on the labs 'fa-wp' project - https://phabricator.wikimedia.org/T138372#2454072 (10Andrew) p:05Triage>03Normal [16:51:02] Matthew_ try xtools-articleinfo? [16:51:57] Looks good to me. [16:52:21] Matthew_ moving 'xtools' itself now [16:53:17] OK! [16:54:29] Matthew_ xtools has problems, I'm moving it back to gridengine. try now? [16:57:37] Matthew_ xtools-dev also had similar problems so I just moved them back [16:57:40] but article-info seems ok [16:57:56] so xtools-ec and xtools-articleinfo now run on k8s! \o/ [16:58:21] Yes. The two seem ok. xtools is still working. xtools-dev is a strange beast IIRC... it's fine that it didn't go. [16:58:27] ok [16:58:54] Matthew_ I see you are listed as maintainer for a bunch more tools :D any other I can move? [17:01:19] yuvipanda: Two ticks please. [17:01:22] 06Labs, 10Labs-Other-Projects: video project: move rendering instances to SSD servers - https://phabricator.wikimedia.org/T139802#2454147 (10zhuyifei1999) Andrew applied the puppet role and I got 02 and 03 up with a just-written-today [[https://github.com/Toollabs/video2commons/blob/0b5ce84f59444a2bf1d42cb2eb... [17:01:46] Matthew_ I don't understand the expression.... :) [17:01:56] ticks as in 'minutes'? [17:17:59] yuvipanda: Sorry, I had a co-worker walk up to my desk. You are welcome to move any project I'm listed as maintainer for. Most are unused currently, so it's not a big deal. [17:18:31] ah ok! [17:19:46] The expression is one that I picked up from a book a long time ago... it just means "I'm doing something real fast" Sorry about that... [17:20:30] Matthew_ matthewrbowker-dev and matthewrbowker moved just now [17:21:14] Look good to me. 
[17:22:17] Matthew_ moved http://tools.wmflabs.org/articlerequest-dev/ and http://tools.wmflabs.org/articlerequest/ [17:22:40] Matthew_ you wanted me to not move peachy-docs? [17:23:00] articlerequest ones look good. [17:23:11] It's OK to move peachy-docs [17:23:40] ok! moving now [17:25:12] Matthew_ http://tools.wmflabs.org/peachy-docs/ done [17:25:31] Matthew_ you are part of the 'paste' tool as well, might I move that too? [17:25:40] I am...? [17:25:48] yeah :D [17:27:00] I didn't know that actually. [17:27:09] :D [17:28:15] tom29739 any more tools you can move over? :) [17:29:05] yuvipanda, I think there's a couple/ [17:29:24] \o/ no uwsgi-plain support yet, but I think I want to have python3 [17:29:30] yuvipanda: why are we moving them? just curious, I saw you said there is no change on our end [17:30:07] musikanimal I'd like to eventually deprecate gridengine for webservices before end of 2016, so my current strategy is: 1. move things, 2. if they work fine, Great! 3. if not, fix the issues that happen, 4. go to (1) [17:30:24] gotcha [17:30:33] it gives you better resource isolation + newer software, and it gets rid of a number of racy code that we've written ourselves [17:30:52] well I have a handful of tools you are welcome to try moving [17:30:56] cool [17:31:13] musikanimal sure! which ones? [17:31:26] pure php or pure nodejs or pure python ones are easiest just now [17:31:53] before we go any further, how can I see the state of the webservices? e.g. qstat returns nothing for xtools-ec and -articleinfo [17:32:02] kubectl get pods [17:32:07] ^ [17:32:26] logs are in the same place as before (error.log and access.log) [17:32:29] webservice status also works [17:32:31] I like it [17:32:32] ! [17:32:32] hello :) [17:32:34] kubectl is very powerful [17:33:01] so are restart script needs to be modified? 
[17:33:36] actually it still does "qstat | awk '{ print $1; }' | tail -1 | xargs qmod -rj" not sure if that still would work [17:33:48] for which tool is this/ [17:33:48] ? [17:33:59] kubectl delete pod [17:34:03] ^ that works [17:34:25] yuvipanda: all of the xtools suite [17:34:38] if you recall, how it has that weird thing where it just dies and doesn't automatically restart [17:34:39] I'd think that script won't be needed on k8s [17:34:48] we can also add an actual health check [17:34:50] where it does a http call [17:34:55] and then restarts if that http call fails [17:35:04] rather than this hack [17:35:05] webwatcher.sh [17:35:11] which calls webstart.sh [17:35:13] right [17:35:18] so how about I just add a http health check instead? [17:35:32] where it'll hit the /$toolname page and if it fails, it'll restart the pod [17:35:47] http://kubernetes.io/docs/user-guide/production-pods/#liveness-and-readiness-probes-aka-health-checks [17:35:55] yup! [17:36:00] They call it a "liveness probe" [17:36:04] interesting [17:36:10] I will look into that [17:36:11] I'll add that an option to 'webservice' soon but I can add it directly to the pod just now [17:36:40] musikanimal I'm adding it to xtools-articleinfo now [17:36:49] awesome, thanks [17:39:05] musikanimal done for http://tools.wmflabs.org/xtools-articleinfo/ [17:39:08] verify it works fine? [17:39:28] yup! thank you! [17:40:11] musikanimal ok, doing it to xtools-ec just now [17:40:26] this is a much better solution than webrestarter I guess [17:40:51] sounds like it! :) [17:42:19] Getting the webservice off NFS seems to speed it up greatly. [17:42:30] I was going to say the same [17:42:35] so no more NFS? fo real!? 
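[Editor's note: the HTTP health check yuvipanda describes adding to xtools-articleinfo and xtools-ec above is Kubernetes' livenessProbe. The field names below are standard Kubernetes, but the path, port, and timings are illustrative assumptions, not the values actually deployed.]

```yaml
# Fragment of a webservice pod/deployment container spec (illustrative)
livenessProbe:
  httpGet:
    path: /xtools-articleinfo/   # hit the tool's own front page
    port: 8000                   # port the webservice listens on (assumed)
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3            # restart after three consecutive failed probes
```

The kubelet GETs the path every `periodSeconds`; after `failureThreshold` consecutive failures it restarts the container — replacing the webwatcher.sh/webstart.sh cron hack discussed above.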
[17:42:53] I was running my tool's webservice out of /tmp for a while [17:43:06] musikanimal nope, this is still NFS [17:43:08] just no more gridengine [17:43:24] dah, okay hah [17:43:26] tom29739 uwsgi is much faster on k8s because the NFS options are different [17:43:46] In k8s git repos can be used as volumes: http://kubernetes.io/docs/user-guide/volumes/#gitrepo [17:43:59] musikanimal liveness probe in place for xtools-ec too [17:44:23] works great! [17:44:24] musikanimal I've gotten rid of it from cron [17:44:39] musikanimal so what else can I move? :) [17:44:59] Community Tech has a new PHP tool, there's the test version at plagiabot [17:45:07] production tool is called copypatrol [17:46:47] musikanimal ok, I'm going to move plagiabot now. [17:48:08] musikanimal moved http://tools.wmflabs.org/plagiabot [17:48:24] ok let's stop for a second... [17:48:30] stopping [17:48:39] load /plagiabot and then load /copypatrol [17:48:40] same code [17:48:50] plagiabot is loading a zillion times faster! [17:49:06] :D [17:49:25] it's newer php, and it also has resource guarantees and stuff [17:49:33] maybe it also has less load [17:49:34] I figured it was the SQL queries [17:49:41] that was making it so slow [17:49:43] guess not! [17:49:52] anyway, plz do copypatrol too! :) [17:49:57] ok [17:50:36] musikanimal done [17:50:40] wow [17:50:43] this is amazing! [17:50:52] the team will be very excited about this [17:50:59] haha [17:51:01] nice ;D [17:51:04] please let them know! [17:51:22] I will! let me see what other stuff I can throw your way [17:51:46] my tools are written in Ruby, use a Unicorn web server, e.g. no `qstat` etc, so pretty sure it doesn't apply? [17:52:02] I don't have a ruby environment yet unfortunately [17:52:08] do you use bundler? [17:52:11] that's fine [17:52:15] nah it won't let me :( [17:52:22] ah, I want to let you to use bundler! 
[17:52:30] I manually `gem install --user-install gem-name` [17:52:34] I'll try to work on it next week [17:52:39] yuvipanda: just wondering, how do we setup monitoring ourselves (since I'm a newbie to k8s) [17:52:39] to allow you to use bundler properly [17:52:52] cool, no rush! that would be awesome [17:53:02] zhuyifei1999_ monitoring or the livenessProbe? [17:53:04] zhuyifei1999_ livenessProbe lets you restart the pod based on conditions [17:53:11] now, what about all the pageviews tools?? they all run on the grid, with lightweight PHP in the background [17:53:32] hmm I guess both are cool [17:53:33] if you want, start with pageviews-test, langviews-test, and topviews-test [17:54:21] musikanimal moving pageviews-test now. try? [17:54:49] 502 right now [17:55:02] also what's the future of those multi-purpose tools that's written in multiple languages? [17:56:20] zhuyifei1999_ good question I don't have an answer to right now. also I'll get back to you on livenessProbe shortly after moving the pageview stuff. [17:56:30] looks like we lost some symlinks [17:56:42] you can read more about it in pageviews-test-561416278-hropm and you can edit/play with the YAML by doing 'kubectl edit deployment/$toolname' [17:56:47] wait nvm [17:56:51] musikanimal ah, which ones? where were they linking to? [17:57:12] they're fine, symlinks are only on the other tools, all of them except pageviews and pageviews-test, got confused [17:57:19] I see [17:57:22] weird way that I set it up [17:57:41] (I mean those tools that are collection of scripts written in multiple languages, as php scripts or cgi scripts or static content, using lighttpd on grid) [17:57:41] musikanimal can you tell what's wrong with it from error.log? [17:57:47] looking [17:58:37] complaining about the WhichBrowser\Model\Browser library is missing, which is the first one that gets loaded I think [17:58:45] maybe need to re-run composer update? 
[17:58:50] I see [17:58:54] possibly [17:59:02] musikanimal if you run 'webservice shell' [17:59:11] it gives you a shell with php5.6 [17:59:16] which is the version that'll be running on the webserver [17:59:37] (it'll have a tiny width - please type 'stty rows 50 cols 150' to make that better - I'll have a bugfix for this shortly) [17:59:37] 06Labs, 10Horizon, 13Patch-For-Review: Disable renaming of instances on Horizon - https://phabricator.wikimedia.org/T139768#2454404 (10Andrew) @Luke081515 maybe. Renaming hosts seems generally likely to cause problems, there are too many things to keep in sync. [17:59:47] what version does the new system use? [18:00:02] debian jessie [18:00:21] new system uses php 5.6 on debian jessie [18:01:11] yuvipanda, what PHP version did gridengine use? [18:01:21] oh ok, so no code differences should be needed [18:01:42] tom29739 5.5 on trusty, 5.3 on precise [18:01:50] and 5.5 has been the default for a few years now [18:02:01] andrewbogott: the last pending task is running, I'll depool 01 in a sec [18:03:57] Apparently as of 2 days ago, PHP 5.5 is end-of-life and will receive no more support. [18:04:11] yuvipanda: how might one restart the webservice under kubernetes? 
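Since tools move from PHP 5.5 (trusty) / 5.3 (precise) on the grid to 5.6 on jessie, a tiny helper can sanity-check whether an available version satisfies a tool's minimum requirement. This is a sketch using GNU version sort, not a Tool Labs command.

```shell
# php_at_least REQUIRED AVAILABLE
# Succeeds when AVAILABLE >= REQUIRED, comparing version strings
# with sort -V (so 5.10 would correctly sort after 5.6).
php_at_least() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}
php_at_least 5.5 5.6 && echo "5.6 satisfies >=5.5"
php_at_least 5.6 5.3 || echo "5.3 is too old for >=5.6"
```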
[18:04:52] musikanimal same ol' way - webservice restart [18:04:56] /usr/bin/python /usr/local/bin/celery multi stopwait 2 --pidfile=/var/run/celery/%N.pid at pid 14519 [18:04:57] if you want to move from gridengine to kubernetes [18:04:59] you do [18:05:04] webservice --backend=gridengine stop [18:05:04] oh okay [18:05:10] webservice --backend=kubernetes start [18:05:16] if you want to move from k8s to gridengine [18:05:18] just do the reverse [18:07:21] !log tools reboot tools-worker-1012, it seems to have failed LDAP connectivity :| [18:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:08:36] yuvipanda: on pageviews-test I get "bash: webservice: command not found" [18:10:24] oh wait sorry I was in the shell [18:11:26] 06Labs, 10Horizon, 13Patch-For-Review: Disable renaming of instances on Horizon - https://phabricator.wikimedia.org/T139768#2454477 (10Andrew) p:05Triage>03Normal [18:11:31] sweet, I just restarted and now pageviews-test works great [18:14:45] \o/ [18:14:52] musikanimal what did you have to do? composer again? [18:15:00] nope, just restarted the webserver [18:15:13] in gridengine? [18:15:35] just did `webservice restart` [18:16:21] wanna try langviews-test and topviews-test? [18:16:48] let me look at what happened to pageviews-test [18:16:58] indeed, it seems ok! [18:18:01] musikanimal http://tools.wmflabs.org/langviews-test/ works too [18:19:37] yuvipanda: the only problem I'm seeing is restarts take a bit longer [18:19:57] musikanimal yup, I've a fix for that about to merge [18:20:13] it's only pageviews and its family of tools that I really care about downtime for [18:20:17] right [18:20:31] musikanimal topviews-test also done [18:20:58] so are we otherwise pretty darn confident in kubernetes? that it's stable and what not? [18:21:08] pageviews is all JavaScript, so really all it has to do is load [18:21:16] there's the i18n which is PHP [18:23:10] musikanimal yeah, I think so.
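The backend switch described above boils down to two `webservice` commands. The sketch below is a dry run that only echoes the commands it would run; the real Tool Labs `webservice` CLI is never invoked.

```shell
# migrate_webservice FROM TO
# Echo the stop/start pair that moves a tool between backends,
# exactly as described in the channel: stop on the old backend,
# start on the new one.
migrate_webservice() {
  local from=$1 to=$2
  echo "webservice --backend=$from stop"
  echo "webservice --backend=$to start"
}
migrate_webservice gridengine kubernetes
```

Running the function with the arguments reversed prints the commands for moving back from k8s to gridengine, matching "just do the reverse" above.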
a lot of magnus' tools are on it now [18:23:15] we have 64 tools on there [18:23:24] musikanimal grrrit-wm has been on it for months now [18:24:12] yuvipanda: [18:24:29] ok cool. and you said you'll be able to speed up the restart process? [18:24:30] I guess k8s stuffs could be listed in http://tools.wmflabs.org/?status ? [18:24:53] musikanimal yeah, you can try the new speed already if you use the test install on '/tmp/tools/bin/webservice' [18:26:31] zhuyifei1999_ yeah, that needs to happen. It'll probably end up being in a different place though. [18:26:46] hmm okay [18:27:26] btw your video is 38% done [18:27:32] \o/ [18:33:49] !log wikistats shutting down instance wikistats-southpark, lastlog said not used since February [18:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats/SAL, Master [18:36:45] musikanimal ok, I've updated webservice to a newer version with faster restarts [18:37:37] do you wanna move them? :D [18:44:19] lemme try it out [18:44:28] ok [18:44:40] nice! that was fast :) [18:45:06] alright, I'm sold. For pageviews we have pageviews, langviews, topviews, siteviews and massviews [18:45:11] I guess we should try one at a time [18:45:25] ok :D [18:46:52] on the hunt to delete more instances [18:47:01] from git project ..that we don't need [18:48:01] !log git deleted instance git-phab [18:48:03] !log git deleted instance git-phab-03 [18:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master [18:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master [18:48:33] mutante \o/ <3 [18:48:41] :) [18:55:30] having an odd occurrence in one of my labs instances, it uses the lxc version of mediawiki vagrant. the setup.sh script installs mediawiki-vagrant-0.14.0.gem every time it's run, and the vagrant command thinks the mediawiki-vagrant gem isn't installed [18:55:58] might just blow the instance away and reload, probably easier?
[18:56:04] musikanimal yeah, one tool at a time sounds great :D [18:56:11] ebernhardson always my recommendation if it's easy [18:56:16] yuvipanda: :) [19:12:59] ebernhardson: is your vagrant command actually /usr/local/bin/mwvagrant? And are you trying to run ./setup.sh manually or letting Puppet do it? [19:23:09] musikanimal I'm gonna be gone in about 10-15mins. Feel free to move them without me being around. if not I'll bug you tomorrow! [19:25:41] oh, didn't realize you were having me do it! what's the procedure? [19:25:55] we can deal with this when you're back, no rush [19:26:29] musikanimal oh, I'd love to get it over with :D https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web/Kubernetes#Switching_between_GridEngine_and_Kubernetes is the procedure, pretty simple [19:26:34] shouldn't take more than a minute per tool [19:26:37] can we do it now? :D [19:26:47] sure! [19:27:14] ok! [19:27:22] musikanimal do you wanna do it or shall I? :D [19:27:32] I'll give it a try! [19:27:38] ok! [19:28:03] that was really easy [19:28:16] and no noticeable downtime [19:29:04] \o/ [19:29:06] musikanimal which tool did you move? [19:29:15] pageviews [19:29:19] having an issue with topviews, though [19:29:40] https://www.irccloud.com/pastebin/EN7QiTsX/ [19:30:07] musikanimal yeah, there's a race condition. delete the 'service.manifest' file and try again? [19:30:15] musikanimal actually no, I'm looking [19:30:30] I tried again and it worked I think [19:30:35] right [19:30:36] ok [19:30:39] yeah, it's a bit of a race there [19:34:03] alright, all pageviews tools migrated and working :) [19:34:10] thanks, this speeds things up for those tools as well [19:35:20] that's it for my tools. Thanks again! excellent improvement [19:36:07] awesome [19:47:13] musikanimal thanks for the help, and let me know if anything goes wrong. [19:47:23] will do! [19:48:43] Since you're around... Got a quick question. How does one acquire database credentials for the replicas for a labs instance?
Not tool labs but labs proper. [19:48:56] zhuyifei1999_: ready? (I have no reason to be impatient, I just bother you every time I get to a stopping point with my other work) [19:49:28] checking [19:49:45] Matthew_: the "easiest" thing to do is create a tool and use the creds it gets from your project [19:49:49] 2 ffmpegs still running [19:50:04] bd808: That's what I'm doing now :) Is that a problem? [19:50:12] so I guess not [19:50:59] Matthew_: no, I don't think it is really. There might be a tracking task in phab about asking for project credentials. I think a root has to make them manually [19:51:17] bd808: Okay. Then I'll keep it the way it is. Thank you :) [19:51:31] zhuyifei1999_: ok! [19:52:52] andrewbogott: http://tools.wmflabs.org/nagf/?project=video should be able to tell its status [19:53:25] and I have a fork in http://tools.wmflabs.org/yifeibot/nagf/?project=video that shows the load [19:55:58] zhuyifei1999_: 01 is already depooled, so as soon as its load drops that means it's done for good? [19:57:27] well, before it's deleted can you check `ps -A u | grep celery` and make sure no tasks are doing stuffs like uploading (low load stuffs)? [19:57:36] other than that, yeah [19:59:25] 'k [20:00:52] the "Wikimania 2016 Closing Ceremony" video (/srv/v2c/output/db682b4e05e13d51/) will probably take a few more hours [20:02:43] andrewbogott: oh that video might end up in the server-side-upload temporary storage in /srv/v2c/ssu, can you back up that directory? [20:03:42] zhuyifei1999_: I really don't want to get my hands dirty in that project — just let me know when/if I can move things. Later in the week is fine. [20:03:55] ok [20:06:35] 06Labs, 10Labs-Infrastructure: Rebalance labvirt1010 - https://phabricator.wikimedia.org/T137719#2455052 (10Andrew) p:05Triage>03Normal Labvirt1010 is behaving OK right now. We don't want to add anything new, but https://gerrit.wikimedia.org/r/#/c/298480/ should take care of that. So once that patch (or...
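The credentials a tool gets, as bd808 suggests above, land in the tool's `replica.my.cnf`, an ini-style file with a `[client]` section. The values below are fakes; only the format matches what Tool Labs generates.

```shell
# Create a fake replica.my.cnf and pull the user out of it, the way a
# script connecting to the replicas might.
creds=$(mktemp -d)/replica.my.cnf
cat > "$creds" <<'EOF'
[client]
user = u12345
password = notarealpassword
EOF
# Split each line on ' = ' and print the value of the 'user' key.
db_user=$(awk -F' *= *' '$1 == "user" {print $2}' "$creds")
echo "connect as: $db_user"
```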
[20:09:22] andrewbogott: sorry for the spam about the memory overcommit :( [20:09:46] I have really no clue how it is configured / supposed to work. I am just throwing random hints and I should probably stop! [20:10:12] it seems like a cool feature but I fear such complexity :) [20:11:10] oh [20:11:32] I would trust the super intelligent to have something that works after 7 years or so [20:11:39] but definitely fear it completely exploding =] [20:12:00] our puppet code has a check to prevent some linux kernel from being installed, hinting about KSM [20:12:05] so maybe it is in use already [20:13:46] confirmed: It's enabled on all labvirt hosts [20:13:49] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2453539 (10chasemp) Ok my understanding of our issue is that yes we allowed for overprovisioning...and it happened. I don't know that it's all bad. It's pretty much SoP for any... [20:14:03] oh [20:14:18] andrewbogott: is there a ksmd daemon running as well? [20:14:32] potentially that would be how we can allow RAM overcommitment [20:14:52] https://www.irccloud.com/pastebin/U7w26loM/ [20:15:02] hashar: ^ I think that means it's enabled and running everywhere [20:15:11] I also hinted at a parameter that reserves some amount of memory for the host. Defaults to 512MBytes which is definitely super low. [20:15:23] oh [20:15:32] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2455085 (10chasemp) FWIW openstack changed to take into account total RAM for overprovision https://github.com/openstack/nova/commit/1b40708287808243be27b83791b7d23f8b51b194 KVM a...
[20:16:44] andrewbogott: https://www.kernel.org/doc/Documentation/vm/ksm.txt hints about a bunch of metrics under /sys/kernel/mm/ksm/ [20:16:54] and indication of which ratios to watch for [20:17:11] the kvm docs on memory management explain overcommit pretty well [20:17:22] maybe that could give an accurate view of memory overcommit by KSM if any [20:17:48] sorry if I disturb you. The topic kind of haunted me much of the week-end :D [20:17:56] so 'Maybe KVM manage to magically share the unused RAM between instances ' is more or less what happens, but no more so than how procs share memory [20:18:05] but either way it's a concern [20:18:24] I explained here https://phabricator.wikimedia.org/T140119#2455070 [20:19:59] ah with links! neat [20:20:30] I also found the memory ballooning concept which is to shrink the guest total memory [20:20:40] but apparently OpenStack does not use that and always allocate the max mem [20:23:41] chasemp: changing it to 1.0 won't really affect the behavior on labvirt1001-1009 anyway, since we'll hit the disk limits before we hit the RAM limits. So it really doesn't cost us much in the near-term… it just means being slightly inefficient with the new servers. [20:23:46] Doesn't bother me in the least, really. [20:24:51] tom29739 you might like http://tools-prometheus.wmflabs.org/tools/ - actual CPU / RAM and other usage statistics per-container, and so per-tool [20:25:01] yeah, bottom line is probably -- it's not actually ok so let's stop doing it :) [20:25:03] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2455149 (10Andrew) I'm convinced that 1.2 is the optimum answer and that 1.0 is the safest answer. I certainly won't fight against 'safe' -- as the graphs show, the difference is n... [20:25:13] andrewbogott: do you need anything from me regarding the video project ? 
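The counters under /sys/kernel/mm/ksm/ that hashar points at include pages_shared (unique pages KSM keeps) and pages_sharing (guest pages deduplicated onto them); their ratio is one of the effectiveness indicators ksm.txt describes. This sketch fakes the sysfs directory with made-up numbers so the arithmetic is self-contained.

```shell
# Fake the two KSM counters in a temp dir standing in for
# /sys/kernel/mm/ksm/, then compute the sharing ratio.
ksm=$(mktemp -d)
echo 1000  > "$ksm/pages_shared"    # unique pages kept (made up)
echo 15000 > "$ksm/pages_sharing"   # pages mapped onto them (made up)
awk -v sharing="$(cat "$ksm/pages_sharing")" \
    -v shared="$(cat "$ksm/pages_shared")" \
    'BEGIN { printf "KSM sharing ratio: %.1f\n", sharing / shared }'
```

A high ratio means many identical guest pages are being merged, which is what would let RAM overcommit work out in practice.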
[20:25:25] matanya: nope, just waiting for 01 to drain so I can kill/rebuild it [20:25:32] I think zhuyifei1999_ has already taken care of 02 and 03 [20:25:37] ok, cool [20:25:43] 06Labs, 10Labs-Infrastructure: Shrink default quota for labs projects - https://phabricator.wikimedia.org/T140158#2455158 (10yuvipanda) [20:25:51] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Review Nova RAM overcommit ratio - https://phabricator.wikimedia.org/T140119#2453539 (10yuvipanda) T140158 is related. [20:26:03] andrewbogott: how does the disk provisioning work exactly though, I'm less clear on it, as I've not seen it before [20:26:25] say we give an xlarge and it's got 180G, even though we partition 18G within for / [20:26:35] chasemp: hang on, I'm confused — looks like you're advocating for 1.0 (RAM) here but just +1'd a patch that does 1.2? [20:26:44] does that 180G now count as part of allocated? [20:27:07] andrewbogott: I made a note 1:1 was probably best and then +1'd the idea I guess in general :) [20:27:22] sorry that was confusing [20:27:25] yuvipanda, ah nice, now I can see what my stuff is using :) [20:27:47] chasemp: so, disk-space... in my chart I have the 'Committed' column. In your example that 180G would appear as part of 'committed' [20:28:00] but not part of 'Actual Used' until the partition is created and filled with data [20:28:43] right [20:28:49] ok that's what I imagined but making sure [20:29:19] it's really going to get messy then w/ a 9:1 ratio on most xlarge for actually used disk and committed [20:29:41] i.e.
over 50% of them that I saw were still only using the 18G / and no /srv [20:29:53] but they are all eating huge portions of committed [20:32:06] 06Labs, 10Labs-Infrastructure: Shrink default quota for labs projects - https://phabricator.wikimedia.org/T140158#2455158 (10chasemp) yes I think for /default/ allocation we do `Shrink it to 8 cores + 16G of RAM - 1 xlarge or equivalent number of smaller instances (upto 4 mediawiki-vagrant instances, for examp... [20:33:06] 06Labs, 10Labs-Infrastructure: Shrink default quota for labs projects - https://phabricator.wikimedia.org/T140158#2455158 (10Andrew) Sounds good to me. The vast majority of projects don't come anywhere close to the quota in any case. [20:33:23] andrewbogott: not a strawman ask, but why 1.2 on RAM? [20:33:41] chasemp: The text in that phab task shows numbers and explains. [20:33:49] but I just re-submitted with 1.0 anyway :) [20:34:36] this https://phabricator.wikimedia.org/T140119#2455149 ? [20:34:56] In the description: "So, at the very least we need to lower that ratio, as 1.5 is clearly too high. Lowering it to 0 would be the conservative choice, but might be overreacting since right now labvirt1010 is stable and it's still exhibiting a 1.3 overcommit." [20:35:12] that doesn't even say 1.2 in it :D [20:35:18] why 1.2 specifically I wasn't getting [20:36:04] Oh — arbitrary, I guess. [20:36:18] Taking 1.3 as an upper bound (since more than that was clearly causing problems) [20:36:25] and adding some slack between us and the upper bound. [20:36:59] at 1.2 are we still playing a game of chance w/ actual VM usage and issues on a host? [20:37:33] which isn't a death trap at all, we depend on load spread everywhere [20:37:53] Yes, as I understand it with anything over 1.0 it's possible to hit a pathological case (where all instances use all their RAM at the same time) and have problems.
[20:38:23] (unless KSM guarantees us a certain amount of safety, and I don't have any data about what that amount would be) [20:38:25] yeah I'm torn, but I think we should play it safe for now [20:38:35] but I see your thinking now thanks [20:38:37] +1 for playing it safe [20:40:34] well, there's 'safe' in both directions. e.g. if we lower ratios so much that nothing can be scheduled anymore, that isn't 'safe' either :) [20:40:44] But in this case, 1.0 is fine, I'm just splitting haiars [20:40:46] hairs [20:41:20] In other news, 75% of the time that I try to type the word 'ratio' I instead type the word 'ration' [20:41:43] sure I don't think we are looking to shut down the system for safety kind of thing [20:43:18] but it seems clear we need to deploy with swap to compensate, set oom ratings to eat disposable VMs first if we could define such a thing, wait for explosion or go safe and rethink [20:44:35] probably better to not be able to spawn an instance than having a labvirt die terribly? [20:46:03] hashar: yeah, that's definitely better. Oddly that is something I've had to fight for among the nova devs :/ [20:46:38] oh and I found out Diamond has a collector for the Linux KSM thing ( /usr/share/diamond/collectors/ksm/ksm.py available on our Jessie version ) [20:48:02] https://github.com/python-diamond/Diamond/blob/v3.5/src/collectors/ksm/ksm.py [20:51:45] andrewbogott: I have no idea what the politics behind nova development is. I guess preparing a spec and attending the dev summit to defend it would work [20:51:50] but that is a big investment in time [20:52:47] hashar: I think they'll accept my patch eventually.
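The ratios under discussion translate directly into schedulable RAM: physical RAM times the overcommit ratio. The 256G host size below is made up for illustration; the ratios are the ones the channel debates.

```shell
# Back-of-envelope overcommit arithmetic for a hypothetical 256G labvirt:
# at ratio 1.0 nova schedules up to physical RAM, above that it
# overcommits and relies on guests not all using their RAM at once.
for ratio in 1.0 1.2 1.5; do
  awk -v r="$ratio" \
      'BEGIN { printf "ratio %.1f -> %dG schedulable on a 256G host\n", r, 256 * r }'
done
```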
I'm just surprised that the status quo solution to prevent scheduling on an already-full host is "keep an eye out for that" [21:04:35] andrewbogott: I have no idea how clouds manage their infra / overcommit really :( [21:12:37] anyway time to sleep with a bunch of kvm/malloc etc related literature [21:15:10] 06Labs, 10Incident-20151216-Labs-NFS, 06Operations: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#2455332 (10chasemp) We are basically in a holding pattern as we (...well me I guess) tries to get labstore2003/2004 going so we can shift load so... [21:15:56] 06Labs, 06Operations: Failed drive in labstore2001 array - https://phabricator.wikimedia.org/T139937#2455341 (10chasemp) 05Open>03Resolved ```md0 : active raid1 sdb1[2] sda1[0] 1952839680 blocks super 1.2 [2/2] [UU] bitmap: 1/15 pages [4KB], 65536KB chunk ``` [21:30:58] i shut down (not deleted) an instance earlier today [21:31:11] now i want to power it back up. but it stays status SHUTOFF and "failed to reboot" [21:31:15] when i click "reboot" [21:31:21] anything in the log? [21:31:28] you mean console output? [21:31:36] yeah [21:31:44] it shows how it shut down when i did that earlier [21:31:50] what instance? [21:31:59] can't help w/o that :) [21:32:04] wikistats-southpark.wikistats.eqiad.wmflabs [21:33:26] You have a south park wiki [21:33:41] no, it's the name of a user :p [21:33:44] mutante: it's on now I think [21:33:46] SPF [21:33:51] chasemp: thank you ! [21:33:57] oh [21:34:03] i see it as active, thx [21:34:16] SPF|Cloud = southpark(fan) [21:34:36] Oh [21:34:43] OMG SSH works!?
thanks guys [21:34:46] You're a southpark fan [21:34:53] heh:) [21:37:00] where I grew up is http://www.cityofwaupaca.org/parksnrec/?parks=south-park [21:37:07] so it's sort of confusing to me for a second :) [21:38:29] 👍 https://phabricator.wikimedia.org/p/Southparkfan/ [21:38:37] (needs login) [21:38:55] Oh, south park is a park and tv show [21:39:06] anyway, thanks for fixing it. looks like the instance isn't needed anymore so I'll check with mutante if it can be deleted (saving some labs resources.... :)) [21:39:29] so how'd you do it chasemp? [21:39:44] SPF|Cloud, thank you! [21:39:55] it is on comedy central here. [21:40:03] mutante: the instance can be deleted. [21:40:05] Krenair: nova start $uuid [21:40:25] that's it? okay.. [21:40:36] !log wikistats nova start wikistats-southpark.wikistats.eqiad.wmflabs [21:40:39] I figured you did some really crazy magic or something [21:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats/SAL, Master [21:40:47] nope [21:41:26] SPF|Cloud: ok, deleting it. thanks for checking [21:41:27] http://www.comedycentral.co.uk/tv-guide [21:42:02] !log wikistats deleted wikistats-southpark instance to free resources [21:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats/SAL, Master [22:02:34] 06Labs, 10wikitech.wikimedia.org: Labs front-page statistics are very wrong - https://phabricator.wikimedia.org/T139773#2455604 (10Andrew) @ If you delete an instance at horizon, the wikitech won't get deleted That is incorrect -- the pages are deleted by a callback within nova that's triggered by deletion. T...
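The fix chasemp applied was simply `nova start $uuid` against the SHUTOFF instance. The sketch below is a dry run of that decision: it only echoes the command it would run, and the uuid is a placeholder.

```shell
# recover_instance STATUS UUID
# Echo the nova action appropriate for the instance's current status,
# mirroring the troubleshooting above ('reboot' fails on a SHUTOFF
# instance; 'nova start' is what actually powers it back up).
recover_instance() {
  local status=$1 uuid=$2
  case "$status" in
    SHUTOFF) echo "nova start $uuid" ;;
    ACTIVE)  echo "nothing to do for $uuid" ;;
    *)       echo "check console log for $uuid" ;;
  esac
}
recover_instance SHUTOFF placeholder-uuid
```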
[22:16:20] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 07Tracking: Packages to be installed in Tool Labs Kubernetes Images (Tracking) - https://phabricator.wikimedia.org/T140110#2455680 (10Danny_B) [22:36:51] 06Labs, 10Labs-Infrastructure: Investigate rabbitmq tcp_listen_options setting (and others) - https://phabricator.wikimedia.org/T140175#2455779 (10Andrew) [22:38:56] 06Labs, 10Labs-Infrastructure: Investigate rabbitmq tcp_listen_options setting (and others) - https://phabricator.wikimedia.org/T140175#2455793 (10Andrew) Probably unrelated, but there's also this: WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is greater than warning threshold: 20. Ther... [22:54:02] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Fdapuzzo was created, changed by Fdapuzzo link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Fdapuzzo edit summary: Created page with "{{Tools Access Request |Justification=For educational purposes about databases. |Completed=false |User Name=Fdapuzzo }}"