[04:17:15] PAWS: Remove clusters tab - https://phabricator.wikimedia.org/T120629#1857538 (jayvdb) NEW
[04:27:52] YuviPanda: I think that something funky in openstack is breaking jenkins/zuul now. The instances in https://wikitech.wikimedia.org/wiki/Nova_Resource:Contintcloud are all in a deleting state and zuul is backed up because of that.
[04:28:31] that's the weird project that spawns jessie instances on demand for jenkins jobs
[04:50:58] (PS7) Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296
[04:54:25] (PS8) Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296
[04:57:32] (CR) Ricordisamoa: "PS7 adds and fixes JSHint" [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296 (owner: Ricordisamoa)
[04:57:44] (CR) Ricordisamoa: "PS8 fixes .jshintrc" [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296 (owner: Ricordisamoa)
[04:59:05] (PS9) Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296
[05:01:18] (CR) Ricordisamoa: "PS9 adds JSCS" [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296 (owner: Ricordisamoa)
[05:09:06] PAWS, pywikibot-core: "Open in browser" option does not work in PAWS - https://phabricator.wikimedia.org/T120632#1857580 (jayvdb) NEW
[05:14:27] (PS10) Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296
[05:16:25] (CR) Ricordisamoa: "PS10 fixes JSCS" [labs/tools/wikidata-slicer] - https://gerrit.wikimedia.org/r/241296 (owner: Ricordisamoa)
[05:20:47] PAWS, pywikibot-core: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857587 (jayvdb) NEW
[05:21:01] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857587 (jayvdb) p:Triage>High
[05:40:44] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857599 (Legoktm) Right click + paste should work.
[05:42:24] PAWS, pywikibot-core: "Open in browser" option does not work in PAWS - https://phabricator.wikimedia.org/T120632#1857601 (Legoktm) From `#pywikibot` a few days ago: ``` [14:02:10] YuviPanda: how good is jupyterhub integration with the browser? pywikibot has a "open in browser" thing which uses...
[05:59:06] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857610 (yuvipanda) https://github.com/jupyter/notebook/issues/104
[06:02:41] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857611 (yuvipanda) Works on chrome but not firefox.
[06:03:28] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857612 (yuvipanda) Looks like this is actually in the *4.1* release, while we're currently running 4.0
[06:06:08] PAWS, pywikibot-core: "Open in browser" option does not work in PAWS - https://phabricator.wikimedia.org/T120632#1857613 (jayvdb) Until better integration with jupyterhub is possible, `webbrowser` should fallback to using a terminal web browser. https://docs.python.org/2/library/webbrowser.html It app...
[06:06:32] PAWS, pywikibot-core: "Open in browser" option does not work in PAWS - https://phabricator.wikimedia.org/T120632#1857615 (yuvipanda) I can install one of them. Which would you prefer?
[06:16:20] PAWS, pywikibot-core: "Open in browser" option does not work in PAWS - https://phabricator.wikimedia.org/T120632#1857616 (jayvdb) I think the safest is to use `links`, as the `webbrowser` module is selecting that first if it is present.
[06:17:55] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857617 (jayvdb) >>! In T120633#1857599, @Legoktm wrote: > Right click + paste should work. Doesnt work for me in Firefox :/
[08:43:13] Labs: mathosphere.math.eqiad.wmflabs does not respond irregulary - https://phabricator.wikimedia.org/T120637#1857694 (Physikerwelt) NEW a:Physikerwelt
[09:23:58] Labs, MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1857732 (Qgil)
[09:26:39] Labs, MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1857738 (Qgil)
[10:00:03] PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857769 (yuvipanda) Yeah, we need to upgrade. I am asking around to see when they plan on releasing.
[10:05:11] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857773 (hashar) Nodepool (the CI process that spawn instances on labs) has been broken since Dec 6th 23:00 UTC at least. It is no more able to spawn instan...
[10:08:54] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857775 (hashar) p:Triage>Unbreak! On the `contintcloud` labs project, I can get a list of servers and they are reachable: ``` nodepool@labnodepool1001:...
[10:13:36] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857793 (hashar) The last instances spawned were apparently on 2015-12-06T14:00:17 UTC.
[10:16:02] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857794 (yuvipanda) There is a bunch of VMs that are stuck in the building/scheduling state: ```root@labcontrol1001:/home/yuvipanda# nova show 9bb7842a-4580-...
[10:27:44] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857830 (yuvipanda) Ok, I restarted nova-conductor and nova-scheduler, and that seems to have fixed it. There is also a labsservices1001 puppet failure that m...
[10:27:54] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857831 (yuvipanda) p:Unbreak!>High
[10:40:45] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.007 second response time
[10:42:13] wah
[10:42:19] wtf
[10:45:49] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 951207 bytes in 4.566 second response time
[10:46:27] !log tools restarted nscd on tools-proxy-01
[10:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[10:49:18] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857898 (yuvipanda) The puppet failures were because the ip alias generator found instances that had no address (because they were stuck in the scheduler): `...
[10:54:45] Is labs down? I get a 502 when visiting both tools and librarybase.
[10:55:11] tarrow: 11:46 AM !log tools restarted nscd on tools-proxy-01
[10:55:13] looks vaugly related
[10:55:26] yes I'm investigating
[10:55:29] :D
[10:55:31] the nscd fixed it first time
[10:55:33] but it's back
[10:55:37] hm...
[10:55:43] thanks!
[10:56:00] CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.007 second response time
[10:56:01] hah, what a great check
[10:56:23] yep :P
[10:56:46] internal DNS is dead
[10:57:38] YAY!
[10:57:44] and I've no idea why
[10:58:34] https://pbs.twimg.com/media/BjRPqWpCEAAGis6.jpg:large
[10:58:47] just kidding
[11:02:47] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.005 second response time
[11:29:22] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1857950 (Luke081515) hm, creation of my instance rcm-5 worked. I start the creation at 21UTC, and one and a half hour later, it finisshed, so this is a long t...
[11:35:24] can someone take a look at horizon.wikimedia.org? The design seems broken for me
[11:37:22] Luke081515: can you file a bug? it's almost 4AM and I have to go sleep now :(
[11:37:27] hopefully no more outages when I sleep!
[11:40:56] ok, then I would file a task now
[11:41:14] thanks Luke081515
[11:41:34] YuviPanda: Good regeneration :)
[11:41:43] ty :)
[11:43:30] Labs, Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1857967 (Luke081515) NEW
[11:43:34] done
[11:57:37] PAWS: Make the default PS1 more helpful - https://phabricator.wikimedia.org/T120560#1857988 (jayvdb) 'current wiki' would require finding and parsing user-config.py.
[12:03:14] PROBLEM - Host ToolLabs is DOWN: PING CRITICAL - Packet loss = 100%
[12:05:00] :/
[13:46:02] !log tools The new grid masters are happy, killing the old ones (-shadow, -master)
[13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:47:53] Labs, Tool-Labs: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1858183 (coren)
[13:47:55] Labs, Tool-Labs: Move tools-master and tools-shadow to trusty - https://phabricator.wikimedia.org/T94791#1858181 (coren) Open>Resolved The new masters are happy on Trusty; the old ones have been deleted.
[13:52:49] mobrovac: apparently YuviPanda stated earlier that the labs DNS is dead :(
[13:53:06] ah i see
[13:53:07] :(
[13:53:48] i must have misread him email to ops-l then
[13:54:01] i gathered it has been fixed
[13:54:08] apparently not :(
[13:54:21] I think it is partially fixed.. but not fully.
[13:55:09] I can access a labs project from outside but access from one instance to another is still broken for me
[13:55:22] yup, same here
[13:55:39] * Coren tries to catch up with that email thread.
[13:55:40] curl -v en.wikipedia.beta.wmflabs.org/w/api.php works form the outside, not from within the project though
[13:57:37] mobrovac: I seem to be able to resolve that name fine, from within tools.
[13:58:14] (AFAICT, Yuvi's fix is partial only insofar as the cause has been avoided but not fixed)
[13:58:43] Coren: from within deployment-prep, that resolves to 208.80.155.135, when it should resolve to 10.68.18.103
[13:59:05] Ah! I see what you mean - it resolves to the outside IP.
[13:59:05] so i guess inter-projects dns works, but intra-project it doesn't?
[13:59:08] yup
[14:01:52] Wait, that's a DNS entry for a public IP - those never worked from the inside without trickery.
[14:03:18] Coren: right, but until a couple of hours ago they were resolved correctly, now they don't any more
[14:04:46] is there a task filled for the DNS issue?
[14:06:01] I have a known build at 06:46am UTC and a breakage at 12:46 UTC
[14:06:06] so something happened this europe morning
[14:06:10] mobrovac: Ah - that might be a side effect of the emergency fix Yuvi deployed now that I look in depth in the ticket. Looks like he had to comment out part of the code that handles that.
[14:07:02] heh
[14:08:53] (Which is what confused me because T120586 really isn't DNS)
[14:09:23] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858207 (coren) a:coren
[14:11:20] Labs, Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1858221 (hashar)
[14:12:45] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[14:13:01] * Coren stares at shinken.
[14:13:11] Coren: instance creation are working again ( https://phabricator.wikimedia.org/T120586 )
[14:13:12] That host doesn't even exist anymore, silly cow.
[14:13:25] yuvi kept the task open for investigation
[14:13:36] so one can try figure out what happened starting roughly 24 hours ago
[14:13:54] hashar: Yes, and it looks like the DNS issue was caused by the stuck instances leading Yuvi to have to disable that bit of code.
[14:14:09] hashar: Which is what I'm looking into now.
[14:14:15] ok :}
[14:15:59] RECOVERY - Host tools-shadow is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[14:16:10] Coren: and I then noticed labservices1001 had a puppet run failure which is what yuvi fixed
[14:16:15] might or might not be related
[14:18:03] It is, indirectly.
[14:23:39] mobrovac: That should have fixed it?
[14:24:13] mobrovac: (You may have to flush nscd cache to see it - nscd -i hosts)
[14:25:25] k, lemme try
[14:26:32] Coren: yup, works@ cheers!
[14:45:03] andrewbogott: When you're around, can you take a quick peek at https://gerrit.wikimedia.org/r/#/c/257323 ?
[14:46:26] Coren: looking
[14:46:57] Coren: I can’t log in to irccloud but it has a tight grip on my nick.
[14:49:01] andrewbogott_: Can't you ask nickserv to ghost it?
[14:49:23] probably. Want to figure out what’s happening first
[14:50:59] hah, can’t connect to the irccloud support channel unless I’m using irccloud
[14:52:35] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858278 (coren) Open>Resolved @yuvipanda was correct that the puppet error was caused by the stall in the scheduler; the patch abov...
[14:57:34] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1858287 (hashar) Since people mentioned issue with the DNS resolver aliasing, it is working again: ``` deployment-bastion:~$ dig +short...
[15:06:42] andrewbogott: Ima merge and babysit https://gerrit.wikimedia.org/r/#/c/256693/ now; it touches the access.conf of pretty much every instance, but it should be a functional noop. I'm going to be keeping a close eye on it but if you see anything screwy with access to instances wave at me?
[15:07:12] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[15:07:26] Coren: ok…
[15:07:58] andrewbogott: Or would you like to give the patch a once over first? In case me and p.aravoid missed something?
[15:09:17] Coren: I don’t think I’ll have anything useful to add pre-breakfast
[15:11:15] andrewbogott: Want me to wait until your brain has carbs?
[15:11:27] yeah, if you don’t mind, maybe merge post-meeting
[15:16:00] No worries.
[16:41:02] Labs, Labs-Infrastructure, operations, ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1858577 (coren) I'm attempting to reproduce the issue without any shelf attached as a first attempt at isolating where the actual issue lies. If that happens with...
[16:47:44] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9)
[16:48:44] Coren: is that something?
[16:49:05] andrewbogott: Yeah, it's my not understanding how to tell shinken that those hosts don't exist anymore.
[16:49:21] (And also puzzling on how it can get regular RECOVER)
[16:52:44] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[16:53:04] andrewbogott: Like that ^^.
[16:53:19] Amusing how an instance that no longer exists can be up.
[16:53:45] hm, curious
[17:23:08] Coren: can you merge and deploy https://gerrit.wikimedia.org/r/#/c/257062/ at some point? thanks.
[17:23:53] (CR) coren: [C: 2] "Yeah, 256M was rather too conservative." [labs/toollabs] - https://gerrit.wikimedia.org/r/257062 (https://phabricator.wikimedia.org/T120517) (owner: Zhuyifei1999)
[17:24:06] (CR) coren: [V: 2] "Yeah, 256M was rather too conservative." [labs/toollabs] - https://gerrit.wikimedia.org/r/257062 (https://phabricator.wikimedia.org/T120517) (owner: Zhuyifei1999)
[17:24:49] Coren: can we increase it some more, to 512? 400 seems an odd number...
[17:24:58] Labs, Tool-Labs, Patch-For-Review: Raise the default memory allocation to jsub/jstart jobs from 256M to at least 400M. - https://phabricator.wikimedia.org/T120517#1858864 (coren) 400M is still reasonable; and 256M was arguably too conservative given it didn't allow most scripting langues.
[17:25:16] I don't want to set it too big - 400M covers php-cli and perl without issue.
[17:25:41] Labs, Labs-Infrastructure, operations: logrotate/disk space on silver for nutcracker log - https://phabricator.wikimedia.org/T120683#1858873 (Dzahn) NEW
[17:25:54] Coren: ok
[17:27:11] I have no issue about any other reasonable value; I just didn't want to bikeshed the suggested 400M
[17:27:54] Coren: can we increase to 512? :)
[17:28:10] YuviPanda: If you feel it worth it, sure. :-)
[17:29:43] Coren: yes :) can you bump it up and roll it into the build? :)
[17:29:55] YuviPanda: Will do after the meetings.
[17:29:58] Coren: thanks.
[17:30:38] Labs, Tool-Labs, Patch-For-Review: Raise the default memory allocation to jsub/jstart jobs from 256M to at least 400M. - https://phabricator.wikimedia.org/T120517#1858903 (coren) a:coren After discussion on IRC, we'll go with a more generous and round 512M limit.
[17:37:42] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9)
[17:42:59] YuviPanda: can you make ^ stop happening? It scares me
[17:46:34] andrewbogott: that host *is* down, it was shut down las tweek and Coren is keeping it around...
[17:46:43] ooh
[17:46:46] I thought it was deleted
[17:46:48] which is totally legit, I guess.
[17:46:49] YuviPanda: No, it got deleted this morning
[17:46:51] andrewbogott: no not yet.
[17:46:54] oh
[17:46:56] I see
[17:47:00] Coren: andrewbogott hmm, so again puppet failing on shinken-01
[17:47:02] And shinken started complaining.
[17:47:10] looking
[17:47:26] it hasn't runu in a while but is clean >_>
[17:47:38] Something broken with cron, then?
[17:47:50] I think so
[17:47:52] hmm
[17:47:55] I wonder if stuff broke with *apt*
[17:48:00] YuviPanda: what was it that paged you on toollabs (side note)
[17:48:02] which would prevent puppet-run
[17:48:06] didn't get anything so was curious
[17:48:16] chasemp: shinken emailed me
[17:48:31] chasemp: icinga also paged, but it was in the dead zone for all of us
[17:48:38] understood thanks
[17:48:58] did that go to normal admin's then outside of teh blackout?
[17:49:07] of does that only go to ...labs folks none of whom were alive?
[17:49:25] chasemp: it goes to everyone.
[17:49:30] kk
[17:50:01] chasemp: the hope is that https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin is useful for other people too
[17:50:49] > W: Failed to fetch file://<%=/dists/@dir/%>//binary-amd64/Packages Invalid URI, local URIS must not start with //
[17:51:39] RECOVERY - Host tools-worker-05 is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[18:02:10] Labs, Tool-Labs, Patch-For-Review: Raise the default memory allocation to jsub/jstart jobs from 256M to at least 400M. - https://phabricator.wikimedia.org/T120517#1859028 (coren) p:Triage>Normal
[18:04:26] andrewbogott: Coren the shinken staleness stuff is fixed now
[18:04:49] YuviPanda: I'm guessing we want to make 'apt fails so puppet can't run' an error condition.
[18:05:13] YuviPanda: thanks
[18:05:16] Labs, Tool-Labs: Redirect //stable.toolserver.org/geohack/geohack.php requests - https://phabricator.wikimedia.org/T120526#1859043 (coren) p:Triage>Normal a:coren
[18:05:16] Coren: it is, it raises the puppet staleness error
[18:05:30] and shinken-01 has been in puppet stale error for a while :|
[18:05:33] just haven't investigated
[18:08:55] Labs, Incident-20150617-LabsNFSOutage: Audit projects' use of NFS, and remove it where not necessary - https://phabricator.wikimedia.org/T102240#1859066 (yuvipanda)
[18:08:57] Labs, Patch-For-Review: Disable NFS on the orgcharts project - https://phabricator.wikimedia.org/T103137#1859064 (yuvipanda) Open>Resolved And it's gone. Thanks everyone.
[18:09:47] Labs: Write up a report about kubecon to the Ops team - https://phabricator.wikimedia.org/T118757#1859070 (yuvipanda) I wrote this up last week, but it got to 800 words and was still incomplete, so definitely TL;DR. I'll try again.
[18:15:53] !log orcharts it is dead
[18:15:54] orcharts is not a valid project.
[18:16:04] !log orgcharts it is dead
[18:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Orgcharts/SAL, Master
[18:42:54] Labs, Labs-Infrastructure, operations, ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1859258 (coren) I've gotten the problem to reproduce once out of 17 attempts with POST stalling at `F/W Initializing Devices 0%`; which is the original issues. FW...
[18:44:32] andrewbogott: Think you can look at that changeset now? I'd like to get started on it.
[18:44:45] yes, one moment
[18:44:59] YuviPanda: In the meantime, I'll do that memory thang.
[18:45:38] thanks
[18:48:04] (PS1) coren: jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517)
[18:49:53] (CR) jenkins-bot: [V: -1] jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517) (owner: coren)
[18:52:33] (PS2) coren: jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517)
[18:53:45] Coren: I didn’t read it all that closely but it seems good. I presume you’ve tested this 101 times and are convinced it won’t just lock us out everywhere?
[18:54:28] bearND: still broken?
[18:54:29] (CR) jenkins-bot: [V: -1] jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517) (owner: coren)
[18:54:53] bd808: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Teahouse_question_query_tool is the thing that sparked the ES stuff btw
[18:54:59] andrewbogott: I might only have tested it a dozen times or so. :-)
[18:55:24] 2 is 1 and 1 is none :)
[18:55:36] YuviPanda: ah. a nice FAQ app
[18:55:39] that would be cool
[18:56:15] having a nice FAQ extension would really be cool too
[18:56:26] andrewbogott: I agree with your bikeshed color, but those are already existing on wikitech and in use, so renaming them as part of that patch is a non-starter.
[18:56:41] Coren: ok then :)
[18:56:45] I've got a growing number of pages with too many "how do i...?" subsections
[18:57:01] bd808: yup
[18:57:17] bd808: I made a similar comment in the talk page
[18:57:21] YuviPanda: the problem is ... structured data!
[18:57:45] which is the root of a lot of "I wish X was easier on-wiki" things
[18:58:11] bd808: SMW!
[18:58:55] well, actually yes
[18:59:36] its certainly not the wikibase use case that is being pursued and why SMW is very popular outside of the WMF cluster
[19:02:11] bd808: however, that's a bit too long way off...
[19:02:23] yeah
[19:02:29] bd808: and I think 'here, a working and unideal solution!' is a great intermediate step
[19:02:41] I totally agree
[19:02:50] * bd808 was jsut soapboxing
[19:02:53] ofc
[19:03:06] bd808: luis was talking about it too, wanted a 'real' Q&A platform (like one of the SO clones)
[19:03:12] bd808: that seems like a much larger social undertaking
[19:03:19] ugh
[19:03:42] How about we don't try to make the MW into the emacs of web software
[19:04:40] * YuviPanda integrates PAWS into MW, which gives it a shell that runs emacs
[19:07:38] (PS3) coren: jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517)
[19:07:47] YuviPanda: Oh blah; the automated tests fail because they presume a default mem value.
[19:08:19] nice :D
[19:12:03] YuviPanda: Sorta. debian-glue doesn't report the actual testing.log
[19:12:21] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1859398 (yuvipanda) Resolved>Open The edge case was unrelated to anything here, let's keep this open for investigation.
[19:12:35] YuviPanda: Wanna go over https://gerrit.wikimedia.org/r/#/c/257383/ ?
[19:12:56] (CR) Yuvipanda: [C: 1] jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517) (owner: coren)
[19:13:24] (CR) coren: [C: 2] jsub: Default memory allocation to 512M [labs/toollabs] - https://gerrit.wikimedia.org/r/257383 (https://phabricator.wikimedia.org/T120517) (owner: coren)
[19:20:17] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:20:18] I'm on it ^^
[19:20:18] goddamn
[19:20:18] Coren: I just restarted ldap and pdns, could be me
[19:20:27] It's coming back now, apparently.
[19:20:46] ... slowly.
[19:21:13] Maybe just thinks trembling as connections got lost and are reestablished?
[19:21:20] things*
[19:21:45] pdns failing causes this to fail
[19:21:51] sing nginx can't talk back to the instances
[19:21:51] afaict, everything is happy now.
[19:22:08] marc@tools-bastion-01:~$ host tools.wmflabs.org
[19:22:09] tools.wmflabs.org has address 10.68.21.49
[19:22:40] That works now; though it might not have while restarting
[19:22:44] Coren: if you look at /var/log/nginx/error.log, all the failures were it trying to reach something.tools.eqiad.wmflabs
[19:24:48] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 951291 bytes in 3.458 second response time
[19:28:22] andrewbogott: howdy! are you able to help? I still can't ssh to https://wikitech.wikimedia.org/wiki/Nova_Resource:Search-datavis-experimental.shiny-r.eqiad.wmflabs but can ssh into other search-datavis*.wmflabs instances no problem.
[19:28:37] bearloga: are you the same person as bearND?
[19:28:50] andrewbogott: nope
[19:28:54] Well, that’s confusing
[19:29:05] anyway, new instance creation is broken currently. We’re working on nit
[19:29:22] topic Status: New instance creation temporarily broken | Channel is logged: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/
[19:30:40] andrewbogott: ok, thanks for letting me know. should I delete that instance and recreate it when creation is fixed?
[19:33:59] YuviPanda: Built and sent to aptly.
[19:36:53] Labs, Tool-Labs, Patch-For-Review: Raise the default memory allocation to jsub/jstart jobs from 256M to at least 400M. - https://phabricator.wikimedia.org/T120517#1859501 (coren) Open>Resolved New version (1.7) with 512M default sent to tools' apt repo.
[19:38:37] And now I need my (very late) lunch.
[19:53:57] Labs, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Instance creation fails - https://phabricator.wikimedia.org/T120586#1859622 (hashar) Nodepool uses the OpenStack API to delete instances. The first error occurred at 2015-12-06 08:00:30,030 (nothing earli...
[20:00:07] bearloga: if you delete that instance and build a new one (with a different name) it should work fine
[20:04:40] andrewbogott: awesome! works fine now, thanks!
[20:04:48] great
[20:04:56] sorry it was broken — we still don’t really know why
[20:06:25] Labs, Labs-Team-Backlog, Patch-For-Review: Labs: create a new scheme for /etc/security/access.conf customization - https://phabricator.wikimedia.org/T120106#1859667 (coren) Open>Resolved This is complete, and was deployed with success. Keeping a close eye on it for a while, but otherwise done.
[20:07:25] andrewbogott: The changeset is deployed; I checked all the tricky cases and a couple randos, but everything looks fine. I'm keeping a close eye on auth for a while, but please mention if you see anything even vaguely fishy?
[20:07:33] ok!
[20:10:13] Hm. Images are being built with the old PAM scheme; I'll open a task to convert those too - but it's not a catastrophe since puppet will clean up new instances.
[20:12:49] Labs, Graphite: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#1859718 (chasemp) p:Triage>Normal
[20:13:36] chasemp: you wanted to talk about ^
[20:13:44] Labs, Labs-Infrastructure: Labs: update image builders to use new PAM scheme - https://phabricator.wikimedia.org/T120710#1859722 (coren) p:Triage>Normal a:coren
[20:13:56] I don't see wikibugs :) but the grafana thing we can chat later it's no big hurry
[20:14:09] chasemp: ok
[20:15:00] was mainly gonig to ask, is this for toollabs or all labs or what
[20:15:17] since we have some potential monitoring silo's and monitoring in labs is somewhat even less defined well than prod
[20:15:18] chasemp: all of labs.
[20:15:22] which is saying something :)
[20:15:34] what project would you put soemthing like that in assuming it's a vm?
[20:15:38] chasemp: it's just an equivalenet to grafana.wikimedia.org that hits labs graphite
[20:15:42] chasemp: nah, I'll just put it on labmon1001
[20:15:45] which is where graphite lives
[20:15:53] ok I understand better then
[20:16:05] I kind of disagree with some of that in loose principle
[20:16:09] but not enough to hold anything up
[20:16:14] alternatives welcome :D
[20:16:15] as it's a long winded and nuanced disagreement
[20:16:19] :)
[20:16:23] I don't like it either...
[20:16:46] but monitoring tools that are useful, I do not like putting them on labs instances themselves...
[20:16:53] the tldr is I want to do better for labs than that and reframe teh prod equiv things to staging/beta labs
[20:16:56] where the context makes sense
[20:17:06] mirroring prod logisitics for separation and access is super messy
[20:17:20] and somewhat cluttered when we talk about translating mechanics
[20:17:25] from a single to multitenant world
[20:17:27] etc etc
[20:17:33] but like I said, complicated
[20:17:37] and mostly all in my head
[20:17:58] right, but that's a very long winded conversation that IMO we're not in a position to fix unfortunately at this moment. it's one of those things you start talking about and churn around in the background for a while and slowly do something about...
[20:18:06] :)
[20:18:08] yes
[20:19:18] atm just floating the idea so it can slowly creep
[20:20:27] Labs, Labs-Infrastructure, Patch-For-Review: labcontrol1001 and 1002 running web servers on 80 and 443 open to all - https://phabricator.wikimedia.org/T120449#1859749 (chasemp)
[20:20:38] Labs, Labs-Infrastructure, Patch-For-Review: labcontrol1001 and 1002 running web servers on 80 and 443 open to all - https://phabricator.wikimedia.org/T120449#1854056 (chasemp) p:Triage>High
[20:20:44] chasemp: yeah, but IMO it also involves involvement from rest of ops and releng, hence... complicated
[20:21:12] rest of ops to an extent at least on buy-in or lack of blocking idk really, it's a grey area
[20:21:14] releng def
[20:21:33] but there is some rough Beta cluster as a service and staging is a real thing brewing in my head
[20:21:36] that ties in pretty closely
[20:22:02] where we have predefined projects that come w/ appropriately scaled relevant and equiv monitoring in context and we don't monitor inside project things
[20:22:08] and staging is a more-real 1:1 at least
[20:22:11] chasemp: yeah, we tried it for a few months and gave up about a year ago.
[20:22:14] where the grafana equiv would make senes [20:22:21] yeah I've talked w/ greg a bit about it [20:22:28] chasemp: remnants of that are found in nodes/labs in ops/puppet :) [20:22:30] I think it's way not will that is the limitation [20:22:34] right [20:22:42] needs resources [20:23:01] chasemp: well, this grafana is just for making whatever graphs we want off data in labs graphite. So we'll need this anyway, for arbit people to make arbit graphs [20:23:02] chasemp: we (releng) quickly talked about staging during our today meeting [20:23:04] my understanding in part is it died due to lack of intended resourcing, i.e. reorg expectations in part [20:23:16] chasemp: yeah [20:23:21] chasemp: told: staging is not being worked on (lack of people / other priorities) [20:23:32] anyhoo, this is long in the tooth talk :) [20:23:48] yeah I'm thinking into next year etc and just general outline [20:24:01] otherwise we spin our wheels on 2 week solutions and never overcome teh real limitations [20:24:17] ^ [20:24:25] I am pretty confident Releng is all for a staging [20:24:55] and bunch of dev/sysadmin/devops --whatever the name-- would love something a bit more stabler than beta [20:25:19] give me a year or so and I think we'll have a real convo :) [20:25:23] that's not sarcasm actually [20:25:25] but that all boils down to how we organize our dev pipeline from idea to infra. Too often we push straight to prod and then have to catch up on the integration platforms [20:25:42] !google define convo [20:25:43] https://www.google.com/search?q=define+convo [20:25:47] oh agreed staging is only important if releng forces it to be [20:25:58] well [20:26:02] convo short for conversation :) [20:26:13] lets first remove +2 / merge rights from ops ? :-} [20:26:15] chasemp: yeah, I agree. 
I think there's other fires we can put out first (neutron, wikitech, etc)
[20:26:19] (just kidding)
[20:27:02] I believe staging is not much of a priority because most people learned how to feature switch off by default and use the beta cluster as the semi prod platform
[20:27:26] YuviPanda: the reason I asked really is being unsure but also one of the big coming things I think is drawing harder lines between inside of tenant issues and labs-team supported things
[20:27:33] so I was just looking to frame this in that light
[20:27:40] we have to draw better lines of demarcation really
[20:27:49] chasemp: right, and hence the grafana one would be something along the lines of labs-team supported thing, similar to graphite...
[20:27:51] well
[20:27:53] *this* grafana
[20:27:57] :)
[20:28:12] cool
[20:28:25] hashar: sure of course
[20:28:53] I think the short long term reply is beta as is dies in a fire, there is a Beta-Cluster-as-A-Service that teams can create to test things short term
[20:29:03] and changes are merged into staging as a pre-prod due diligence
[20:29:05] etc etc
[20:29:12] but that's ways away
[20:29:44] yup
[20:30:06] lot of teams have asked for a way to spawn their own whole stack cluster with a click of a mouse
[20:30:20] kind of mediawiki-vagrant on steroids :D
[20:30:59] so if one team wanted to rework something they do it on their own staging tenant. Then we head to staging where QA/ops validate and we then push to prod
[20:31:06] a nice waterfall pipeline :-}
[20:31:30] hashar: have you seen tools.wmflabs.org/paws? :) (use with non (WMF) account only because of bugs)
[20:31:40] hashar: it's a 'pywikibot on a shell' on a one-click service
[20:31:49] hashar: in the long term it might be nice to have something like that for mw
[20:32:05] I have made tools.wmflabs.org to point to 127.0.0.1 . That site hosts too many great tools that end up consuming all my free time!
[20:32:14] hahaha
[20:32:31] oops
[20:32:32] Labs, Labs-Infrastructure, operations: logrotate/disk space on silver for nutcracker log - https://phabricator.wikimedia.org/T120683#1859807 (chasemp) p:Triage>High There is a logrotate there now I guess we need to get more aggressive on this box: silver:~# cat /etc/logrotate.d/nutcracker /var/l...
[20:32:42] bah I logged with my hashar account
[20:33:18] hashar: that's ok
[20:33:31] hashar: do *not* use your WMF account (it fails with accounts that have special chars like '(')
[20:33:34] Labs, Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1859815 (chasemp) p:Triage>High
[20:33:35] oh
[20:33:50] hashar: anyway, you can click 'new' and you get a terminal :)
[20:33:56] which is a fully functional terminal...
[20:33:56] Labs, Labs-Infrastructure, operations, HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1859817 (chasemp) p:Triage>Normal
[20:33:57] so what is that Jupyter thingie ?
[20:33:59] with pywikibot installed
[20:34:05] lies!
[20:34:08] and oauth integrated
[20:34:10] hashar: try it!
[20:34:12] Labs, Quarry, Labs-Infrastructure, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1859818 (chasemp) p:Triage>Normal
[20:34:13] hashar: you can even run emacs there
[20:34:30] does it provide a text editor?
[20:34:44] hashar: yes :)
[20:34:50] hashar: that's also an option in 'new'
[20:35:03] * andrewbogott -> food
[20:35:08] back (possibly much) later
[20:36:06] YuviPanda: is this closed then? https://phabricator.wikimedia.org/T120287
[20:36:09] resolved even
[20:36:38] Labs, Tool-Labs, Patch-For-Review: install ruby build tools on dev hosts - https://phabricator.wikimedia.org/T120287#1859826 (yuvipanda) Open>Resolved a:yuvipanda Done I think?
[20:36:43] chasemp: yes :-)
[20:36:43] YuviPanda: so in theory I can head to Jupyter to run background tasks easily ?
[20:36:52] ah, yuvi beat me to it
[20:36:55] YuviPanda: instead of having to foobar.sh & on some random host?
[20:36:55] what is wikibugs doing
[20:37:01] hashar: well, eventually I'll kill sessions that have had no user activity for like an hour
[20:37:06] hashar: err, for like 12h
[20:37:19] hashar: so no, but that's another problem I will solve at some point :) just not yet
[20:37:29] hashar: but if your task will complete within that time period, sure...
[20:37:45] YuviPanda: I am not sure how useful it is to me. But 10+ years ago I would surely have welcomed such a service
[20:37:57] must be the phab backend that's broken
[20:37:57] hashar: yeah, it is targeted at people using pywikibot atm :)
[20:38:12] sounds good
[20:38:21] hashar: but I was showing it to you primarily as a 'it would be nice to have something like this for mediawiki people'
[20:38:31] heck one day we will have a Django app on top of pywikibot
[20:38:58] 2015-12-07 20:36:38,067 - irc3.wikibugs - DEBUG - > JOIN #wikimedia-labs
[20:38:59] 2015-12-07 20:36:38,110 - irc3.wikibugs - DEBUG - > PRIVMSG #wikimedia-labs :Labs, Tool-Labs, Patch-For-Review: install ruby build tools on dev hosts - https://phabricator.wikimedia.org/T120287#1859826 (yuvipanda) Open>Resolved a:yuvipanda Done I think?
[20:39:00] O_o.
[20:39:04] YuviPanda: yeah MzMcBribe asked for a solution where one would have their own labs instance with one click
[20:39:28] something like (warning buzz words) mediawiki-vagrant-on-labs-as-a-one-click-service
[20:39:36] yeah
[20:39:36] !log tools.wikibugs wb2-irc thinks it's connected but messages don't actually get out to IRC. Restarting.
[20:39:38] long way off though
[20:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master
[20:40:03] YuviPanda: honestly when we reach that point, we should create a for profit spin off dedicated to mw hosting :-}
[20:40:16] with dividends yields up to the wmf !
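The idle-session cleanup mentioned at 20:37 ("kill sessions that have had no user activity for like 12h") could look roughly like the sketch below. This is only an illustration under assumptions: the `idle_sessions` helper, the session names, and the 12-hour `IDLE_LIMIT` are hypothetical and do not describe how PAWS or JupyterHub actually tracks activity.

```python
from datetime import datetime, timedelta

# Hypothetical threshold, taken from the "like 12h" remark; not a confirmed setting.
IDLE_LIMIT = timedelta(hours=12)


def idle_sessions(last_activity, now, limit=IDLE_LIMIT):
    """Return the sorted names of sessions idle for longer than `limit`.

    `last_activity` maps a session name to the datetime of its last user activity.
    """
    return sorted(name for name, seen in last_activity.items() if now - seen > limit)


now = datetime(2015, 12, 7, 20, 37)
sessions = {
    "alice": now - timedelta(hours=13),  # idle past the limit, would be killed
    "bob": now - timedelta(minutes=30),  # recently active, left alone
}
print(idle_sessions(sessions, now))
```

A background task that finishes inside the limit survives under this rule, which matches the "if your task will complete within that time period, sure" caveat.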
[20:40:24] oh, no, it's just me >_<
[20:40:43] YuviPanda: that Jupyter uses docker I bet or some thin lxc containers isn't it?
[20:40:45] what happened, valhallasw`cloud
[20:40:50] hashar: it's almost as wikia.com
[20:40:53] hashar: yeah, it's using kubernetes.io
[20:40:53] irccloud changed their div classes
[20:40:56] valhallasw`cloud: haha :D
[20:40:58] ofc
[20:41:07] so ignored messages don't show up as gray but as actually ignored
[20:41:19] YuviPanda: I still don't get the point in us having OpenStack vs Ganeti vs Kubernetes :D
[20:41:49] hashar: they're all very different things (at least Kubernetes is very different from OpenStack and Ganeti)
[20:42:00] k8s and openstack solve very different problems
[20:42:26] well, there is some prelim container stuff that overlays on openstack but in large part a lot of k8s deployments are on VMs and openstack
[20:42:30] from what I can tell
[20:42:51] grand pa' hashar see all of that as being "The Cloud"©®™
[20:43:42] so k8s is kind of on top of a cloud infra?
[20:43:49] to ease driving/managing a farm of instances isn't it?
[20:43:53] if you think of a container as a light weight VM you are already off the rails
[20:44:21] k8s (and yuvi is way more qualified to quibble about this) is a container orchestration solution
[20:44:29] which is to say distributed process management in the form of containers
[20:44:35] with HA and all kinds of goodness
[20:45:25] YuviPanda: am I in the neighborhood?
[20:45:42] so I would ask for a varnish / 5 apps server / a DB and kubernetes figures it out for me?
[20:45:48] basically yeah, hashar
[20:45:55] think of a container as 'process + dependencies'
[20:45:58] rather than as a VM
[20:46:51] sounds like Ubuntu Juju https://jujucharms.com/
[20:47:16] no...
[20:47:31] like, I *can* pick up a nail with my feet, but hands are infinitely better
[20:47:42] * hashar contemplates how he is never going to be accepted in ops team
[20:50:51] We can run k8s inside juju!
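The "ask for a varnish / 5 apps server / a DB and kubernetes figures it out" exchange at 20:45 can be illustrated with a toy placement loop. This is a sketch under loud assumptions: it is not the Kubernetes API or its real scheduler, and the `schedule` function, the node names, and the slot counts are all made up for illustration.

```python
def schedule(desired, nodes):
    """Place each requested replica on the node with the most free slots.

    `desired` maps a service name to a replica count; `nodes` maps a node
    name to its number of free slots. Returns {node: [service, ...]}.
    Raises RuntimeError when the request does not fit.
    """
    free = dict(nodes)
    placement = {name: [] for name in nodes}
    for service, replicas in desired.items():
        for _ in range(replicas):
            if not free:
                raise RuntimeError("no nodes available")
            # Most free slots wins; ties go to the alphabetically first node.
            target = max(sorted(free), key=lambda n: free[n])
            if free[target] == 0:
                raise RuntimeError(f"no capacity left for {service}")
            free[target] -= 1
            placement[target].append(service)
    return placement


# Hypothetical request: one varnish, five app servers, one database.
wanted = {"varnish": 1, "app": 5, "db": 1}
cluster = {"node-a": 4, "node-b": 4}
print(schedule(wanted, cluster))
```

The caller only states what should run and how many copies; where each copy lands is the orchestrator's decision, which is the distinction from hand-managing a farm of VMs drawn in the conversation.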
[20:50:59] * valhallasw`cloud runs
[20:51:16] heheahaha cue yuvi's panic attack
[20:54:47] YuviPanda: although, given the size of the 'container', the distinction with a real VM does blur pretty much
[20:55:13] valhallasw`cloud: hmm?
[20:55:41] valhallasw`cloud: not really, since containers also don't run an init process... no systemd/upstart so if you have multiple processes you gotta manage them all yourself
[20:55:50] and restarting a container loses state
[20:59:17] right, you can start processes in the container, which is not possible for a vm (although SSH comes close :P)
[21:00:12] but I guess in the end it's all turing complete, so it's all the same ;-)
[21:01:06] it's definitely at heart a technical and not an efficiency distinction by default :)
[21:05:10] Labs, Tool-Labs, Patch-For-Review: Redirect //stable.toolserver.org/geohack/geohack.php requests - https://phabricator.wikimedia.org/T120526#1859896 (coren) In addition to the page above, the entry for stable.toolserver.org needs to be added.
[21:14:35] Labs, Tool-Labs: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#1859936 (coren) a:coren
[21:19:37] Labs, Wikimedia-Site-Requests, wikitech.wikimedia.org: Investigate the creation of a wiki for the dev community that is organizing around Labs - https://phabricator.wikimedia.org/T70818#1859943 (coren) Open>declined a:coren Given how little momentum this has in practice, and the fact that thi...
[21:25:22] Labs, Wikimedia-Site-Requests, wikitech.wikimedia.org: Investigate the creation of a wiki for the dev community that is organizing around Labs - https://phabricator.wikimedia.org/T70818#1859959 (Dzahn) thanks for declining . adding more wikis (just like adding mailing lists) is a common suggestion to...
[21:28:23] hashar: in the end, k8s is going to be closer to how we use SGE now (scheduling, but then more decoupled) than how we use openstack
[21:30:21] valhallasw`cloud: yeah sounds like an orchestration tool that runs processes in their own little containers and takes care of scheduling the container at some place
[21:30:29] where some place can be an openstack cloud
[21:30:45] I will need to sit down face to face with the right set of people to properly understand it I guess
[21:31:01] hashar: yeah, I might give a presentation at next hackathon / wikimania
[21:31:03] Labs, Tool-Labs, Patch-For-Review: Provision and test tools-mailrelay-02 - https://phabricator.wikimedia.org/T97574#1859966 (valhallasw)
[21:32:01] Coren: the actual conversion to trusty takes some work, the manifests did not apply cleanly last time I tried. But we can skip mailrelay-02 and go to a trusty -03 immediately, I suppose.
[21:32:37] valhallasw`cloud: That was my plan; in broad strokes.
[21:32:46] sounds good.
[21:33:12] valhallasw`cloud: I think it's easier to combine the make-new-relay with make-relay-trusty in the long run.
[21:33:25] valhallasw`cloud: Coren while doing it, can we not inherit from the 'toollabs' class for mailrelay? it brings in all of gridengine with it...
[21:33:33] does mail relay need gridengine?
[21:33:41] YuviPanda: we need that for mail piping I think?
[21:33:51] YuviPanda: It does - sends jobs to the grid for piping.
[21:33:52] oh hmm yeah
[21:33:53] right
[21:33:55] ok
[21:34:02] YuviPanda: Though I can make that whole thing conditional though.
[21:34:10] yeah, the original reason to go to precise was because we thought we needed a new host ASAP because of the COW-migration-issue
[21:34:19] It's very tools specific so I can move /that/ into toolabs:: and include both for the tools relay
[21:34:28] but then we realised the host was only m1.small and we shrugged about that part
[21:34:54] yeah, the toollabs class is too large and inheritance is a bit confusing with hiera
[21:34:58] YuviPanda: So the general relay config won't know about the piping.
[21:35:16] right
[21:38:58] Coren: fwiw, I'm 99% certain tools-mailrelay-02 can just be deleted. The notes for testing are all in T97574.
[21:40:02] valhallasw`cloud: That makes sense.
[21:40:12] This is going to be my day tomorrow, I think.
[21:55:02] Coren: is there a calendar event for the ldap switch?
[21:55:06] on the morrow
[21:55:30] None that I can see.
[21:55:45] Might be a good idea to add it to the maintenance calendar?
[21:56:13] probably yeah I'm thinking and also just blocking calendar
[21:56:13] etc
[22:01:33] (PS1) Ori.livneh: Add self to root-authorized-keys [labs/private] - https://gerrit.wikimedia.org/r/257444
[22:02:20] (CR) Yuvipanda: [C: 2 V: 2] Add self to root-authorized-keys [labs/private] - https://gerrit.wikimedia.org/r/257444 (owner: Ori.livneh)
[23:37:17] Labs, Wikimedia-Site-Requests, wikitech.wikimedia.org: Investigate the creation of a wiki for the dev community that is organizing around Labs - https://phabricator.wikimedia.org/T70818#1860673 (bd808) Perhaps we could create some more targeted tickets around specific problems that the current Labs an...