[01:16:13] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:34:18] 10Tool-Labs: Create a utility that dumps all databases of a user - https://phabricator.wikimedia.org/T91231#1127233 (10scfc) The legacy databases are the ones I was concerned about and caused me to not take any short cuts; the "something custom" bit reinforced that now :-). My initial motivation was to find out... [01:41:14] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0] [03:11:55] 10Wikimedia-Labs-wikitech-interface, 6operations: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127368 (10Krinkle) 3NEW [03:12:29] 10Wikimedia-Labs-wikitech-interface, 6operations, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127376 (10Krinkle) [03:14:06] 10Wikimedia-Labs-wikitech-interface, 6operations, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127381 (10Krinkle) [04:49:23] 6Labs, 10Continuous-Integration, 6operations: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1127424 (10Krinkle) >>! In T92710#1118969, @scfc wrote: > Looking at http://shinken.wmflabs.org/service/integration-slave1402/Puppet%20failure, shinken seems to have no... 
[06:41:59] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [06:41:59] (03CR) 10Florianschmidtwelzow: [C: 031] "@Ricordisamoa: If you want, you can add me to reviewer list, too, and i will try to review patches in this repo, too :) Sounds like an awe" [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/180741 (owner: 10Ricordisamoa) [06:43:32] (03CR) 10Yuvipanda: [C: 032] ":)" [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/180741 (owner: 10Ricordisamoa) [06:45:19] (03CR) 10Ricordisamoa: "not in Jenkins :)" [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/180741 (owner: 10Ricordisamoa) [06:46:49] (03CR) 10Yuvipanda: "I don't have submit rights either." [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/180741 (owner: 10Ricordisamoa) [06:52:29] (03CR) 10Ricordisamoa: [V: 032] add .gitreview [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/180741 (owner: 10Ricordisamoa) [07:07:08] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [07:12:25] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1127539 (10yuvipanda) 3NEW [07:12:40] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1127546 (10Thgoiter) >>! In T92963#1126408, @Aklapper wrote: >> Isn't there any possibility to make this tool run more stable? > > That might be a question for Para in https://wikitech.wikimedia.org/wiki/User_talk:Para#Moving... 
[07:13:04] 10Tool-Labs, 5Patch-For-Review: Unify proxylistener interacting code across portgrabber / tool-nodejs / tool-uwsgi - https://phabricator.wikimedia.org/T91957#1127548 (10yuvipanda) See also T93046 [07:13:13] 10Tool-Labs, 5Patch-For-Review: Unify proxylistener interacting code across portgrabber / tool-nodejs / tool-uwsgi - https://phabricator.wikimedia.org/T91957#1127550 (10yuvipanda) [07:13:14] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1127539 (10yuvipanda) [08:54:59] 6Labs, 10Continuous-Integration, 6operations: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1127872 (10hashar) Just a note: instead of looking at /var/log/syslog, you can look at /var/log/puppet.log on the instances. That files comes with ansi colored output w... [09:32:48] 10Wikimedia-Labs-wikitech-interface, 6operations: wikitech instances list is blank - https://phabricator.wikimedia.org/T89808#1127979 (10Nikerabbit) 5Invalid>3Open I'm encountering empty instance list almost every time I try to use the instance list. Needing to log out and log in again works around the pro... [10:53:56] 10Wikimedia-Labs-wikitech-interface, 6operations, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1128104 (10Aklapper) p:5Triage>3High [11:49:54] !log tools.wikibugs yuvipanda: Deployed 23240bd0dc5aebcc2a94b6f1ac268e2e3ad41114 Add more projects for devtools and mobile wb2-phab, wb2-irc [11:50:00] Logged the message, Master [12:16:14] 6Labs, 5Patch-For-Review: db servers for designate and labs pdns - https://phabricator.wikimedia.org/T92694#1128258 (10Springle) I noticed two grants were applied on m1-master: pdns_admin@dbproxy1001 pdns_admin@holmium The holmium account only had INDEX priv. I've removed it. Please only grant and connect v... 
[12:55:52] (03PS1) 10Ricordisamoa: Code-style tweaks [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/197514 [13:03:07] 6Labs, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Use a Puppet ENC to define which classes are included in which nodes (in Labs) - https://phabricator.wikimedia.org/T85279#1128311 (10scfc) That's a bummer. If we could make the regular expression match of the host name a fact, we might use a "no... [13:04:42] 6Labs, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Use a Puppet ENC to define which classes are included in which nodes (in Labs) - https://phabricator.wikimedia.org/T85279#1128316 (10scfc) (Or `${instanceproject}-${typeofinstance}${number}`, seems to be another common pattern.) [13:05:56] (03CR) 10Ricordisamoa: [C: 04-1] Code-style tweaks (038 comments) [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/197514 (owner: 10Ricordisamoa) [13:22:59] 6Labs, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Use a Puppet ENC to define which classes are included in which nodes (in Labs) - https://phabricator.wikimedia.org/T85279#1128399 (10yuvipanda) @scfc Assuming you meant 'bummer' is the fact that we can't get rid of hiera_include classes? Anyway,... [13:32:51] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128437 (10scfc) The problem with the uid = port approach is that if that one port is taken, you're stuck for good. Ideally, web servers would bind to a free port (0) and then report to the proxies the actual port they allocated. I... [13:34:28] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128448 (10yuvipanda) @scfc: Right, so the only 'true' way to find out if a port is free from userspace that I could find is actually bind to a free port, but I don't know if closing it will release it immediately... We could do some... 
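The "bind to a free port (0) and then report the actual port" idea from T93046 can be sketched roughly as follows. This is only an illustration in Python, not code from portgrabber or any tool mentioned here: the function name `bind_ephemeral` is hypothetical, and reporting the port to the proxy is left out.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a listening TCP socket to a kernel-chosen free port.

    Returns (sock, port). The socket is kept open, so no other process
    can take the port between finding it and using it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))          # port 0 asks the kernel for any free port
    sock.listen(5)
    port = sock.getsockname()[1]  # the port that was actually allocated
    return sock, port


if __name__ == "__main__":
    sock, port = bind_ephemeral()
    print("listening on port", port)  # this is what would be reported to the proxy
    sock.close()
```

Closing the socket and letting a different process bind the same port later reintroduces the race discussed in the task; the sketch only avoids it while the returned socket is held open by the server itself.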
[13:34:38] !log deployment-prep added mobrovac to project [13:34:42] Logged the message, Master [13:34:50] :) [13:35:41] !log deployment-prep made mobrovac projectadmin [13:35:44] Logged the message, Master [13:43:44] andrewbogott_afk: do you know how far off is a trusty / jessie migration for the labs puppetmaster? [13:44:20] YuviPanda: I don’t know for sure. There are a bunch of tickets around that, if you search for ‘virt1000 spof’ [13:45:02] andrewbogott_afk: looking. [13:45:30] !log deployment-prep added restbase security group [13:45:34] Logged the message, Master [13:46:21] andrewbogott_afk: hmm, I see. It looks like I might do the ENC ‘properly’ enough that we can deploy it on the puppetmaster, but then I’m wary of putting one more thing on virt1000 [13:48:53] virt1000 is the right place for it I think. Although I understand why you’re wary [13:49:38] andrewbogott_afk: yeah, esp. for it to be performant enough it will have to be basically a running service that is hit via http or something... [13:49:54] andrewbogott_afk: so that means I basically am ‘stuck’ with precise’s versions of python libraries, which sucks [13:50:08] there’s already apache on virt1000 for the normal puppet master. [13:50:33] right, but precise apache + mod_wsgi or something seems very unideal [13:50:59] Ideally I’d just put this behind a uwsgi thing, but also other things like python-flask are fairly old in precise... [13:54:25] robh really wants me to rename virt1000, so maybe that’ll be the impetus for getting a new box :) [13:55:07] andrewbogott: :) [13:59:35] 6Labs, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Use a Puppet ENC to define which classes are included in which nodes (in Labs) - https://phabricator.wikimedia.org/T85279#1128500 (10scfc) With bummer I meant https://gerrit.wikimedia.org/r/#/c/197487/. The more places there are where configurat... [14:05:55] <2 million files left to copy. 
[14:07:04] I think I can safely say that this will have to be the last time for a filesystem move; it's getting impressively massive. [14:07:47] (And also that I think that daily mirror to codfw may be overly optimistic) [14:11:37] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128563 (10coren) There is no "safe" way to do so, except passing an open file descriptor to the web server as it starts (which very, //very// few servers support - uwsgi being the only one I know of). That said, if you retain the "r... [14:12:51] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128567 (10yuvipanda) Portgranter testing has the same effect as the wrapper script testing, which is the race condition. And doing it in the wrapper script reduces the number of moving parts. [14:13:58] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128578 (10coren) No, it's not the same effect: the granter testing means that if the port is found use after all it can be /removed/ from the list entirely and another one granted to the client. I.e.: the client doesn't fail, it ge... [14:15:27] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128582 (10coren) And also, there are no race condition possible if there is a single granter given it will never *try* to allocate the same port twice. [14:16:15] Coren: have a few minutes to check my work? [14:16:25] andrewbogott: Sure. Point me at it. [14:16:40] a new dns server, labs-ns2.wikimedia.org [14:17:05] It should have entries for instances that were created in the last 12 hours or so. they’re of the form ..eqiad.wmflabs [14:17:26] Can you dig around there a bit and confirm that the server is acting as you’d expect? I don’t necessarily know what to look for beyond things generally resolving. [14:17:46] andrewbogott: Doing so now, but it already looks promising. 
[14:17:49] labmon1001 uWSGI alert for 6d22h [14:18:15] an example of a recent host: deployment-restbase01.deployment-prep.eqiad.wmflabs [14:18:26] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128587 (10yuvipanda) You can do the exact same looping in the wrapper script too :) Again, the problem is that the portgranter doesn't actually allocate anything. The 'race condition' is jobs not going through webservice getting on... [14:18:42] YuviPanda: paravoid’s comment ^ seems like something you might know about [14:19:05] paravoid: Coren was looking at it. [14:20:28] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128597 (10yuvipanda) Basically, I'm trying to figure out what exactly porgranter adds here, and I can not. It is another running service with a dependency on NFS, and I do not see what it actually gives us over simple uid based port... [14:20:58] But also, "hostname.projectname"? I agree this is a good change, but will need tweaking of resolv.conf to get right without breaking things that don't expect it. [14:21:32] Coren: Yes, I’m not sure about that change. But… [14:21:54] Coren: whatever happened with the uwsgi alerts you were checking on labmon1001? paravoid points out they’re still alerting... [14:22:15] It’s going to be a while untl that server supports public/floating ips. So I’m hoping we can leave everything going at once (dnsmasq + old ldap server + new designate server) since they don’t overlap in scope. [14:22:27] But, maybe that won’t work since one or the other needs to be authoritative for .eqiad.wmflabs? [14:22:52] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128606 (10coren) You're looking at it the wrong way: it doesn't allocate ports, it allocates and registers from a pool. Think of it like a DHCP server, not a bind() system call. Its missing feature right now is removing collisions... 
[14:23:09] andrewbogott: Search order can cope with that. [14:23:14] cool [14:23:38] So, .project.eqiad.wmflabs seems like the right direction going forward, and we can switch usage over gradually. [14:24:09] YuviPanda: Give me a minute to swutch context. [14:24:15] Coren: cool :) [14:24:54] andrewbogott: I +1 that DNS server, if only because it answers correctly when it doesn't know. But everything else I see is full of win. [14:25:04] Coren: so, when setting up that server I had to specify things like SOA, and I don’t entirely understand how/if that affects the actual resolving of addresses. You don’t see any typos or mistakes there? [14:25:19] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128615 (10yuvipanda) Right, but I'm saying this doesn't require a DHCP server.... Imagine I get rid of portgranter, and just set the PORT to be the uid. What goes wrong? What i the negative of that? The positive, as I said, is a sim... [14:25:47] andrewbogott: No, it makes sense (though 300s ttl for a SOA is on the low side). Also, you think root@wmflabs.org is the best mailbox? [14:26:04] Coren: no idea — what do you think is better? [14:26:22] dontbugus@example.org? [14:26:39] Oh god no; please have the hostmaster remain a real mailbox. :-) [14:26:59] (Besides it being mandated by the specs) [14:27:03] Coren: what’s the ttl on dnsmasq now? Seems like it must be pretty short since I destroy/recreate hostnames without incident. [14:27:52] It's tiny atm iirc. 60s? [14:28:15] Yeah, that’s why I went with 5 minutes, that seems generally handy to not have to wait when reusing hostnames. [14:28:33] even 5 minutes is long enough to cause confusion [14:28:40] so, the mailbox, what do you suggest? [14:31:23] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128633 (10coren) Exatcly the same things as static IP allocation. 
Nothing when all goes well, and hard-to-track issues that need manual intervention when there is a collision. If your wrapper doesn't test the port then that blocks... [14:31:55] andrewbogott: That says to me "we want a short TTL on A records" not "We want a short-lived SOA" :-) [14:32:19] Oh, ok, I was confused. Lemme see where that’s set. [14:33:02] what query is getting you that ttl? [14:33:05] 10Tool-Labs: Memory Exhausted Near / Tool labs error while querying with Python - https://phabricator.wikimedia.org/T93074#1128639 (10marcmiquel) 3NEW [14:33:30] dig SOA @labs-ns2.wikimedia.org eqiad.wmflabs.org [14:33:32] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128646 (10yuvipanda) Since we have only one webservice running per user, there would be no collissions from webservice itself. And if there is an external collision, that is going to come from something not using webservice (and cons... [14:35:54] dig SOA @labs-ns2.wikimedia.org wmflabs.org [14:35:58] I mean [14:39:46] Coren: mailbox? Or is there no good answer? [14:40:40] 10Wikimedia-Labs-wikitech-interface: Get Parsoid (and thus VisualEditor) working again on Wikitech - https://phabricator.wikimedia.org/T75104#1128666 (10Krenair) 5Open>3Resolved a:3Krenair Looks like this was fixed at some point. [14:41:22] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128669 (10yuvipanda) Hmm, zombie webservice processes that have refused to die would cause problems, maybe? (with uid -> port mapping) [14:41:36] 10Wikimedia-Labs-wikitech-interface: Not possible to delete files - https://phabricator.wikimedia.org/T73208#1128671 (10Krenair) Does this still occur, @scfc? [14:41:38] andrewbogott: root@ may be the least bad one. I can't think of a better place. [14:41:49] ok, fair enough :) [14:42:01] I don't think we have a dedicated hostmaster mailbox right? 
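For context on the hostmaster mailbox question: in a SOA record the mailbox is written as a domain name (the RNAME field), with the "@" replaced by a dot and any dots in the local part escaped with a backslash. A small sketch of that conversion, where the helper name `email_to_rname` is hypothetical:

```python
def email_to_rname(address):
    """Convert 'user@domain' to the dotted form used in a SOA RNAME field."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected an address of the form user@domain")
    # Dots inside the local part must be escaped, otherwise they would be
    # read as label separators.
    escaped_local = local.replace(".", "\\.")
    return "{}.{}.".format(escaped_local, domain.rstrip("."))


if __name__ == "__main__":
    print(email_to_rname("root@wmflabs.org"))
```

So root@wmflabs.org would appear in the zone as `root.wmflabs.org.`.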
[14:45:22] 10Wikimedia-Labs-wikitech-interface: wikitech strict warnings on API save - https://phabricator.wikimedia.org/T72628#1128685 (10Krenair) 5Open>3Resolved a:3Krenair I think this would've been fixed ages ago (by https://gerrit.wikimedia.org/r/#/c/183069/ ? or the wikitech config changes) [14:45:36] Coren: not as far as I know [14:45:58] YuviPanda|food: That labmon1001 error thing isn't what I was looking at - I fixed a puppet issue. I'm looking at the error now but it looks like the test itself is broken. [14:46:13] andrewbogott: Then root@ it is. :-) [14:46:17] Coren: hmm, I remember you were looking at uwsgictl vs service uwsgi, etc [14:46:45] 10Wikimedia-Labs-wikitech-interface: Thumbnails for Commons images are not displayed - https://phabricator.wikimedia.org/T74245#1128692 (10Krenair) That link works for me... Is this the same thing as {T93041}? [14:47:09] Coren: ok, thank you! I release you to Yuvi :) [14:47:35] YuviPanda|food: Yeah, but that was in the context of "puppet breaks trying to start the service" [14:48:05] YuviPanda|food: The service runs now; the error is... well, I'm not sure what it's trying to say is in error. [14:48:51] "Not all configured uWSGI apps are running" is clearly incorrect when there is exactly one configured app and it /is/ running. I'm looking at how it's trying to decide that now. [14:49:25] It might be confused about what is configured. [14:50:54] 10Wikimedia-Labs-wikitech-interface: Wikitech: Install TemplateData extension - https://phabricator.wikimedia.org/T56316#1128703 (10Krenair) 5Open>3Resolved a:3Krenair This seems to have been done, I think by the wikitech config/deployment changes. [14:51:10] Coren: ah, hmm. I’m neck deep in beta / staging stuff, can you take care of the alert in some form / way? :) [14:51:17] and I should actully go away and eat food.. [14:51:36] YuviPanda|food: Yeah, I'm trying to figure out why it's firing as we speak. 
[14:51:47] ty [14:54:40] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128716 (10coren) They would, and with no way to allocate a different port that would stall the service. Whereas if a new webservice job is started it'd get a fresh port and would override the redis mapping with the new one. Serious... [14:55:29] Yuvi|reallyFood: I have more thought to that port thing than you gave me credit for. :-) [14:55:33] gave* [14:56:19] But test-and-remove-from-pool is missing indeed. 100% of the current issues are exactly that. [14:57:18] 10Wikimedia-Labs-Infrastructure: Have shell requests marked as uncompleted or completed automatically - https://phabricator.wikimedia.org/T47456#1128733 (10Krenair) [14:57:19] 10Wikimedia-Labs-wikitech-interface: Cleanup and enable UserFunctions extension on wikitech - https://phabricator.wikimedia.org/T47455#1128732 (10Krenair) 5Open>3stalled [15:02:45] Hey andrewbogott [15:02:58] Krenair: eating, back in 20 [15:03:00] I was trying to ssh to silver yesterday but had to go via bastion. I thought it was accessible externally? [15:03:01] ok [15:08:29] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1128771 (10scfc) @coren: If `portgranter` would test each port before giving it out, you could replace the pool with a random function (or binding to 0), this also would mean that you could move that logic to the wrapper (`portgrabber... [15:11:04] 10Wikimedia-Labs-wikitech-interface: Thumbnails for Commons images are not displayed - https://phabricator.wikimedia.org/T74245#1128786 (10scfc) 5Open>3Resolved a:3scfc I don't know, but this issue works now for me as well. 
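The "test-and-remove-from-pool" behaviour discussed in T93046 could look something like the sketch below. It is an assumption-laden illustration, not portgranter's actual code: the class `PortPool`, the helper `port_is_free`, and the port numbers are all hypothetical, and the bind test is itself only a point-in-time check.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if (host, port) can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


class PortPool:
    """Hands out ports from a fixed pool, dropping any found to be in use."""

    def __init__(self, ports, is_free=port_is_free):
        self.available = list(ports)
        self.granted = {}        # client name -> port
        self.is_free = is_free   # injectable so the logic can be tested

    def grant(self, client):
        while self.available:
            port = self.available.pop(0)
            if self.is_free(port):
                self.granted[client] = port
                return port
            # Port is taken by something outside the pool: it stays removed
            # and the client simply gets the next candidate instead of failing.
        raise RuntimeError("no free ports left in pool")


if __name__ == "__main__":
    pool = PortPool(range(4000, 4010))
    print(pool.grant("example-tool"))
```

With a single granter the same port is never offered twice, which is the property Coren points to; a collision with a process outside the pool costs one pool entry rather than a failed start.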
[15:11:17] 10Wikimedia-Labs-wikitech-interface: Thumbnails for Commons images are not displayed - https://phabricator.wikimedia.org/T74245#1128791 (10scfc) a:5scfc>3None [15:12:54] 10Wikimedia-Labs-wikitech-interface: Not possible to delete files - https://phabricator.wikimedia.org/T73208#1128808 (10scfc) 5Open>3Resolved a:3scfc No, this succeeded just now: "2-roller crusher.jpg has been deleted." [15:13:17] 10Wikimedia-Labs-wikitech-interface: Not possible to delete files - https://phabricator.wikimedia.org/T73208#1128818 (10scfc) a:5scfc>3None [15:20:40] Krenair: we just added a firewall to silver so now you can only ssh via bastion [15:21:06] andrewbogott, ok, will update the docs [15:21:30] andrewbogott, so what would the criteria be for it to become silver.eqiad.wmnet instead? :p [15:21:50] Krenair: it has a public IP because it hosts wikitech, a public web server. [15:21:59] Hosts with public IPs get public dns names [15:22:08] ok [15:24:40] Why is grrrit-wm announcing l10n updates? [15:31:46] 10Wikimedia-Labs-wikitech-interface, 6operations, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1128870 (10Krinkle) [15:35:03] Coren: I do give you plenty of credit - it’s worked well! I’m just trying to optimize more :) [15:35:12] I do see how I could’ve phrased all my comments much better, however. [15:35:41] YuviPanda: Honestly, just adding the (excellent idea) test-before-you allocate will solve 99.9% of the issues. [15:35:56] And keep the robustness in case of errors. [15:36:17] What you suggest, using the uid, would work fine most of the time but the failure modes are worse. [15:36:18] Coren: well, so will test-before-you-allocate and putting that entire logic in the wrapper script, instead of a persistant server. 
[15:36:21] essentially I’m trying to get rid of state wherever possible [15:36:35] another big part to get rid of is the persistant socket needed for proxylistener [15:36:45] that should be eliminated too, somehow. [15:36:58] The persistent socket is the only reliable way to make sure proxylistener knows when a server dies. [15:37:02] marktraceur: jenkins is merging them now instead of them being force-merged...file a bug? :) [15:37:21] Coren: yeah, I don’t have a solution yet, but I think it should be explored. [15:37:33] Coren: a lot of this is coming off pain from the 4-6h long outage after GHOST :) [15:37:38] and the subsequent outages. [15:38:14] legoktm: On it. [15:38:51] YuviPanda: Sure, but that's something that's fairly easy to fix. My concern is never "does it work well" but "how does it fail" and right now there was a gaping hole in the latter. [15:39:20] sure with the uid stuff, but see scfc’s latest comment :) [15:40:11] You still end up with a race condition, only now it's even *harder* to fix when it fails because you're adding entropy! :-) [15:40:40] That said, scfc's solution is - amusingly enough - the IPv6 answer to DHCP. :-) [15:40:49] indeed, and I think we can do the same thing here. [15:41:04] Only v6 gives you 48 bits of randomness so collisions are vanishingly unlikely. [15:41:27] With less than 12, and 100 some services on a box, birthday collisions become inevitable. [15:42:01] well, I’m sure we’ll figure some way out to get rid of this much state :D [15:42:07] right now port numbers are in three places [15:42:15] portgranter, proxylistener’s open socket, and redis [15:42:18] ideally they should be just in redis [15:42:22] I agree. [15:42:32] perhaps my fault was jumping to a solution instead of just defining the problem [15:42:40] I’ll open another task and perhaps use better words this time [15:43:29] That said, having the portgrabber scoreboard file means we *should* be able to rebuild redis at any time. 
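The birthday-collision point can be checked with a quick calculation. The sketch below takes the figures quoted in the chat (about 100 services, under 12 bits of port space versus 48 bits) and assumes each service picks its number uniformly at random, which is a simplification.

```python
def collision_probability(services, bits):
    """Chance that at least two of `services` random picks from 2**bits slots coincide."""
    slots = 2 ** bits
    p_no_collision = 1.0
    for i in range(services):
        # The i-th pick must avoid the i slots already taken.
        p_no_collision *= (slots - i) / slots
    return 1.0 - p_no_collision


if __name__ == "__main__":
    print("12 bits:", collision_probability(100, 12))
    print("48 bits:", collision_probability(100, 48))
```

With 100 services in a 12-bit space the probability of at least one collision comes out at roughly 0.7, while with 48 bits it is negligible.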
[15:43:33] Coren: btw, labmon1001 is actually in trouble: http://graphite.wmflabs.org/ [15:43:55] YuviPanda: See SAL. I'm currently trying to debug the funky upstart mess. [15:44:07] Coren: ah, cool [15:44:26] Coren: I’m going to file a bug and assign to you for bookkeeping. thanks [15:44:40] Right now, the thing-that-checks-uwsgi-is-up doesn't talk right to the thing-that-tries-to-start-it [15:44:50] Labs, operations, Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128900 (yuvipanda) NEW a:coren [15:45:41] Labs, operations, Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128908 (coren) Currently being worked on. There is confusion in the upstart config that prevents it from properly determining status, confusing both puppet and icinga. [15:51:00] nuria: ^ is the bug for graphite down [15:51:54] Labs, operations, Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128934 (Nuria) [15:53:39] YuviPanda: fyi, the issue comes from the fact that all of the startup scripts are an unholy mix of init, manual control and upstart stuff with a sprinkling of systemd-ready tweaks. uwsgictl invokes upstart in an odd way, which itself does unusual things. [15:54:18] Where "odd way" means "in a way that is systemd flavoured by hacks over upstart" [15:54:42] Coren: :) I remember ori taking blame for those at some point, not sure if true or kidding [15:54:53] It seems to work well at bootstrap, but it's confusing the hell out of puppet. [15:55:23] (And icinga, which uses the upstart-style check) [15:55:55] I'm still trying to work out how everything is invoked through all paths to figure out what differs.
[15:56:03] 6Labs, 5Patch-For-Review: Investigate replacing our custom DNS code with Designate - https://phabricator.wikimedia.org/T87280#1128941 (10Andrew) Further update: The dns server running on labs-ns2.wikimedia.org is serving up ips for names like: deployment-restbase01.deployment-prep.eqiad.wmflabs Those are au... [15:56:17] 6Labs, 5Patch-For-Review: Investigate replacing our custom DNS code with Designate - https://phabricator.wikimedia.org/T87280#1128943 (10Andrew) [15:58:08] 6Labs, 5Patch-For-Review: Copy old internal instance IPs to designate - https://phabricator.wikimedia.org/T93085#1128953 (10Andrew) 3NEW [16:00:06] 6Labs, 5Patch-For-Review: Move to a new dns scheme for labs: hostname.projectname.eqiad.wmflabs - https://phabricator.wikimedia.org/T93087#1128969 (10Andrew) 3NEW [16:00:30] 6Labs, 5Patch-For-Review: Move to a new dns scheme for labs: hostname.projectname.eqiad.wmflabs - https://phabricator.wikimedia.org/T93087#1128969 (10Andrew) [16:00:31] 6Labs, 5Patch-For-Review: Copy old internal instance IPs to designate - https://phabricator.wikimedia.org/T93085#1128978 (10Andrew) [16:01:29] 6Labs, 5Patch-For-Review: Use Designate for public/floating labs IPs - https://phabricator.wikimedia.org/T93088#1128990 (10Andrew) 3NEW [16:02:25] 6Labs, 5Patch-For-Review: Set up designate-dashboard - https://phabricator.wikimedia.org/T93089#1129001 (10Andrew) 3NEW [16:03:01] 6Labs, 6operations, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1129013 (10coren) More data: Currently, three distinct interfaces to manage uwsgi are in place: * sysvinit * upstart (which is split in subservices) * /sbin/uwsgictl (which invokes the upstart interface)... 
[16:05:23] 10Wikimedia-Labs-wikitech-interface: Project Bastion has service groups - https://phabricator.wikimedia.org/T64537#1129020 (10Krenair) [16:06:14] 10Wikimedia-Labs-wikitech-interface: Project Bastion has service groups - https://phabricator.wikimedia.org/T64537#666597 (10Krenair) Now we also have ziad, bastionramtonz, and jybot [16:07:41] 10MediaWiki-extensions-OpenStackManager: Service group order - https://phabricator.wikimedia.org/T50405#1129034 (10Krenair) [16:07:52] 10MediaWiki-extensions-OpenStackManager: Service group order - https://phabricator.wikimedia.org/T50405#514593 (10Krenair) [16:08:34] 10Wikimedia-Labs-wikitech-interface, 6Phabricator: Migrate new Labs projects request process to Phabricator - https://phabricator.wikimedia.org/T72626#1129038 (10Krenair) [16:08:57] 10Wikimedia-Labs-wikitech-interface, 6Phabricator: Migrate shell access request process to Phabricator - https://phabricator.wikimedia.org/T72627#1129043 (10Krenair) [16:36:06] YuviPanda|brb: I have registered on wikitech, username is vivekvc [16:45:43] 10Wikimedia-Labs-wikitech-interface, 10Wikimedia-IRC, 6operations: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1129227 (10Krenair) a:3Krenair I'll volunteer for it. [16:56:55] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1129294 (10yuvipanda) So here's the problem as I see it: Our webservice setup has the same 'state' maintained across different places, and they all need to be kept in sync. Currently, 'state' is kept in the following places: 1. por... [16:59:50] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1129306 (10coren) It does, very much so. How about we set up a hangout to work out plausible solutions and report the result of that discussion back here before we settle on a specific implementation? 
[17:00:45] 10Tool-Labs: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1129307 (10yuvipanda) Not yet :) I'm still deep in betalabs stuff, and I guess @scfc would like to be involved as well :) Let's keep noodling around here (maybe retitle if current title is too presemptuous?). [17:00:58] Coren: I’m also moving to SF in like, 9 days :) so I should hold off on doing much until after [17:01:12] Good point. [17:01:23] I take it your visa crap is now sorted out? Yeay! [17:01:39] Coren: yup! took only 2y, but most of that is my fault... [17:02:07] YuviPanda|brb: BTW, my first reflex would be "let redis be the source of authority and do all registration to/from there" [17:02:22] Coren: yeah, I’d agree, since redis is replicatable :) [17:02:58] Get rid of proxylistener entirely, and of the hack to the server wrapper that do that, and keep all state in redis. [17:03:39] YuviPanda|brb: Going through https://wikitech.wikimedia.org [17:03:51] (Also the scoreboard, etc) [17:04:13] Coren: andrewbogott Vivek wants to volunteer with some labs stuff, and is looking for things to do :) he does puppet / openstack in his dayjob, afaik :) [17:04:14] Although the plain text scoreboard is useful to track things down. Can be replaced with a query to redis. [17:04:22] Vivek: Coren and andrewbogott are the other admins on labs :) [17:04:26] Coren: yeah, but redis-cli :) [17:04:49] YuviPanda|brb: Needs to be made easier to parse than the keys as they are though. [17:05:21] Coren: so yeah, then it basically becomes just redis and OGE, and that’s all good. We should probably also tie this into ‘service manifests’, so ultimately state is only on NFS, and we can reconstruct that from scratch at any time. [17:05:27] It's hard to be cleaner than /data/project/.system/dynamic [17:06:00] Hi Coren [17:06:12] Nice meeting you. [17:06:15] Vivek: have you written any Horizon code? 
[17:06:28] YuviPanda|brb: I don't think that matches the scope of the manifests (what we want vs current state) though they'll obviously will want to talk to each other. [17:06:33] Vivek: o/ [17:06:42] Vivek: hello, btw :) [17:07:17] Vivek: If you can swing by, the Hackathon in Lyon at the end of May would be a most excellent place to meet. [17:07:18] andrewbogott: No [17:07:20] Coren: yeah, current state is they don’t exist, but a lot of this has to be hashed out, sure. I think this would end up being quarterly goal for toollabs next quarter. [17:07:37] YuviPanda|brb: That would sound good to me. [17:07:40] andrewbogott: Coren I think he hasn’t actually spent any time with our infrastructure at aall (just signed up on wikitech) :) [17:07:53] andrewbogott: Hi :) [17:08:23] Vivek: actually, this might be a good thing to work on — I don’t think it requires root access, most of the groundwork can be done on a labs instance: https://phabricator.wikimedia.org/T93087 [17:08:36] Vivek: that’s mostly my scrawled notes to myself, though, so may not make sense [17:09:04] It would be puppet work for the most part, might ultimately need to build new VM images as well (but that can also be done entirely with labs access) [17:09:59] Holey sheets. [17:10:31] Coren: ? [17:10:47] Coren: I'll try to be there,does the wikimedia foundation have any sponsorship for Lyon ? [17:10:50] andrewbogott: There's one tool that's... painful on the file count and total size. :-) [17:10:57] Coren: ah [17:11:27] Vivek: There are, but it might be too late to apply now. Though if Andrew or Yuvi are without a buddy you might qualify and squeak by. [17:11:44] Vivek: I might be able to sponsor you, I will look in to it. [17:12:09] YuviPanda|brb: I originally claimed that I was going to work on proxy-redundancy at the hackathon, but maybe that’s already done? [17:12:18] andrewbogott: yup, already done. [17:12:50] andrewbogott: cool. 
[17:13:28] ok, in that case it probably makes sense to switch over to some designate tasks and see about bringing Vivek along. Vivek, want to send me a brief note of introduction? Not a resume but just a ‘I am working at X on Y’ so I have some context? [17:13:45] Vivek: oh, and maybe where you live as well :) [17:14:43] Vivek: abogott@wikimedia.org [17:19:28] (Indeed, it's easier to get you sponsorship to Lyon if you live in Genève than Tõkyõ) :-) [17:19:46] * Coren stares at his keyboard evilly. [17:20:09] Those were supposed to be macrons, not tildes. [17:20:54] andrewbogott: Sure. [17:21:43] Coren: as staff did you register yourself on the dons.wikimedia.fr registration page, or did that get done by Rachel or Quim or someone? [17:21:59] andrewbogott: Quim's email said to register so I did. [17:22:09] oops, I guess i will do that then :) [17:22:09] andrewbogott: Designate is slated for Kilo right ? [17:22:41] Vivek, there are tags for icehouse and juno, I’m not sure if it’ll make it into official openstack packages until Kilo, if then [17:22:47] But I’m running the icehouse packages now, it’s working ok. [17:23:00] well, I built them myself from the debian source package [17:24:27] Ok. [17:27:24] Vivek: do you have a labs account? And, if so, what’s your username? [17:32:04] oh, I guess actually this form wants a phab account link [17:37:00] andrewbogott: I am in Chennai,India. [17:37:45] andrewbogott: I have a wikitech account. [17:37:50] Vivek: ok. If you create a phab account right now I can link to it on this application I’m filling out. [17:38:01] It's vivekvc [17:38:02] That’s for some reason a separate creation from labs [17:38:09] Ok. [17:38:25] Ah, you have one already! [17:38:28] I was just misspelling your name. [17:38:44] Ok. [17:42:41] Vivek: one more piece of homework… visit https://dons.wikimedia.fr/civicrm/event/register?reset=1&id=8 and fill out that form as best you can. 
For the ‘hackathon’ buddy you can link to me at https://phabricator.wikimedia.org/p/Andrew/ [17:43:05] ‘as best you can’ meaning that it wants to know exactly what you’ll be working on which of course no one actually does [17:43:39] andrewbogott: ok. [17:46:25] andrewbogott: I've mailed you my resume. [17:46:38] Vivek: I see it — thanks! [17:51:06] Wikimedia-Labs-wikitech-interface, operations: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1129484 (Krenair) Unfortunately those files (silver.wikimedia.org:/a/backup/public/labswiki-*.gz) appear to be owned by root, which I'm guessing is why I can't download them fr... [18:00:57] Coren: Any word on graphite on labs? [18:01:54] nuria: Graphite should be up; it's the alerting that's sick. [18:02:11] nuria: I have to go to sleep in a bit, however :( [18:02:21] Coren: ok, so i should be able to send metrics to graphite via statsd now, correct? [18:02:31] YuviPanda: understood man, no worries [18:02:49] nuria: however :D [18:02:50] nuria: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1426701760.768&target=wikimetrics.line_rate.value [18:02:54] nuria: your data is coming through indeed [18:03:02] YuviPanda: oohhhh [18:03:08] nuria: what you might have initially seen is perhaps metric creation lag? takes a few minutes for it to be initially set up.. [18:03:40] YuviPanda: is this accessible through the main screen somewhere? http://graphite.wmflabs.org/ [18:04:11] nuria: yes, that’s how I found it. expand ‘graphite’ on the left, and under ‘wikimetrics’ I found it [18:04:25] nuria: however, if this is under the ‘analytics’ project, I ask you to make the prefix be ‘analytics.wikimetrics’ [18:04:44] nuria: basically, leftmost element should be name of labs project this is running on... 
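(Editor's aside on the naming convention just described: a minimal Python sketch, assuming the usual plain-text statsd datagram format "name:value|type" sent over UDP. The helper names, the gauge type, and the localhost:8125 endpoint are illustrative assumptions, not anything stated in the channel.)

```python
import socket


def prefixed_metric(project: str, name: str) -> str:
    """Make the labs project name the leftmost element of the metric name."""
    if name == project or name.startswith(project + "."):
        return name  # already prefixed, leave it alone
    return f"{project}.{name}"


def statsd_line(metric: str, value, kind: str = "g") -> bytes:
    """Format one statsd datagram, e.g. b'a.b.c:3|g' for a gauge."""
    return f"{metric}:{value}|{kind}".encode("ascii")


def send_metric(project, name, value, kind="g",
                host="localhost", port=8125):
    """Send a single metric over UDP; host and port are placeholders."""
    payload = statsd_line(prefixed_metric(project, name), value, kind)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

(With these helpers, prefixed_metric("analytics", "wikimetrics.line_rate") yields "analytics.wikimetrics.line_rate", which is the shape of name requested above.)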
[18:05:15] YuviPanda: ok, i looked there yesterday to no avail, thank you, will change names to make everything prettier, [18:05:24] nuria: :D sweet [18:05:47] nuria: I’m also moving to SF in about 9 days :D beat madhu by about 10 days :D [18:06:48] YuviPanda: NICEEEEEE!!!! [18:21:25] YuviPanda: Did you find a good flat? [18:21:43] Coren: HAHAHAHAHAHAHAHAHHAAAAAA [18:21:50] Coren haven’t even started looking [18:22:16] what. Planning to live on Mission st. until you do? :-) [18:22:43] Coren: fortunately, Daniel is an expert at SF suitcase living, so he’ll have all kinds of pointers for Yuvi. [18:23:01] Coren: no, so basically I’ve a week at Dan (Deskana), and I’ll probably find an airbnb for a month after that, and then I have NY -> Lyon -> Glasgow [18:23:14] yeah, and that too. [18:24:45] YuviPanda: what dates do you need airbnb for? [18:25:16] andrewbogott: I think 4/5th april to 9/10th may [18:25:18] I can give you some leads, if you’re interested in private apartments (that cost $1-2k/month). If you’re looking for a couch to surf then I’m no help. [18:25:35] andrewbogott: private apartments sound nice... [18:25:40] andrewbogott: leads would be GREAT. [18:25:43] ok, will email [18:25:43] * Coren chokes on those prices. [18:27:06] Seriously! My mortgage+taxes for my 200m² house with a nice large garden is ~$900 USD/month [18:27:18] Wikimedia-Labs-wikitech-interface, Wikimedia-IRC, operations, Patch-For-Review: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1129670 (Petrb) Oh yes! After 3 years this might finally get solved, then I will send a bottle of Champagne to WMF ops. [18:27:28] More like $800, actually. [18:32:11] Coren: My house has 7 bedrooms/6 baths and my house payment is less than I spend for a month when I’m in SF. I’m getting used to it but I sure do feel rich being back in MN. [18:32:45] andrewbogott: Yeah, same principle I guess. 
[18:33:14] andrewbogott: My house only has 3 beds/3 bath but we've got lots of office space (by design) [18:33:59] andrewbogott: But I'm pretty sure that by the time you reach 7 bedrooms, you're supposed to start saying "manor" :-) [18:34:03] Coren: I should add that there are six adults living in my house — it’s not just that I’m averse to using the same toilet twice in a row. [18:34:33] Six adults? Flatmates or extended family? [18:35:13] flatmates, friends, tenants [18:35:14] a mix :) [18:35:19] Coren: I think andrewbogott is literally a landlord [18:35:27] It’s a weird house. My parents think it used to be a brothel. [18:36:14] Interesting arrangement. Not sure I could strive in that setting nowadays, though I suppose it might have been fun when younger. [18:36:52] There’s a fair bit of privacy when I need it. I have a little mini kitchen in my part of the house. [18:37:33] No doubt - we're at that point in our lives where the idea of having other people even live in the same building we do is unpalatable. :-) [18:39:35] YuviPanda: OK, I sent you three refs. Note that the last apartment (Irene’s) is the most likely to be available and also by far the most expensive :( [18:40:08] YuviPanda: also, check with mutante, he has opinions about which weekly hotels are affordable yet non-creepy. [18:42:50] YuviPanda: That's going to be quite a shock for you moving from opulent freedom to SF. :-) [18:43:15] Coren: fwiw, I haven’t been very opulent here :D everything I own still fits into one suitcase... [18:43:41] YuviPanda: Didn't you say that was by choice so you could keep moving from place to place as the mood strikes? [18:43:52] Coren: totally! [18:43:55] MusikAnimal: is there a problem with XTools ? [18:44:14] Coren: and well, right now I’m getting my flights to / from Glasgow booked, so maybe not much has changed? :) [18:44:44] I suppose. You haven't started paying rent/bart/utilities in SF though. 
:-P [18:45:28] pleclown: working for me [18:45:45] Coren: true, true :) [18:46:00] MusikAnimal: it is working now [18:46:19] MusikAnimal: I had an error 2 minutes ago [18:46:29] Coren, privacy: http://www.bogott.net/2012/attic/index.html [18:47:18] andrewbogott: Cool. [18:47:53] Coren: YuviPanda: where can I check on the status of the replication databases? like if they're lagging or down altogether [18:48:48] MusikAnimal: hmm, afaik there’s no ‘public’ place, but me / Coren / andrewbogott can check by going to https://tendril.wikimedia.org/tree [18:48:51] (no lag atm) [18:49:48] was it down or lagging within the past half hour? we get "no revisions" on the articleinfo tool from time to time, or 0 edits on the edit counter [18:49:54] I'm led to believe it's the repl db [18:52:43] ^ YuviPanda [18:53:09] MusikAnimal: nope, no indications of such. [18:54:00] Hmm ok [19:18:40] Tool-Labs: The portgranter service isn't puppetized - https://phabricator.wikimedia.org/T93120#1129884 (scfc) NEW a:scfc [19:22:07] Tool-Labs: The proxylistener service isn't puppetized - https://phabricator.wikimedia.org/T93121#1129895 (scfc) NEW a:scfc [19:40:36] andrewbogott: so, the ENC is taking shape now: https://gerrit.wikimedia.org/r/197712 [19:40:44] andrewbogott: I think when that’s done, it’s ‘fast enough’ to be put on virt1000... [19:40:53] great! [19:40:54] and I’ll keep an eye on its memory profile and what not. [19:41:18] andrewbogott: so this is YAML+LDAP, and in the future we can figure out what we want to do for horizon. [19:41:34] andrewbogott: and it runs as an HTTP service. I wouldn’t call it REST since it only has one GET endpoint... [19:43:06] YuviPanda: it doesn’t require ldap, right? It just integrates with the ldap def if there is one? [19:43:22] YuviPanda: wait, if it only has GET then how do you … modify anything? [19:43:27] andrewbogott: right now it does require LDAP, since I need to translate ec2id to hostname, since YAML specifies hostnames. 
[19:43:36] ah, mn. [19:43:38] hm [19:43:39] andrewbogott: you modify them by adding a .yaml file to ops/puppet :) [19:43:45] that’s unfortunate but reasonable. [19:44:01] andrewbogott: https://github.com/wikimedia/operations-puppet/blob/production/nodes/labs/staging.yaml [19:44:58] andrewbogott: so once we figure out how we want Horizon’s puppet integration to look, we can extend this to be a proper REST service with multiple databackends [19:45:10] yeah, makes sense. [19:45:14] easy enough for it to be MySQL+YAML, for example [19:45:23] Coren: looks like https://graphite.wmflabs.org/ is down [19:45:32] andrewbogott: we should also spend the time needed to write that nova plugin so we can make puppet and salt use hostnames... [19:45:40] andrewbogott: just needs a nova plugin that does cleanup. [19:45:43] nuria: ori is working on it atm; it may go up and down a couple times for a while. [19:45:44] on delete [19:46:04] YuviPanda: yeah, that should be pretty simple. [19:46:08] Coren: ok, very well, will look at stuff later then [19:46:16] Coren: Thanks for the prompt response [19:46:19] andrewbogott: yup, and then I can take LDAP out of the equation [19:46:56] andrewbogott: hmm, another reason LDAP is needed is for me to figure out which project the hostname is in... [19:47:45] YuviPanda: yeah… you can learn all those things by talking directly to nova if you’re running on virt1000. [19:48:03] andrewbogott: ah, right. cool, cool :) [19:48:32] andrewbogott: alright, so I’ll focus on packaging this up with appropriate bits (logging, an upstart job, etc), and then try to get this on virt1000 before my move :) [19:48:45] yep, sounds good. [20:04:12] Labs, operations, Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1130015 (coren) The root cause of the issue was the switch to using cgroup via https://gerrit.wikimedia.org/r/#/c/183256/ which made uwsgi require root to start, whereas the upstart scripts drop root priv... 
[20:04:13] Labs, operations, Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1130018 (coren) Open>Resolved [21:37:01] (CR) BBlack: [C: 1] ":)" [labs/private] - https://gerrit.wikimedia.org/r/197385 (owner: Dzahn) [21:51:57] (CR) Dzahn: [C: 2] add Brandon Black to labs roots [labs/private] - https://gerrit.wikimedia.org/r/197385 (owner: Dzahn) [21:52:14] (CR) Dzahn: [V: 2] add Brandon Black to labs roots [labs/private] - https://gerrit.wikimedia.org/r/197385 (owner: Dzahn) [22:39:07] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130491 (RobH) a:Andrew [22:39:20] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1063566 (RobH) @Andrew: Please address the above and assign back to me, thanks. [22:39:36] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130493 (RobH) p:Triage>Normal [23:13:41] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130655 (Andrew) There's not a current hardware issue. Faidon suggested that we replace this host, and I'd like to replace it mostly in order to upgrade to Trusty with minima... [23:39:11] Wikimedia-Labs-wikitech-interface, Wikimedia-IRC, operations, Patch-For-Review: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1130712 (Krenair) Open>Resolved We waited 3 years for that?