[00:26:51] Wikimedia-DNS, operations: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980396 (Krenair)
[00:27:21] Wikimedia-DNS, operations: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980408 (yuvipanda) local testing with dnsmasq (for example) returns ::1 for localhost.
[01:52:00] "we still haven't closed the insecure-POST problem."
[01:52:53] anomie discovered this independently this week. We are going to add logging in api.php to find the folks still doing this so we can whine at them.
[01:53:19] See https://gerrit.wikimedia.org/r/#/c/266958/
[01:55:33] I fell into a hole
[01:55:39] called 'resolving localhost by hitting DNS'
[01:55:42] what a clusterfuck
[01:56:01] google dns responds with NXDOMAIN
[01:56:05] for A and AAAA
[01:56:27] our DNS responds with 127.0.0.1 for A but NXDOMAIN for AAAA
[01:56:42] and dnsmasq responds properly for both
[01:56:56] and then you have nginx, which only uses DNS for name resolution
[01:57:11] which means if you try to proxy things from nginx to localhost (vs 127.0.0.1) you are kind of fucked
[01:57:20] that took a day
[02:00:39] YuviPanda: you know that localhost == 127.0.0.1 == ::1 right ;)
[02:01:30] bd808: yes but nginx does not
[02:01:40] so it says 'localhost; can not resolve domain'
[02:01:46] and fucks right off
[02:01:50] can you not tell it to talk to 127.0.0.1 ?
[02:01:59] and the nginx people are basically like 'wat, that is dns servers being crazy'
[02:02:01] or probably better ::1
[02:02:08] bd808: well, so this is nginx coming from jupyterhub
[02:02:14] and jupyterhub defaults to calling itself localhost
[02:02:17] instead of 127.0.0.1
[02:02:28] so I'm basically in the intersection of like 3 different projects
[02:02:40] anyway, in my glue code I just rewrote localhost to 127.0.0.1
[02:02:43] aka dependency hell
[02:02:43] and left a big comment
[02:02:50] bd808: yes except over HTTP :)
[02:02:52] and DNS
[02:02:54] and network protocols
[02:02:57] rather than whatever
[02:03:09] GLORIOUS NEW WORLD, A BIT WORSE THAN GLORIOUS OLD WORLD
[02:03:32] I suppose nginx implements its own resolver too
[02:03:39] yes
[02:03:43] because gethostbyname isn't async
[02:03:46] right
[02:04:32] bd808: and I guess dns servers assume everyone uses gethostbyname
[02:04:46] lol. my adblocker (uBlock) blocks sourceforge
[02:05:05] "Found in: uBlock filters – Badware risks"
[02:05:35] :D
[02:05:39] it's been doing that for many months now
[02:05:43] lucky you didn't have to hit sourceforge
[02:05:46] YuviPanda: well... it would be a bit goofy to treat a bare "localhost" as a TLD
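The split being described above is between two resolution paths: getaddrinfo() goes through NSS and therefore consults /etc/hosts (where localhost is always defined), while nginx's built-in resolver, per the discussion, only hits DNS and so inherits whatever the server answers. A minimal Python sketch of the contrast; the direct-query half assumes the third-party dnspython package, everything else is stdlib:

```python
# Contrast the two resolution paths: getaddrinfo() goes through NSS,
# which consults /etc/hosts first, so "localhost" always resolves;
# a query sent straight to a DNS server (what nginx's internal
# resolver does) bypasses /etc/hosts and may get NXDOMAIN instead.
import socket

import dns.resolver  # third-party: pip install dnspython

# The libc-style path: reads /etc/hosts, never needs the network.
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 80):
    print("getaddrinfo:", sockaddr[0])

# The raw-DNS path: ask the configured resolver directly, once per
# record type, mirroring the A/AAAA split described in the log above.
for rrtype in ("A", "AAAA"):
    try:
        answer = dns.resolver.resolve("localhost.", rrtype)
        print("DNS", rrtype, [str(r) for r in answer])
    except dns.resolver.NXDOMAIN:
        print("DNS", rrtype, "NXDOMAIN")
    except dns.resolver.NoAnswer:
        print("DNS", rrtype, "no answer (NODATA)")
```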
[02:05:47] * YuviPanda glares at gridengine
[02:05:57] bd808: yeah, so that seems fine too (from DNS perspective)
[02:06:43] bd808: I guess I should try to change jupyterhub too
[02:07:24] hardcoding "localhost" is kinda dumb I think
[02:07:34] I don't think it's hardcoded
[02:07:39] I need to go find out how it determines that
[02:07:44] and maybe fix that to point to an ip
[02:08:04] seems reasonable
[02:08:13] yeah
[02:08:23] except I did spend a few hours on it
[02:08:28] but I guess that's how you learn
[02:08:42] unless they are using a hostname so that you could have some sort of cave dweller lb using round robin dns
[02:10:45] bd808: nah, they are very SPOF-y now
[02:10:57] I'm slowly fixing all the things for our install and upstream is graciously accepting them :D
[02:11:01] * YuviPanda is at UCB at their lab now
[02:11:34] I was telling them tales of {{int:}} and how mw accidentally got 'if'
[02:24:14] bd808: jesus https://github.com/jupyter/jupyterhub/blob/2632d03dc2289a42cb31c14821ee455f2aa4bf0e/jupyterhub/utils.py#L198
[02:48:01] bd808: there's a ticket at https://phabricator.wikimedia.org/T105794
[02:48:10] (re: insecure post)
[02:48:24] I was kinda tracking them down for a while, but got busy with other things heh
[02:49:15] YuviPanda: nginx is right, NXDOMAIN is total failure for all other record types (and subdomain names too)
[02:49:54] bblack: right, so our DNS server returns NXDOMAIN for AAAA for localhost but 127.0.0.1 for A
[02:50:00] our internal dns resolver that is
[02:50:04] yeah, that's definitely wrong
[02:50:15] but on the other other hand, nobody should ever be asking a DNS server for that anyways
[02:50:20] right
[02:50:23] it should be in /etc/hosts
[02:50:33] yes, except then nginx can never resolve localhost :D
[02:50:38] since they only hit the resolver
[02:50:58] if nginx implemented their own resolver and it doesn't use information from /etc/hosts, that's broken too
[02:53:17] right
[02:53:29] bblack: and jupyterhub is using localhost instead of 127.0.0.1 for some strange reason
[02:53:32] so that's broken too
[02:53:35] in short
[02:53:41] all the three things I'm gluing together are broken :D
[02:56:14] we can fix the pdns_recursor I'm sure
[02:56:24] I'll look into it a little later
[02:56:58] bblack: <3 thanks
[02:57:05] bblack: I'm looking into fixing jupyterhub too
[03:01:54] YuviPanda: ummm... that jupyterhub code is disturbing. https://github.com/jupyter/jupyterhub/commit/4785a1ef87351202ebf06ab2cdf4c7d395d9c9eb
[03:13:00] bd808: yeah
[03:13:04] bd808: I have demanded an explanation
[03:13:12] one person vaguely wrinkled their face and said 'I think it was windows'
[03:17:19] HTTPS, OTRS, operations: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1980721 (Matthewrbowker)
[04:35:37] YuviPanda: https://gerrit.wikimedia.org/r/267208 (I tested this config manually, but I haven't compiler-checked or tested the puppetization)
[05:08:47] bblack: awesome! I added a ticket number to the commit message
[06:34:29] bd808: so apparently some machines that have ipv6 only (!?!) don't like it when you connect on 127.0.0.1
[06:36:22] YuviPanda: hmm... yeah I could see that. IPv6 only isn't incredibly common but it does exist
[06:36:40] seems like they could detect that easier and use ::1 instead
[06:36:43] yeah
[06:37:10] is it a bind for a server or for IPC?
[06:37:22] if it's a bind then 0.0.0.0 is better I think
[06:37:38] it's SOA so only accessible on localhost usually
[06:37:44] sits behind an authenticating proxy
[06:38:06] ah right. and your trying to get the proxy to point to the backing service
[06:38:17] *you're
[06:38:27] right
[06:40:01] bd808: so I guess pretty much everyone had legit reasons
[06:40:11] except maybe nginx although they too do kinda have a legit reason
[06:40:24] might not be the best solution
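A sketch of the detection bd808 suggests at [06:36:40] above: rather than hard-coding "localhost" or "127.0.0.1" for a connect target, probe whether IPv4 loopback is actually usable and fall back to ::1 on IPv6-only hosts. The helper name is hypothetical, not jupyterhub's actual API:

```python
# Hypothetical helper (not jupyterhub's actual code): pick a loopback
# literal instead of hard-coding "localhost" or "127.0.0.1".  On an
# IPv6-only host, creating or binding an AF_INET socket on 127.0.0.1
# fails, so we fall back to ::1.
import socket

def pick_loopback() -> str:
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", 0))  # port 0 = any free port
            return "127.0.0.1"
        finally:
            s.close()
    except OSError:
        # No usable IPv4 loopback; assume IPv6-only and use ::1.
        return "::1"

print(pick_loopback())
```

For a listening socket the answer upthread still applies: binding 0.0.0.0 (or ::) sidesteps the question; the probe only matters when a client has to pick a literal to connect to.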
[10:15:49] well these days anything that does IP at all should be doing v6, IMHO
[10:15:58] we're kinda way past that point
[10:18:35] ema: FYI I left cp1060 in the pool intentionally, because ottomata/joal need a heads up before the last bit of traffic falls off for analytics
[10:18:48] Hi bblack
[10:18:54] Just noticed yesterday's ping
[10:19:10] oh hey :)
[10:19:18] bblack: hi :)
[10:19:22] bblack: We are in cluster-fuck currently, so a bit less, a bit more ;)
[10:19:37] joal: basically we're almost done, but I left one server in so your data rate doesn't go to zero until you're ready
[10:21:35] Great, thanks bblack
[10:22:17] bblack: I need to fix a lot of stuff on the cluster this morning, so we'll probably ask you to remove the last server from the pool on Monday I guess (if everything is back in shape)
[10:22:42] joal: ok, we can do that
[10:22:55] awesome, thanks
[10:23:09] bblack: if things get fixed earlier, we'll let you know, but I doubt that
[10:23:17] ok :)
[10:29:34] bblack: it looks depooled to me actually: {"cp1060.eqiad.wmnet": {"pooled": "no", "weight": 10}}
[10:37:54] in nginx?
[10:38:08] it's only depooled in varnish-fe
[10:38:12] ema: ^
[10:39:38] bblack: oh I see! I was listing hosts in varnish-fe
[10:39:53] yeah I went ahead and pulled it there since that's mostly just 301 redirects
[10:40:01] just left it weight=1 in the nginx service
[10:40:12] yes I've seen that
[10:46:30] Traffic, RESTBase, Services, operations, Patch-For-Review: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1981161 (BBlack) Open→Resolved a: BBlack
[10:46:34] Traffic, Services, operations: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981163 (BBlack)
[10:46:38] Traffic, Services, operations: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1578453 (BBlack)
[10:51:23] heh
[10:51:34] I guess I updated too many tickets at once :P
[11:26:12] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1981220 (BBlack) Here's a better list: cxserver deploy: https://github.com/wikimedia/mediawiki-services-cxserve...
[13:16:46] Traffic, Services, operations, Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981369 (BBlack) FWIW, in a 1 hour snapshot of all traffic to parsoidcache (regardless of internal vs external IPs), when varnish/pybal monitoring checks are excluded, we're left w...
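As an aside, the pooled-state output ema quotes at [10:29:34] above is plain JSON, so the same check is easy to script. A minimal sketch over exactly the object shown in the log; the shape assumed here is only what that one chat line reveals:

```python
# Parse the confctl-style pooled-state JSON quoted at [10:29:34] and
# report which hosts are depooled.  The structure assumed is only what
# that one line shows: {hostname: {"pooled": ..., "weight": ...}}.
import json

raw = '{"cp1060.eqiad.wmnet": {"pooled": "no", "weight": 10}}'

for host, attrs in json.loads(raw).items():
    status = "pooled" if attrs["pooled"] == "yes" else "depooled"
    print(f"{host}: {status} (weight={attrs['weight']})")
# -> cp1060.eqiad.wmnet: depooled (weight=10)
```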
[13:25:06] HTTPS, Wikimedia-Blog: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#1981397 (Chmarkine)
[13:28:00] HTTPS, Wikimedia-Blog: Wikimedia blog has unsecured elements on https - https://phabricator.wikimedia.org/T64488#1981399 (Chmarkine)
[13:59:52] HTTPS, Wikimedia-Blog: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#1981480 (Krenair) I did https://wikimediafoundation.org/w/index.php?title=Template:Blogbox&diff=prev&oldid=104829, so this should now work as soon as {T104726} is fixed. {T105905} is...
[14:03:58] HTTPS, Wikimedia-Blog: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#1981505 (Krenair) And also https://wikimediafoundation.org/w/index.php?title=Spotlight_on_Wikimedia_Commons/Updates&diff=104830&oldid=92778 - but interestingly, that appears to have m...
[14:10:56] bblack: https://phabricator.wikimedia.org/P2542
[14:11:36] I've tried varnish 4.1.1 behavior on PURGE requests. Is this test enough to conclude we don't need the keepalive patch?
[14:15:32] ema: no, not really
[14:16:12] so the gist of the issue here is that if we just let it return "200 OK", it will actually send back the page content (as shown with your e.g. Content-Length: 240)
[14:16:30] the page contents are large, and we don't want to send them all back to vhtcpd constantly as part of the purge
[14:16:38] right
[14:17:26] in varnish3-land, we kill the content by using 204, which puts us in vcl_error, and then vcl_error by default will close the connection and not honor keepalive
[14:17:53] oh so I should try with a 204 instead
[14:17:59] yeah
[14:18:15] I mean really, even a 200 would work, if there's a way to tell varnish "return 200 on this purge, but don't send content"
[14:18:31] but the standards-y way to do that is 204
[14:18:57] 10.2.5 204 No Content
[14:18:57] The server has fulfilled the request but does not need to return an entity-body,
[14:19:00] ...
[14:19:11] yeah that's better
[14:24:32] Traffic, Zero, operations, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981540 (BBlack) Status update: We're pretty much done with the cache traffic migration, but there's still 1x eqiad mobile cache (cp1060) pooled with low weight to keep mobile...
[14:26:12] ema: rewinding a bit on all the above: the reason we don't want to send the page content back is because it will kill purge performance. the purger daemon would actually have to read (to a trash buffer) all the data before it can send another purge request over the keepalive conn.
[14:28:15] of course, and page contents are useless data from the point of view of vhtcpd anyways :)
[14:30:42] * elukey looks into https://github.com/wikimedia/operations-software-varnish-vhtcpd
[14:34:58] bblack: 4.1.1 keeps the connection alive when returning 204 as well :)
[14:35:06] Traffic, Zero, operations, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981548 (Ottomata) Ok great! We’re having some issues with jobs right now due to some Kafka problems, and we’ll want to make sure everything is fine before we try to move on t...
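bblack's throughput point at [14:26:12] above is visible from the purger's side of the connection: any response body has to be drained (read into a trash buffer) before the next request can go out on the same keepalive connection, and a 204 guarantees that drain is zero bytes. A rough stdlib-only illustration; host, port, and paths are placeholders, and this is a sketch of the idea rather than vhtcpd itself:

```python
# Sketch of a purger reusing one keepalive connection, in the spirit
# of vhtcpd (not its actual code); host, port, and paths are
# placeholders.  http.client sends arbitrary methods, so PURGE works
# like any other verb.
import http.client

conn = http.client.HTTPConnection("127.0.0.1", 80)
for path in ("/wiki/Foo", "/wiki/Bar"):
    conn.request("PURGE", path)
    resp = conn.getresponse()
    # The body must be fully read before the next request can reuse
    # this connection -- the "trash buffer" read from the discussion.
    # With a 200 carrying the cached page, this drain is expensive;
    # a 204 guarantees it is zero bytes.
    wasted = len(resp.read())
    print(path, resp.status, "drained", wasted, "bytes")
conn.close()
```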
[14:35:33] that's because in varnish4 you purge with return(purge);
[14:35:34] ema: nice :)
[14:35:37] which calls vcl_purge
[14:35:45] so it does not go through error at all
[14:35:58] and then you can change the response however you like in vcl_purge itself
[14:36:13] so basically you're doing return (purge) in vcl_recv, and return 204 in vcl_purge?
[14:36:25] sub vcl_recv {
[14:36:25]     if (req.method == "PURGE") {
[14:36:25]         return(purge);
[14:36:25]     }
[14:36:25] }
[14:36:27] sub vcl_purge {
[14:36:30]     return (synth(204, "Purged"));
[14:36:32] }
[14:36:38] ok, that makes sense
[14:36:46] so yeah, no need for any variant of the old keepalive patch
[14:36:55] \o/
[14:47:18] ema: just curious, what is the current setting??
[14:47:22] (if you have time)
[14:50:41] Traffic, operations: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1981574 (ema)
[14:50:44] Traffic, operations: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1981572 (ema) Open→Resolved Some of the patches have been tackled in https://phabricator.wikimedia.org/T124281. Some other patches are not needed anymore. The remaining ones have been forw...
[14:54:53] oooh, so resolved != done on https://phabricator.wikimedia.org/tag/traffic/
[14:56:27] I've marked T122880 as resolved and it's now gone from the board. I was expecting to see it under "Done"
[14:57:03] elukey: which setting? :)
[15:03:49] OK the guys here just explained that there is no relationship between a task status and board columns.
[15:06:14] :)
[15:06:16] ema: nevermind, I'll re-read the phab task :)
[15:06:16] yeah
[15:07:28] ema: the idea of even having a "Done" column was that the task might still be Open because someone else needs to do some action (e.g. analytics or dc-ops), but there's really no Traffic work left there
[15:07:43] but in practice it really doesn't matter and the column could be deleted for all I care, I think
[15:08:14] it gets to be a pain in the pass periodically turning off the closed-tasks filter and moving closed tasks from other columns to Done anyways
[15:08:32] and the cases where the distinction matters don't happen often anyways, and could be fixed by breaking up the tasks better
[15:09:16] heh "pain in the pass" == too much time spent in varnish
[15:11:12] for that matter, in practice I've found that at least for me personally, I'm pretty bad at moving things around between backlog/upnext/inprogress
[15:11:25] things stay stalled inprogress forever. tasks get resolved straight out of backlog, etc
[15:11:58] maybe the "virtual queue/timeline" model of columns just doesn't map well here. that and/or I'm defective at using phab :)
[15:12:39] maybe categorical columns would be more useful
[15:13:28] as in categories like "bugs that need fixing" vs "long-term architectural improvement ideas" vs "feature work for upcoming goals" or something along those sorts of lines
[15:15:28] this seems like the sort of discussion that Team Practices would have insights about :)
[15:23:52] rofl at "pain in the pass"
[15:23:55] :)
[16:14:35] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1981719 (ssastry) >>! In T110474#1981220, @BBlack wrote: > integration-visualdiff: > https://github.com/wikimedi...
[16:49:50] analytics keeps a done column just so it is on the board for standup
[16:49:55] once it's been discussed at standup
[16:50:01] it is then resolved and removed from done
[16:50:11] so, ideally, things don't stay in the done column more than 24 hours
[16:51:30] Traffic, operations: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#1981852 (Ottomata) Just so it doesn't get lost in this process: https://gerrit.wikimedia.org/r/#/c/230173/ I still want to merge that and use it one day... :)
[18:14:20] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1982145 (ssastry)
[22:28:43] Traffic, MediaWiki-API, Services, operations, Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1982972 (GWicke) @faidon, is your view that this should be handled by somebody outside ops?