[00:08:45] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1950585 (10mmodell) @dzahn: awesome, thanks!
[01:15:19] 10netops, 10Continuous-Integration-Infrastructure, 6operations, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1950863 (10Dzahn)
[01:15:47] 10netops, 10Continuous-Integration-Infrastructure, 6operations, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10Dzahn) added @netops please specify which VLAN to use for cobalt
[03:54:08] 7Varnish, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Wikimedia-Fundraising, and 3 others: Special:RecordImpression should die in a fire - https://phabricator.wikimedia.org/T45250#1951150 (10AndyRussG)
[04:09:48] ok so https://gerrit.wikimedia.org/r/#/c/265316/ looks good now. legoktm and MaxSem helped fix up the code, and I just squashed a minor unit test setup issue, but now it seems to DTRT
[04:10:07] will see about how/when we can deploy that through to the prod wikis tomorrow
[04:10:09] DTRT?
[04:10:14] Does The Right Thing
[04:10:22] or Do The Right Thing, I donno :)
[04:11:54] I'm guessing it will take a week or so to get it up and running on the normal schedules
[04:12:55] I'd like to get it going sooner, but I don't know that I can really justify whatever priority that requires. it's not like it's a security fix, and nothing's broken so long as we keep blocking on the mobile conversion.
[04:13:20] I'm just impatient :)
[04:13:53] you can put it up for SWAT
[04:16:48] well yeah
[04:16:58] I thought that still meant pushing through different groups, etc?
[04:17:29] I'm just confused I guess
[04:18:38] ok yeah if I wake up early enough to look at the branches and stuff, I'll try to get it in the morning SWAT
[04:18:41] off for now, thanks :)
[07:59:33] 10Traffic, 6operations, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951270 (10elukey) @BBlack: I didn't see the "Additionally, the following parameters are available as part of our commercial subscription:" before the directive, in the oth...
[08:02:42] Additionally, the following parameters are available as part of our commercial subscription
[08:03:00] I am starting to see a recurrent pattern in the nginx docs
[08:11:07] :((
[08:13:44] I didn't think it was so bad, or maybe it is only the upstream module
[08:23:34] 10netops, 10Continuous-Integration-Infrastructure, 6operations, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951302 (10akosiaris) So, gallium is in `public1-b-eqiad` (208.80.154.128/26). The story behind a public IP is a...
[09:01:46] 10Traffic, 6operations, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951327 (10elukey) Also I believe I got the wrong meaning of the max_fails directive, that does not mean "retry for" but just "consider this backend unavailable if x request...
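For reference on the keepalive discussion above: a minimal nginx sketch of HTTP/1.1 keepalive to a local upstream, assuming a hypothetical varnish frontend on 127.0.0.1:3127 (the address, port, and names are illustrative, not the production config). Per the public nginx docs, upstream keepalive requires HTTP/1.1 and an empty Connection header on proxied requests, and max_fails/fail_timeout mark a server unavailable after repeated failures rather than controlling retries:

    upstream varnish_frontend {
        # hypothetical local varnish listener;
        # after 3 failed attempts, consider it unavailable for 10s
        server 127.0.0.1:3127 max_fails=3 fail_timeout=10s;
        # keep up to 32 idle connections open per worker process
        keepalive 32;
    }

    server {
        listen 8080;
        location / {
            proxy_pass http://varnish_frontend;
            # both of these are required for keepalive to the upstream:
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }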
[09:01:58] 10Traffic, 6operations: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951328 (10elukey)
[09:29:49] 10Traffic, 10Analytics, 10Analytics-Cluster, 6operations: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1951353 (10elukey)
[09:31:37] --^ me spamming
[09:34:30] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951371 (10faidon) @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW, IPv6 penetration is > 10% globally and...
[10:16:58] 10netops, 10Continuous-Integration-Infrastructure, 6operations, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951435 (10hashar) `gallium.wikimedia.org` has a bunch of services which are exposed publicly via the misc-web v...
[10:33:12] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951457 (10Reedy) >>! In T100519#1951371, @faidon wrote: > @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW...
[10:50:36] 10Traffic, 6operations: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1951466 (10ema) a:3ema
[10:55:42] 10Traffic, 6operations: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1951469 (10ema) 3NEW a:3ema
[10:56:37] 10Traffic, 6operations: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#1951476 (10ema) 3NEW
[10:57:58] ema: <3
[10:59:58] 10Traffic, 6operations: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951482 (10ema) 3NEW
[11:00:00] paravoid: :)
[11:09:34] what happens when Apache returns a horrible 503 default page? Is Varnish going to replace it with a "nice" error page?
[11:11:03] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951490 (10mmodell) @faidon: I don't have any idea how to fix ipv6. I have zero experience with the systems involved and I don't even have ipv6...
[11:20:52] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951505 (10faidon) >>! In T100519#1951490, @mmodell wrote: > @faidon: I don't have any idea how to fix ipv6. I have zero experience with the sys...
[11:28:04] 10Traffic, 6operations: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1951515 (10ema) 3NEW a:3ema
[11:31:25] paravoid: I've created a git repo on gerrit https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/varnish4,branches
[11:31:35] it does not seem to be linked to diffusion though
[11:31:50] is there any magic to do in order to "connect" the two?
[11:34:24] wouldn't know!
[11:35:11] other gerrit repos are connected so I guess it's doable. Who should I bug for this?
[11:35:25] godog, maybe?
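On the 11:09 question about Apache's default 503 page: Varnish only replaces backend error bodies if the VCL asks it to. A minimal Varnish 3 sketch of the usual pattern (the markup and status check are illustrative, not the production VCL): intercept 5xx responses in vcl_fetch and emit a synthetic page from vcl_error.

    sub vcl_fetch {
        if (beresp.status >= 500 && beresp.status < 600) {
            # discard the backend's error body and jump to vcl_error
            error 503 "Service Unavailable";
        }
    }

    sub vcl_error {
        set obj.http.Content-Type = "text/html; charset=utf-8";
        synthetic {"<html><body><h1>Error "} + obj.status + " " + obj.response + {"</h1></body></html>"};
        return (deliver);
    }

Without a hook like this, Varnish passes the backend's own error page through unchanged.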
[11:41:31] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951540 (10mmodell) @faidon: I was only summarizing the discussion we (myself, @reedy, @dzahn and @chasemp) had in IRC. Please don't shoot the me...
[11:42:51] releng, probably..
[12:06:44] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951552 (10Reedy) >>! In T100519#1951505, @faidon wrote: > In any case, please approach "X is broken and I don't know how to fix it" with "can so...
[12:07:39] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951553 (10hashar) The DNS IPv6 entry was dropped yesterday because there is no ssh service listening there to serve the git repositories....
[12:43:04] 10netops, 6operations, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951614 (10mark) 3NEW
[12:44:04] 10netops, 6operations, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951622 (10mark) p:5Triage>3Normal
[12:44:48] 10netops, 6operations, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951625 (10faidon) Note that Juniper raises a system (or chassis?) alarm when the RE is down, so a check for "show chassis alarms" and "show system alarms" (as described al...
[12:45:09] uhm, why is ^^^ being echoed here?
[12:45:13] it's not #Traffic
[12:57:59] 10netops, 6operations, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951637 (10mark)
[12:58:01] 10netops, 6operations, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1951638 (10mark)
[12:59:36] 10Traffic, 6operations: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951640 (10faidon) https://github.com/fgsch/varnish3to4 is pretty good. I spent a small amount of time (less than a half hour) at some point running this + manual changes against the upload VCL and I was successfu...
[13:04:50] 10Traffic, 6operations: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951642 (10mark) And instead of inline C, we can consider using vmods too.
[13:41:37] paravoid: I put several different tags on the phab echo for this channel, and one is netops
[13:43:28] https://gerrit.wikimedia.org/r/#/c/265281/2/channels.yaml
[13:46:02] ah ok
[13:48:47] of course if we get through that tag cleanup, we won't need so many :)
[13:49:37] ah nice, legoktm merged my MFE patch to master, thanks :)
[13:50:19] now to figure out how we version and deploy extensions or whatever
[13:52:05] ok so it has matching branch/tag names for the core versions
[13:53:26] group0/1 are currently 1.27.0-wmf.11 and tonight's train moves group2 to the same, maybe just do it for .11 and on
[13:57:09] ok so https://gerrit.wikimedia.org/r/#/c/265316/ -> https://gerrit.wikimedia.org/r/#/c/265486/
[13:57:54] nice!!
[13:59:22] now to stick that in SWAT and see if it flies
[14:09:31] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600
[14:13:13] bblack: morning!
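As a concrete illustration of the VCL forward-porting discussed at 12:59 (varnish3to4 plus manual fixes), here is a trivial Varnish 3 fragment next to its Varnish 4 equivalent; the header name is made up for the example:

    # Varnish 3
    sub vcl_recv {
        if (req.request != "GET") { return (pass); }
    }
    sub vcl_fetch {
        set beresp.http.X-Example = "v3";
    }

    # Varnish 4: req.request becomes req.method, vcl_fetch becomes
    # vcl_backend_response, and every VCL file must declare its version
    vcl 4.0;
    sub vcl_recv {
        if (req.method != "GET") { return (pass); }
    }
    sub vcl_backend_response {
        set beresp.http.X-Example = "v4";
    }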
vmods seem to be binary compatible, see T124281
[14:14:00] they are not strictly, they're just more conscious about their ABI nowadays
[14:14:32] ah right, you already pointed to that December thread
[14:16:43] yep, what I mean by "binary compatible" is that in general it seems to be OK to use a vmod built against e.g. 4.1.0 with 4.1.1
[14:19:20] yup
[14:19:39] they also provide a way to build without having the full source around
[14:19:47] just with libvarnishapi-dev or something IIRC
[14:20:37] right
[14:20:50] much better :)
[14:21:52] I couldn't find any vmods in debian yet, just an ITP: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=732041
[14:22:10] but packaging them does not seem particularly tricky
[14:22:18] well at least we can package->deploy them separately
[14:22:26] we'll just need to bump them for varnish-4.2 or whatever
[14:29:00] is squid still in use?
[14:30:13] in general?
[14:30:17] we still use it as a forward proxy
[14:30:35] but not as a reverse proxy anymore, no
[14:30:42] right, as a reverse proxy
[14:30:59] so all references to squid in MW are legacy I guess
[14:31:13] $wgHooks['TitleSquidURLs'][] = 'MobileFrontendHooks::onTitleSquidURLs';
[14:31:16] the names are
[14:31:20] the functionality is the same
[14:32:36] OK
[14:32:45] purges etc.
[14:33:13] but yeah, if you want to fix those references into some software-agnostic name (like CDN or something), I'm sure people would love that
[14:33:31] no one cares because it's just naming really, depends on your OCD levels :P
[14:34:09] for the time being I was just worried I'd missed something fundamental about the whole architecture and that there were squids somewhere doing stuff
[14:34:22] haha
[14:39:33] is there any wikitech page except for https://wikitech.wikimedia.org/wiki/Hardware_Disambiguation mapping hostnames to human-readable descriptions?
[14:40:46] there is a naming page
[14:40:55] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions
[14:41:37] OMG that's fantastic!
[14:42:16] thanks paravoid :)
[14:42:20] yw
[14:47:43] 10netops, 6operations: Upgrade JunOS on cr1/cr2-codfw - https://phabricator.wikimedia.org/T113640#1951766 (10faidon) 5Open>3Resolved a:3faidon All done!
[14:50:35] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951773 (10BBlack) I'm putting together 3x commits for review that I think will resolve this, they should show up below...
[14:56:52] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951981 (10BBlack) I think those 3 and then uncommenting the public after it's deployed and tested should do the trick. Needs review!
[15:16:29] ema: so conftool-data/nodes/codfw.yaml in the puppet repo
[15:17:27] cache_mobile stanza has:
[15:17:37] cache_mobile: cp2003.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx] cp2009.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx] cp2015.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx] cp2021.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx]
[15:17:43] ugh I hate when paste doesn't wrap
[15:18:04] anyways it has 4x lines like this for the existing cache_mobile machines in codfw:
[15:18:07] cp2003.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx]
[15:18:34] the "nginx" and "varnish-fe" services are what LVS hits. varnish-be and varnish-be-rand are used to configure varnish<->varnish inside a given cluster
[15:18:40] we're only wanting to mess with LVS
[15:19:13] so the idea would be to submit a patch with all the cache_text nodes added to the cache_mobile list there, but with their service lists set to just [varnish-fe, nginx]
[15:19:52] on puppet-merge->conftool-merge, that will add them to all the right etcd pools, defaulting to pooled=no, but won't affect varnish<->varnish (which is good!)
[15:20:36] then from there the way we ramp in a text machine in that mobile cluster is using confctl to set pooled=yes for e.g. dc=codfw,cluster=cache_mobile,service=nginx (and service=varnish-fe)
[15:21:03] the way we ramp out one of the original 4 mobile machines is using confctl to set pooled=no for service=nginx/varnish-fe
[15:21:20] that's basically our new equivalent for what we were doing with the textfiles the other day
[15:21:44] right
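A sketch of what that might look like end to end. The extra hostname below is invented, and the confctl flags are an assumption about the conftool CLI of the time (check confctl --help on the installed version):

    # conftool-data/nodes/codfw.yaml: a text host joins cache_mobile,
    # but only for the LVS-facing services
    cache_mobile:
      cp2003.codfw.wmnet: [varnish-fe, varnish-be, varnish-be-rand, nginx]
      cp2099.codfw.wmnet: [varnish-fe, nginx]   # hypothetical text host

then, after puppet-merge/conftool-merge:

    # ramp the text host in...
    confctl --tags "dc=codfw,cluster=cache_mobile,service=nginx" --action set/pooled=yes cp2099.codfw.wmnet
    confctl --tags "dc=codfw,cluster=cache_mobile,service=varnish-fe" --action set/pooled=yes cp2099.codfw.wmnet
    # ...and ramp an original mobile host out
    confctl --tags "dc=codfw,cluster=cache_mobile,service=nginx" --action set/pooled=no cp2003.codfw.wmnet
    confctl --tags "dc=codfw,cluster=cache_mobile,service=varnish-fe" --action set/pooled=no cp2003.codfw.wmnet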
[15:22:32] then if we really wanted to sync the state to puppet "for good", we might want to just remove the [nginx, varnish-fe] from the legacy mobile machines in cache_mobile in puppet I guess
[15:23:36] all of this is just reaching a temporary state anyways, but we might wait several days in that temporary state with all the datacenters switched, before taking steps that are more painful to revert.
[15:23:55] in the long term, we'll just drop all of cache_mobile's definitions everywhere and add the mobile IPs as IPs for cache_text
[15:27:19] (and since pybal never deletes a service, that's going to require some handholding when we deploy that to LVS)
[15:28:29] (probably we'll submit a patch to kill LVS's definition of cache_mobile and move the IPs to cache_text, which doesn't take effect until pybal restart. disable puppet on the LVS, deploy on backup LVS first then the primaries, and the deploy is like "puppet; stop pybal; manually delete cache_mobile from ipvsadm; start pybal; verify")
[15:29:12] probably some background on how the LVS clusters work is in order there:
[15:29:22] speaking within one datacenter, say codfw:
[15:29:36] there are 3x "traffic classes": high-traffic1, high-traffic2, low-traffic
[15:30:06] you can see refs to the classes in the lvs_service defs in hieradata/common/lvs/configuration.yaml, which sets which class that service is in
[15:30:34] and then in modules/lvs/manifests/configuration.pp you can see definitions of which traffic class applies to which lvs hostnames
[15:30:49] e.g.
[15:30:49] 'high-traffic1' => $::realm ? {
[15:30:49] 'production' => $::site ? {
[15:30:49] 'eqiad' => [ 'lvs1001', 'lvs1004', 'lvs1007', 'lvs1010' ],
[15:30:52] 'codfw' => [ 'lvs2001', 'lvs2004' ],
[15:31:02] so in codfw high-traffic1 is lvs2001 and lvs2004
[15:31:17] in the normal state, both are running pybal with identical service definitions and public IPs
[15:31:39] pybal speaks BGP to our local juniper routers to advertise the public IPs and get them routed into its machine
[15:31:57] with both up and speaking BGP, the routers are going to prefer the lower-numbered one (lvs2001 over lvs2004)
[15:32:24] but if lvs2001 dies or stops speaking BGP, the route to lvs2004 (from its own BGP advert) becomes active and all the high-traffic1 traffic starts flowing through lvs2004.
[15:32:45] if for some reason BGP breaks or pybal stops on both, the routers have a static fallback to use the primary (lvs2001)
[15:32:56] and when pybal stops, it doesn't delete the services it created, either
[15:33:23] so in theory if pybal happened to crash out and not restart on both, the ipvs services would still be defined and the router would keep routing traffic through lvs2001
[15:34:19] so for any kind of change to the LVS config (to add/delete services, etc), when pybal needs a stop->start to pick up the config change, that's going to flip traffic from 2001->2004 at least briefly
[15:34:44] if it's fairly fast, the impact is fairly minimal, but it's not something we want to do very often heh
[15:35:11] so we'd deploy the change on lvs2004 first, and once its pybal is up and verified that ipvsadm output looks correct, etc...
[15:35:39] then we quickly do the same on lvs2001, and while pybal is stopped on lvs2001, lvs2004 has the traffic, until we restart pybal on lvs2001
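A rough shell sketch of that rollout sequence, run on the backup balancer (lvs2004) first and then the primary (lvs2001); the VIP is a placeholder and the exact puppet/service invocations are assumptions, not a recorded procedure:

    # on both balancers, hold puppet until the change is merged
    puppet agent --disable "cache_mobile LVS removal"

    # then, per balancer (backup first):
    puppet agent --enable
    puppet agent -t                  # writes the new pybal config
    service pybal stop
    # pybal never deletes services, so drop the stale cache_mobile
    # virtual service by hand (placeholder VIP shown)
    ipvsadm -D -t 208.80.153.x:80
    service pybal start
    ipvsadm -L -n                    # verify the expected services are present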
[15:42:23] this is all going straight to my notes :)
[15:43:24] eqiad having twice as many LVS as codfw is a "temporary" thing that's been going on a while
[15:43:36] it normally has lvs1001-1006, just like codfw has lvs2001-2006
[15:44:09] we're deploying new machines to replace that old hardware. the new eqiad LVS are lvs1007-lvs1012. and the idea is to get running on those and then decom/reclaim lvs1001-1006
[15:44:23] in the interim they're both configured, but lvs1007-lvs1012 have their BGP->routers disabled
[15:44:53] there's a ticket for that transition at https://phabricator.wikimedia.org/T104458
[15:45:04] and we've been blocked for a while there on https://phabricator.wikimedia.org/T112781
[15:45:21] which is that there's some problem with one of the switches that two of the new LVS talk to, and we haven't gotten to the bottom of it
[15:46:05] nobody's really had time to dig into it hard enough, it's a PITA problem and other stuff has taken priority
[15:53:54] ema: also I stuck the mobile purging fix cherry-pick in the morning SWAT deploy, so it should/maybe get deployed as a cherry-pick on 1.27-wmf11 during that window, coming up on the hour
[15:53:58] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600
[15:54:19] group0 and group1 wikis are already on wmf11 so they'll get it then
[15:54:35] group1 wikis move to wmf11 during the train deploy shown 2h later there
[15:54:55] <_joe_> I usually have to cherry-pick my changes to several branches, if it was just one, consider yourself lucky :)
[15:55:00] at which point we should have good mobile purge everywhere. we can confirm that by watching the vhtcpd traffic and seeing the right URLs coming through
[15:55:09] and then we can try proceeding again
[15:55:52] although it might be late in EU time by then, so maybe wait for early Friday, just re-do codfw then, and then start the others on Monday?
[15:56:50] bblack: let's see how late it will be, but yes early Friday at the latest and the others next week
[15:57:01] ok
[15:57:55] _joe_: lucky timing :)
[15:58:04] <_joe_> yup
[15:58:44] could've cherry-picked to wmf10 too I guess to skip the 2h wait, but I figure that gives a little softness to push out this patch anyways
[15:58:53] time to notice it breaks things unexpectedly on group0/1
[16:00:19] what does group0/1 mean? :)
[16:00:40] <_joe_> AH!
[16:00:41] looking for a wikitech link or something
[16:00:46] <_joe_> "smaller fishes"
[16:00:49] try https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap#Schedule_for_the_deployments
[16:00:59] so group0 is MediaWiki.org; test.wikipedia.org; test2.wikipedia.org; test.wikidata.org; zero.wikimedia.org
[16:01:07] <_joe_> so we first release a new version to the smallish wikis
[16:01:10] group1 is All non-Wikipedia sites
[16:01:11] (Wiktionary, Wikisource, Wikinews, Wikibooks, Wikiquote, Wikiversity, Wikivoyage, and a few other sites)
[16:01:17] group2 is All Wikipedias
[16:04:49] so at the moment group0/1 already Do the Right Thing when it comes to purges?
[16:08:59] ema: they will after my patch cherry-pick is deployed to wmf11
[16:09:10] which is happening during the ongoing SWAT in -operations probably
[16:09:33] * ema goes to -operations with popcorn
[16:10:19] ema: if you scroll back to 16:00, you can see jouncebot msgs x2 announcing the SWAT
[16:10:33] it's our process for rapid deploy of little fixes, etc to MW and related
[16:13:38] ema: the list of people in the first jouncebot msg "< jouncebot> anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT"
[16:14:13] are the developers that are possibly on duty. one of them that's available takes control of the SWAT process and runs through the listed patches with the submitters around to help
[16:14:20] in this case thcipriani is running it
[16:18:57] bblack: and I guess changes are deployed in order, as in yours will be pushed out after kart_'s?
[16:19:11] probably mine will be last yeah
[16:19:14] they do skip around sometimes
[16:19:34] and there's a max of 8 patches, so if there's already 8 listed, you don't get to be in the window, pick a later one :)
[16:24:31] and that 16:22 message, that's what gets generated into the channel when the SWAT person runs the commands to sync the code change
[16:31:37] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1952355 (10greg) Thanks @bblack
[16:38:36] ema: so to confirm how the purges look for group1, I'm running:
[16:38:37] root@cp1065:~# varnishlog -c -n frontend -m RxRequest:PURGE -m 'RxHeader:Host:.*wikivoyage.*'|egrep -i 'Host:|RxURL'
[16:39:43] you can see in that output the pairs like:
[16:39:43] 38 RxURL c /wiki/%D7%AA%D7%9C_%D7%90%D7%91%D7%99%D7%91-%D7%99%D7%A4%D7%95/%D7%9E%D7%96%D7%A8%D7%97_%D7%94%D7%A2%D7%99%D7%A8
[16:39:47] 38 RxHeader c Host: he.wikivoyage.org
[16:39:49] 38 RxURL c /w/index.php?title=%D7%AA%D7%9C_%D7%90%D7%91%D7%99%D7%91-%D7%99%D7%A4%D7%95/%D7%9E%D7%96%D7%A8%D7%97_%D7%94%D7%A2%D7%99%D7%A8&action=history
[16:39:52] 38 RxHeader c Host: he.wikivoyage.org
[16:39:55] which is the 2x expected purges on an article edit
[16:40:00] it should become 4x and include mobile hostnames when the change hits
[16:41:28] nice
[16:41:52] and if that works, the group2 deploy will show the same stuff for en.wikipedia.org then
[16:42:25] right, when the train deploy happens at 19:00-20:00 UTC
[16:42:35] err 19:00-21:00 UTC
[16:46:58] ok so as I was saying in -ops, apparently group1 isn't yet affected
[16:47:06] so I switched to checking mediawiki.org since that's in group0
[16:47:14] root@cp1065:~# varnishlog -u -c -n frontend -m RxRequest:PURGE -m 'RxHeader:Host:.*mediawiki.*'|egrep --line-buffered -i 'Host:|RxURL'
[16:47:18] -> the lines I pasted in -ops
[16:47:51] yes I'm following in -ops :)
[16:48:07] it seems to be working on group0 then
[16:48:23] yup
[16:48:31] \o/
[16:49:06] not sure what that means precisely for group1/2 schedule
[16:49:20] maybe group1 at the next train, and group2 delayed. or maybe they'll push both today.
[16:49:54] in any case, once we can confirm the change has hit group2, we can continue on with re-doing codfw mobile->text :)
[16:51:14] do the changes get applied to all mw machines at the same time in group2?
[17:07:25] approximately :)
[17:07:47] I'm out for lunch, bbl :)
[17:24:37] 10Traffic, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 6operations, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1952512 (10BBlack) ^ So the fix is in 1.27.0-wmf.11, which is on group0 so far. When it reaches group1 and group2 as well, we can resolve...
[17:29:40] bblack: when it comes to gerrit repos such as operations/debs/varnish4, what type of workflow are we going to follow? One git review per change? Or can one just push directly to the repo?
[18:41:07] ema: it's really up to us. I'd say if we're importing a long series of commits from upstream, just push direct
[18:41:26] and then push to refs/for/whatever for gerrit-review on the local commit that inevitably comes after for our local -wmfN version bump
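A sketch of those two push modes against the operations/debs/varnish4 repo (branch names illustrative; the direct push assumes the repo grants push rights):

    # importing a long series of upstream commits: push directly
    git push origin master

    # the local -wmfN version-bump commit goes through review instead
    git push origin HEAD:refs/for/master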
[18:44:45] 10Traffic, 6operations: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1952810 (10ema) Patches marked as forward-ported are available on the Varnish 4 WMF repo on Gerrit: https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/varnish4 - [X] 0010-varnishd-cache_dir...
[18:53:02] wikibugs called it a day! :)
[19:00:14] does upload.wikimedia.org not go through apache?
[19:00:54] ema, yeah, labs restarts killed it I guess
[19:04:14] Krenair: right, it doesn't
[19:04:43] upload.wm.o hits nginx->varnishes->swift
[19:04:52] and then swift handles interaction with imagescalers
[19:04:57] basically...
[19:04:59] but upload.beta.wmflabs.org... does? :|
[19:05:20] maybe it doesn't have swift?
[19:05:34] I really don't know what upload.beta's config is like
[19:05:51] I guess there would be an apache between prod swift and prod imagescalers, right?
[19:11:27] swift runs on those ms-[fb]e machines right? don't think I've seen any of them in beta
[19:11:49] the config is... not great: https://gerrit.wikimedia.org/r/265526
[19:12:18] no idea re. apache between swift+imagescalers
[19:12:39] I mean, I'd assume there's apache on the imagescalers, which is frontending the requests into the imagescaler stuff
[19:12:54] and yeah swift is ms-[fb]e
[19:13:43] swift is, effectively, just a very big cache; it wouldn't be necessary to have it in beta to make upload.beta work, I think
[19:14:54] right, upload.beta certainly works somehow
[22:45:04] ema: 22:24 < logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.11
[22:45:27] I should probably stop uploading apache changes until I can find people to review them :/
[22:45:39] ema: so as of now, I can confirm the purges look right even for group2 (e.g. enwiki), but it's late and they still seem to be dealing with trailing mess that maybe could end up with a revert, who knows
[22:46:00] ema: so I'm not touching anything, we can start back up early tomorrow. maybe early tomorrow your time if I can wake up on time.
[22:51:21] Krenair: puppetswat?
[23:00:38] I dislike almost all of my puppet patches having to go through puppetswat when it only happens twice a week, but yeah
[23:00:43] suppose I'll schedule it for tuesday
[23:02:19] speaking of puppet swat, the calendar should probably be updated to list the correct names of the ops involved
[23:03:31] added 5 to the schedule, might add more later. I suppose I should avoid filling it up in case anyone else wants to get stuff done