[00:34:18] 7Varnish, 10MediaWiki-Vagrant: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#1954287 (10Mattflaschen) 3NEW [04:11:32] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1954591 (10Volker_E) +1 WFM too! Awesome, thanks all people involved! > time GIT_SSH_COMMAND="ssh -v" git clone ssh://vcs@git-ssh.wikimedia.org/... [06:09:21] 7HTTPS, 6Analytics-Kanban, 6operations, 5Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1954687 (10leila) @Ottomata I checked couple of tables that I knew and the diversity of hashed IPs looks healthy. I also looked at two of Tilma... [10:09:48] ema: ping [10:10:29] so, 1.27.11 stuck on all the groups, we're good to go for codfw this morning [10:17:02] _joe_ has only switched ulsfo to the new etcd stuff so far, so actually we're still on the exact same file-based method for this initial codfw thing [10:17:49] but since I did make that change to VCL's hashing, I should re-check that it works manually first :) [10:18:04] <_joe_> bblack: if you want, we can migrate codfw as well [10:20:12] re-checking [done] [10:20:22] _joe_: yeah might as well [10:21:41] all at once or 2 by 2? :) [10:21:56] seems like ulsfo went pretty smooth [10:22:35] well 3 by 3 in codfw's case heh [10:22:55] the low-traffic config is a new thing and a little bigger though [10:27:05] _joe_: https://gerrit.wikimedia.org/r/265704 + https://gerrit.wikimedia.org/r/265705 [10:27:33] <_joe_> yeah looking :P [10:27:45] <_joe_> I'm doing 4 things in parallel, my scheduler is almost jammed [10:27:49] :) [10:29:22] <_joe_> bblack: actually I was thinking, we should use conf1002 for backups [10:29:30] <_joe_> and conf1001 for actives [10:29:46] <_joe_> it doesn't make sense to connect to the same host everywhere [10:30:15] <_joe_> (pending me finding the time to write a reconnection logic into pybal's etcd driver) [10:32:10] hmmm [10:32:19] yeah so it's been a while since I've looked at that stuff [10:32:38] <_joe_> bblack: it's ok for now anyways [10:32:51] I mean as far as the implications go [10:33:17] I guess if conf1001 died and couldn't be brought up quickly, and we needed to make a change, having backups on conf1002 would let us stop pybal on the primaries and still make changes? [10:33:38] <_joe_> yes [10:33:57] ok [10:34:06] <_joe_> but well, I'd prefer to make pybal able to know all servers available, and then cycle through them in case of failure [10:34:18] well sure [10:34:53] we could do it other ways, too [10:35:09] but I don't have a clear picutre what our long term multi-dc plan is for etcd [10:35:41] but we could have gdnsd do it even in one DC (it does things other than geoip) [10:35:50] I would say LVS could do it, but that gets circular heh [10:36:23] but gdnsd could be configured to poll http on confd100[12] and serve round-robin pair of IPs, and drop one from the set if 1/2 dies [10:36:34] for confd.svc.eqiad.wmnet or whatever [10:37:02] <_joe_> bblack: we already have a SRV record [10:37:11] <_joe_> we just have to make pybal support that :P [10:37:18] ok :) [10:37:21] <_joe_> as every other python app that connects to etcd does [10:37:25] <_joe_> and confd as well [10:37:48] well with the gdnsd-based thing, we could go beyond SRV though [10:37:58] <_joe_> oh, what do you mean? [10:38:22] <_joe_> having gdnsd do healthchecks? 
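
A minimal sketch of the SRV-based discovery _joe_ describes above, roughly as a Python client such as pybal could implement it (assuming dnspython is available; the record name and the fallback helper are illustrative, not the actual setup):

    # Resolve etcd endpoints from a SRV record and try them in order.
    # The record name below is a placeholder, not the real one.
    import dns.resolver

    def etcd_endpoints(srv_name='_etcd._tcp.eqiad.wmnet'):
        """Return (host, port) pairs ordered by SRV priority, then descending weight."""
        answers = dns.resolver.query(srv_name, 'SRV')  # dnspython >= 2.0 prefers resolve()
        ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))
        return [(str(r.target).rstrip('.'), r.port) for r in ordered]

    def connect_with_fallback(endpoints, connect):
        """Try each endpoint until one works -- the 'cycle through them on failure' idea."""
        last_exc = None
        for host, port in endpoints:
            try:
                return connect(host, port)
            except OSError as exc:  # a real client would catch its own connection errors
                last_exc = exc
        raise last_exc
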
[10:38:41] well yeah the thing I mentioned above is already doing healthchecks [10:39:19] <_joe_> the ideal would be having the srv record just reporting whichever servers are up atm [10:39:21] but I mean, we could have a pair of confd listener IPs at each DC (when we have x-dc etcd sync somehow), and have gdnsd healthcheck them all from each authdns, and serve only the healthiest/closest one [10:39:44] so ulsfo clients get 1-2 healthy ulsfo IPs, or if both ulsfo seem to be offline they get eqiad IPs, etc... [10:40:08] <_joe_> as for multi-dc, I'm still heavily undecided. on one side, I want to try the 2-2-1 cluster in eqiad/codfw/ulsfo [10:40:27] <_joe_> OTOH, I think that's a pretty complex setup [10:41:05] etcd isn't really multi-dc aware right? [10:41:12] <_joe_> no [10:41:24] <_joe_> raft in general doesn't play well with uneven topologies [10:41:27] even just 2x DCs fixes a lot of problems, eqiad+codfw or whatever [10:41:39] <_joe_> I'm thinking of "stealing" the etcd replication agent I wrote when at JOB~1 [10:41:44] and eqiad+codfw are the pair that are best-connected and most-critical anyways, raft could be ok over that link [10:41:46] <_joe_> where I had active/backup [10:42:00] <_joe_> bblack: we need a tiebreaker though [10:42:05] <_joe_> or we risk split-brains [10:42:23] too bad we don't have a misc machine at eqord :) [10:42:31] <_joe_> heh! [10:42:36] <_joe_> yeah that would be ideal [10:44:14] well if you had total contol over etcd's implementation, you could implement naive tiebreakers too [10:44:37] as in "in case of 50%, tiebreak by succeeding at pinging this IP" or whatever [10:45:22] (where the IP would be cr1-eqord) [10:45:31] morning! [10:45:35] hey ema [10:45:35] <_joe_> hi ema [10:45:49] <_joe_> you managed to wake up later than brandon? :P [10:46:12] I set an early alarm so I could figure out if we're ready to go for an early friday codfw mobile switch heh [10:46:15] <_joe_> well, bblack has been horribly early today :) [10:46:20] because I hate doing things like that later friday [10:46:24] haha no, I had stuff to do this morning :) [10:46:55] <_joe_> ema: no one would be upset, we've hired YuviPanda so that anyone's life cycle seems sane in comparison [10:48:28] ok I'm gonna double-check the codfw confd -vs- files and see how lvs200[456] go [10:49:15] _joe_: also there's this on puppet-merge lately: [10:49:15] Now running conftool-merge to sync any changes to conftool data [10:49:15] Running conftool-sync on /etc/conftool/data [10:49:15] WARNING:conftool:Setting datacenters to the default value ['eqiad', 'codfw'] [10:49:18] WARNING:conftool:Setting default_values to the default value {'pooled': 'no', 'weight': 0} [10:49:21] WARNING:conftool:Service citoid not found, skipping [10:49:24] I think someone messed up a config, probably citoid, in puppet? [10:49:40] it happens on every run [10:51:06] are we going to use the file-based mechanism in codfw or conftool? 
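
And a rough sketch of the gdnsd-based variant discussed above, using gdnsd's built-in monitoring plus an address-failover plugin; the syntax is approximate and the port, URL path and addresses are placeholders rather than the real conf100x setup:

    # daemon config: healthcheck the confd/etcd hosts over HTTP
    service_types => {
        confd_http => {
            plugin => http_status,
            port => 8000,         # placeholder
            url_path => /health,  # placeholder
            interval => 10,
            timeout => 3,
        }
    }

    # multifo: answer with all addresses that are currently healthy,
    # i.e. the "round-robin pair, drop one if it dies" behavior
    plugins => {
        multifo => {
            confd-eqiad => {
                service_types => confd_http,
                conf1001 => 192.0.2.11,  # placeholder addresses
                conf1002 => 192.0.2.12,
            }
        }
    }

The zonefile side would then be a DYNA record pointing at that resource, something like "confd 300 DYNA multifo!confd-eqiad"; the cross-DC "healthiest/closest" variant would presumably layer the geoip/metafo plugins on top of per-DC resources like this one.
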
[10:52:34] <_joe_> bblack: that's someone who misconfigured conftool-data [10:52:47] <_joe_> let me look in a few [10:53:31] <_joe_> so I bet they added servers to a citoid service on cluster scb [10:53:33] ema: conftool, but not done switching pybal to it yet [10:53:37] <_joe_> but didn't add the service stanza [10:54:37] codfw backup LVS confirmed the easy way [10:55:08] the easy way being: puppet agent -t; ipvsadm -Ln >x; service pybal restart; sleep 10; ipvsadm -Ln >y; diff -u x y [10:55:40] <_joe_> bblack: well I also did some magic to exclude the connections numbers [10:55:48] <_joe_> or that is hardly helpful [10:55:50] well on backup LVS they're all zero :) [10:55:55] <_joe_> ahha ok [11:10:31] hmm I had a diff on lvs2003, for 10.192.32.61 ( mw2173 ) [11:10:42] I'm guessing commented out on palladium, but conftool not updated [11:11:18] hmm, nope [11:12:19] <_joe_> uhm what is that about? [11:12:23] apparently mw2173 was not in the backends list on lvs2003 before the change, but is now. It was there before+after on lvs2006, and the file isn't recent and has it [11:12:41] <_joe_> so I guess an ipvsadm failure? [11:12:50] <_joe_> every machine has been rebooted yesterday [11:12:55] I picked up a diff line [11:12:56] +-> 10.192.32.61:80 Route 10 [11:12:58] <_joe_> maybe ipvsadm failed to add it back [11:13:03] maybe! [11:16:25] the rest all checked out, all codfw lvs on etd now [11:16:31] *etcd [11:16:53] <_joe_> \o/ [11:17:02] <_joe_> we should probably !log it [11:17:49] <_joe_> so, how are you going to proceed with mobile? [11:18:31] so basically, we're going to add lines to codfw's cache_mobile for all the cache_text machines, but only for services [nginx, varnish-fe], not be and be-rand [11:18:44] and puppet that out, which will add them all with pooled=no and do nothing in practice [11:19:02] then to ramp in a text node, set pooled=yes for those two services for that text hostname [11:19:19] to ramp out a legacy mobile no, set pooled=no for the mobile node only for those two servers (not be/be-rand) [11:19:27] s/no,/node,/ [11:19:50] I'm kinda assuming it's ok for a machine to have it is [11:20:16] <_joe_> sorry, but why just the frontend? [11:20:33] we're remapping this at the LVS layer, but at the varnish level the two clusters are still distinct [11:20:39] <_joe_> oh! [11:20:52] <_joe_> so LVS is shared, the backend varnish is not [11:20:52] the state we're moving towards here is a temporary one [11:20:55] <_joe_> wicked! [11:21:01] it's not shared, it's moving completely [11:21:16] <_joe_> sorry, bbiab [11:21:22] <_joe_> I have to deploy a mediawiki change [11:21:24] at the end of this process, at the LVS layer cache_mobile and cache_text will have the same node lists [11:21:27] ok [11:21:47] (and then when we're comfortable, we can kill cache_mobile and just have its IPs be another set of IPs for cache_text) [11:22:10] "the same node lists" above being just the current text machines [11:22:37] bblack: so is one of the advantages of conftool vs. text file that we can have machines with pooled=no? 
[11:22:42] but cache_mobile at the LVS layer temporarily has both text and mobile machines while doing the switch, to soften the cache-miss blow [11:23:03] ema: no, we can depool from the files too [11:23:18] there's no functional change in that sense [11:23:50] it just gets all of this under one tool with a commandline [11:24:07] in the distant past, we had completely separate tools for pool/depool of cache machines at different levels: [11:24:13] 1) the files on palladium for pybal/LVS [11:24:15] ah, enabled: True/False? [11:24:45] 2) lists of nodes in puppet for varnish<->varnish, which required puppet runs to effectively add/delete/pool/depool/re-weight, etc [11:25:26] so "completely disable cp1065" meant disabling/removing in the file on palladium for LVS, and committing a nodelist change to puppet, and making sure puppet ran on all the relevant machines, etc [11:25:52] in the new world order, both are hooked up to etcd [11:26:14] you can use confctl to make etcd change for pooled/weight for both LVS and varnish purposes (and any others, it's a generic service) [11:26:34] but the varnish half switched over first, now we're finally switching over the LVS/pybal part [11:26:45] aha! [11:27:20] if you touch the pooled status on the services varnish-be or varnish-be-rand, that triggers a templated VCL change + VCL reload on the relevant machines to affect the varnish backend lists, basically. [11:28:21] and just recently, confctl gained a --find option too [11:28:38] so now we can with a single command depool all services for host cp1065.eqiad.wmnet too [11:29:02] next stop is hook that into shutdown scripts for the host, and/or for each service, too [11:29:51] if we integrate with the service init scripts / unit files, we probably want that to be smart-ish [11:29:57] so that it doesn't interfere with local manual hacks [11:29:59] uh so if you eg. stop varnish the machine gets depooled? [11:30:06] yeah [11:30:27] we might want some file in /etc to control that though, for when someone's trying to be smarter than the tools manually [11:30:50] * ema nods [11:30:58] e.g. if you want to manually depool a node and then restart varnish a few times testing something, you don't want it self-repooling [11:31:19] and we probably don't want nodes self-pooling on a fresh boot by default either [11:32:10] but it seems like depool-on-stop for services, and/or depool-all-on-shutdown, are universally good ideas [11:32:31] <_joe_> bblack: that's usually done with piggyback service units [11:32:39] <_joe_> that you can mask/unmask at will [11:32:43] ah [11:32:55] I was just assuming ExecStartPre= and such [11:33:29] anyways, so on with the show! [11:34:36] ema: so, puppet repo, conftool-data/nodes/codfw.yaml [11:34:45] yes!
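
A hypothetical sketch of what the depool-on-stop hook could look like as a systemd drop-in (the unit name and guard-file path are invented, and the confctl --find form is the new option mentioned above; _joe_'s piggyback-unit approach would keep this out of the main unit entirely):

    # e.g. /etc/systemd/system/varnish-frontend.service.d/pool.conf (hypothetical)
    [Service]
    # repool on a clean start, unless an operator left a "hands off" marker file
    ExecStartPost=/bin/sh -c 'test -e /etc/conftool-no-autopool || confctl --find --action set/pooled=yes "$(hostname -f)"'
    # depool whenever the service stops; a real version would probably scope this
    # to just the affected service rather than every service on the host
    ExecStopPost=/bin/sh -c 'test -e /etc/conftool-no-autopool || confctl --find --action set/pooled=no "$(hostname -f)"'
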
[11:34:55] cache_mobile stanza [11:34:57] basically copy all the cache_text nodes to cache_mobile, but reduce the service list in cache_mobile for them to just [nginx, varnish-fe] [11:35:34] merging that through shouldn't affect ipvsadm yet since by default conftool-sync adds new nodes in the depooled state [11:36:02] <_joe_> bblack: that depends on what you wrote in the service defs, though [11:36:08] well true :) [11:36:14] <_joe_> (and yes, it's depooled by default for all caches) [11:36:19] right, and we can pool them one by one afterwards [11:36:33] right [11:36:53] <_joe_> oh jesus, every reboot cycle in codfw renders dead appservers [11:37:35] ema: and on palladium, the confctl command to pool up cp2001 is: [11:37:38] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes cp2001.codfw.wmnet [11:37:46] confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-fe --action set/pooled=yes cp2001.codfw.wmnet [11:37:58] which flips on port 443 and port 80 respectively [11:38:03] <_joe_> also add the weight [11:38:17] <_joe_> confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes:weight=10 [11:38:25] oh right, I forgot we used weights last time [11:38:32] we have default weights, but we mucked with them on codfw last time around [11:38:34] for the first machine we did [11:38:57] anyways, same commands with set/weight=X to do weights [11:39:15] we should probably do like last time and bump them to 10 and ramp in the first one, etc [11:39:28] <_joe_> which machine did you add? [11:39:38] when? [11:39:48] none yet, there's a puppet commit to go yet [11:39:54] <_joe_> ahhh ok [11:40:49] ema: another helpful command: [11:40:50] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get 're:.*' [11:40:59] will list them all, with linebreaks [11:42:37] oh, no notifications from grrrit-wm here? [11:42:59] https://gerrit.wikimedia.org/r/265710 [11:43:16] yeah no grrrit-wm here :) [11:43:28] alright, we'll survive [11:43:43] should I move https://phabricator.wikimedia.org/T109286 to In Progress? [11:44:07] isn't https://phabricator.wikimedia.org/T124165 now done? [11:44:17] yeah just fixed that one :) [11:44:18] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1955124 (10BBlack) [11:44:21] 10Traffic, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 6operations, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1955121 (10BBlack) 5Open>3Resolved a:3BBlack [11:46:22] ema: the workboard columns are a mess :) [11:46:32] they're mostly wishful thinking these days! :) [11:46:42] cool, then I'll move it! :) [11:47:16] Hi bblack, Just saw that blocking tasks for 'mobile cache merge into text' were done ... Will the move happen early necxt weeke? [11:47:33] joal: it's starting today, but it won't finish until sometime next week [11:48:02] Wow, starting on a Friday, you guys don't value your weekends enough ;) [11:48:23] ok bblack [11:48:38] I'll monitor on out side [11:48:42] Thanks for the update [11:48:47] well we're starting early on a friday [11:48:59] you won't see much change, we're only flipping the smallest dc traffic-wise [11:49:05] yup [11:49:07] bblack: do we need to set weights for service=varnish-fe as well? [11:49:17] it's probably enough to set it on nginx alone right? 
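
The conftool-data change being described, sketched out (the exact layout of conftool-data/nodes/codfw.yaml may differ; hosts and service names are the ones from this discussion):

    codfw:
      cache_text:
        cp2001.codfw.wmnet: [nginx, varnish-fe, varnish-be, varnish-be-rand]
        # ... the other text hosts, unchanged ...
      cache_mobile:
        # legacy mobile hosts keep their full service list for now
        cp2003.codfw.wmnet: [nginx, varnish-fe, varnish-be, varnish-be-rand]
        # text hosts are added here with only the LVS-facing services,
        # so the two varnish backend clusters stay distinct
        cp2001.codfw.wmnet: [nginx, varnish-fe]
        # ... the remaining text hosts likewise ...
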
[11:49:20] When I see changes, it means the end of it is soom [11:49:23] ema: yeah basically everything we do today has to be done 2x for varnish-fe + nginx [11:49:35] ema: oh weights, yeah [11:49:44] I guess that doesn't matter since it's mostly redirects [11:50:05] I mean I don't mind setting them everywhere, I'm just trying to understand :) [11:53:45] <_joe_> joal: well, at least if the pain on the weekends is self-inflicted, it's justified [11:54:12] <_joe_> last weekend I worked for the whole saturday morning basically, and was on vacation from friday to monday :P [11:54:12] I can understand _joe_ :) [11:54:26] <_joe_> and... it wasn't our fault, so... [11:54:39] right [11:55:52] _joe_: thanks for updating https://wikitech.wikimedia.org/wiki/Conftool [11:56:32] <_joe_> ema: we should keep adding things, please take the time to do it or ask me if something is not clear [11:56:42] <_joe_> I'm notoriously bad at documentation [11:56:52] <_joe_> basically exposes my lazyness [12:00:06] alright! 265710 merged, can I go ahead and puppet-merge ; conftool-merge? [12:03:44] bblack: ^ Should I also set all weights to 10 in dc=codfw,cluster=cache_text,service=nginx? [12:04:04] not cache_text. nothing we do is in cache_text [12:04:10] but yes [12:06:14] ERROR:conftool:load_node Backend error while loading node: Backend error: The request requires user authentication : Insufficient credentials [12:06:20] when running puppet-merge [12:06:38] bblack: and yes sorry about the cache_text question, I got confused :) [12:07:26] ema: in a root "sudo -i" session on palladium, right? [12:07:33] I've never run it any other way [12:07:47] as root, yes [12:07:57] as in 'sudo puppet-merge' [12:09:12] bblack: https://phabricator.wikimedia.org/P2517 [12:09:14] that's different [12:09:40] I'm running a puppet-merge now in any case [12:09:50] it's syncing your data fine [12:10:00] don't use "sudo cmd", use "sudo -i" to get a real root shell, then run cmd [12:10:26] sounds good [12:18:09] bblack: let me know when I can start pooling [12:22:06] <_joe_> ema: yeah that's a known problem I should fix [12:22:31] <_joe_> I'm evaluating removing automatic run of conftool-merge from puppet-merge [12:22:37] <_joe_> while I fix it [12:25:08] ema: so jsut to confirm, we're at weight=10 on the existing pooled mobile machines, weight=1 on all the new ones, and you're going to pool cp2001 w/ weight=1 [12:25:11] right? [12:25:33] correct [12:29:43] bblack: confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes:weight=1 cp2001.codfw.wmnet [12:29:58] and same story for service=varnish-fe [12:30:27] I'll !log and start with cp2001 as soon as you give me your OK [12:30:33] ok! [12:32:06] <_joe_> ema: http://config-master.wikimedia.org/conftool/codfw/mobile-https and http://config-master.wikimedia.org/conftool/codfw/mobile give a decent idea of the situation [12:32:26] <_joe_> (might not be very freshly updated though) [12:32:53] cp2001.codfw.wmnet: pooled changed no => yes [12:33:14] <_joe_> { 'host': 'cp2001.codfw.wmnet', 'weight':1, 'enabled': True } in that file :) [12:33:30] if everything is fine, in 5 minutes (or longer?) 
I'll bump the weight to 5 [12:33:49] yeah I'd say wait 5 minutes go to 5, wait 5 minutes go to 10 [12:33:59] and then once again we'll pause there with just 1 machine for a while and let caches fill [12:34:16] alright [12:34:48] <_joe_> ema: welcome to the WMF: http://45.media.tumblr.com/5aa1c685be2bcaad16ebeb3944e11847/tumblr_n2zkwrWLvQ1qcj7x4o1_500.gif [12:34:52] _joe_: would be nice to sort those by enabled, weight [12:35:10] <_joe_> ema: that's just an horrible go text/template [12:35:29] <_joe_> if you want something better, you can query the http interface of pybal on a relevant LVS host [12:35:37] <_joe_> curl :9090/pools [12:38:53] cp2001.codfw.wmnet: weight changed 1 => 5 [12:40:40] _joe_: confctl output does not mention the dc, might be nice to add. eg: cp2001.codfw.wmnet: weight changed 1 => 5 (dc=codfw,cluster=cache_mobile,service=nginx) [12:43:20] <_joe_> ema: I agree, would you mind opening a ticket? [12:43:37] will do, migration first :) [12:43:53] cp2001.codfw.wmnet: weight changed 5 => 10 [12:45:18] bblack: cp2001 done, let's wait for caches to fill up [12:46:32] ema: yeah plus we should pause anyways on what's going on in -sec [12:48:05] <_joe_> yup [12:48:11] <_joe_> SNAFU [13:05:27] ema: aside from all the general wtf going on, there's a risk they may still roll back from .11 to .10, which would undo the mobile purge fixes as a side-effect [13:05:48] but ori's merging a cherrypick to .10 now, so that should insure us against a rollback [13:20:22] bblack: that's a very effective tl;dr [13:33:49] "all the general wtf"? :) [13:37:03] yeah that too [13:37:42] but no I was refering to: they might rollback from .11 to .10 and we would be screwed. Merging a cherrypick to .10 though so we're good. [13:38:19] so yeah we can continue on our own schedule now basically [13:38:27] we kinda know what the other wtfs are approximately [13:38:50] s/we/bblack/ [13:39:29] as in, I've tried to follow the discussion but some pieces of the puzzle are missing here :) [13:40:09] ema: s/we/you/ can carry on pooling in text servers, etc [13:40:48] sweet! 
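
The ramp-in procedure above (pool at weight 1, wait roughly five minutes, bump to 5, wait, bump to 10) written out as a rough shell helper; this is not an existing script, just the confctl invocations quoted in the discussion, run from a root shell on palladium:

    #!/bin/bash
    # ramp one new cache_mobile host in gradually
    host="$1"   # e.g. cp2001.codfw.wmnet

    for svc in nginx varnish-fe; do
        confctl --tags "dc=codfw,cluster=cache_mobile,service=${svc}" \
            --action set/pooled=yes:weight=1 "$host"
    done

    for weight in 5 10; do
        sleep 300   # the "wait 5 minutes" between bumps
        for svc in nginx varnish-fe; do
            confctl --tags "dc=codfw,cluster=cache_mobile,service=${svc}" \
                --action "set/weight=${weight}" "$host"
        done
    done
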
[13:40:50] the other two ongoing wtfs are the general .11 mess with authentication/sessions, and some kind of jobqueue-driven massive increase in purge traffic, which is unrelated [13:41:34] will proceed with cp2004 then [13:42:57] cp2004.codfw.wmnet: pooled changed no => yes [13:42:58] cp2004.codfw.wmnet: weight changed 1 => 1 [13:43:40] we can probably just pool the rest in at 10 like before, with some timing/spacing [13:44:18] cp2004.codfw.wmnet: weight changed 1 => 10 [13:47:41] cp2007.codfw.wmnet: pooled changed no => yes [13:47:42] cp2007.codfw.wmnet: weight changed 1 => 10 [13:52:24] cp2010.codfw.wmnet: pooled changed no => yes [13:52:24] cp2010.codfw.wmnet: weight changed 1 => 10 [13:57:17] cp2013.codfw.wmnet: pooled changed no => yes [13:57:18] cp2013.codfw.wmnet: weight changed 1 => 10 [14:02:14] cp2016.codfw.wmnet: pooled changed no => yes [14:02:15] cp2016.codfw.wmnet: weight changed 1 => 10 [14:07:20] cp2019.codfw.wmnet: pooled changed no => yes [14:07:21] cp2019.codfw.wmnet: weight changed 1 => 10 [14:13:05] cp2023.codfw.wmnet: pooled changed no => yes [14:13:05] cp2023.codfw.wmnet: weight changed 1 => 10 [14:13:22] and that's it, all nodes added [14:13:55] now I'll grab a bite and after that I'll start removing mobile nodes [14:17:35] \o/ [14:20:37] 10Traffic, 6operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1955351 (10elukey) Something that might be interesting: https://httpd.apache.org/docs/2.4/mod/event.html#how-it-works Disabling mod_deflate could be good if we plan to te... [14:36:16] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955365 (10BBlack) 3NEW [14:39:15] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955373 (10BBlack) From looking at runJob logs, I've initially started to suspect something related to htmlCa... [14:49:58] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955387 (10BBlack) (also, note for posterity: https://gerrit.wikimedia.org/r/#/c/265713/ was related to this:... [15:08:18] ema rocks ;) [15:09:39] quick question from T124195 - would it be worth to test mod-event for our apaches? If we disable mod-deflate it might give us some benefits (also had a chat with ori about this) [15:13:23] maybe, who knows :) [15:13:39] I think in the long term, though, we're trying to get rid of apache on the mediawiki hosts [15:24:44] yeah I know ori told me, but I still believe that we could gain some benefit from our old friend httpd :) [15:29:05] bblack: at your signal! [15:29:59] ema: go for it [15:32:19] cp2003.codfw.wmnet: pooled changed yes => no [15:33:04] after, we can go ahead and commit this to puppet too (as in, remove [nginx, varnish-fe] from the 4x legacy mobile machines in the same file as before) [15:33:07] Just In Case [15:33:12] bblack: depooling both nginx and varnish-fe right? 
[15:33:17] ema: yes [15:38:15] cp2009.codfw.wmnet: pooled changed yes => no [15:40:08] incoming network traffic on depooled nodes does not really go down [15:40:28] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp2003.codfw.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Mobile+caches+codfw [15:41:11] well it was already fairly low [15:42:14] ipvs looks right [15:42:19] right, and the number of established TCP connections definitely went down :) [15:42:33] ~1500 vs ~60 [15:43:44] cp2015.codfw.wmnet: pooled changed yes => no [15:43:45] another confusing aspect of this, is that we're in a natural ramp-up period right now [15:44:00] (as in, normally mobile traffic in codfw would be on a massive upswing due to normal daily cycles) [15:48:49] cp2021.codfw.wmnet: pooled changed yes => no [15:49:55] bblack: https://gerrit.wikimedia.org/r/#/c/265742/ [15:50:08] all nodes done [15:50:31] (codfw) [15:51:10] ema: awesome :) [15:52:30] let me !log this [16:07:40] how are we depooling now? :) [16:07:54] * paravoid is worried will get rusty :) [16:08:06] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=no $host [16:08:39] and same story for service=varnish-fe [16:08:43] alright [16:08:46] good to know! [16:08:59] * ema finally answered a question :) [16:13:58] paravoid: there's also --find now too, since often a node has many services [16:14:02] e.g.: [16:14:20] confctl --find --action set/pooled=no cp2003.codfw.wmnet [16:14:27] (I think, I haven't run it like that yet myself) [16:14:43] but that should depool it from any cluster/service cp2003 is pooled in [16:17:53] conftool-sync re-removed the nodes, for example: [16:17:56] codfw: Removing node cp2021.codfw.wmnet from cluster cache_mobile/nginx [16:18:01] I assume that's OK [16:19:53] looks like everything is fine with confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get 're:.*' [16:29:27] well the difference is they're removed now instead of just depooled [16:29:40] but also importantly, it didn't change the definitions for the backend services, e.g.: confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-be --action get 're:.*' [16:29:52] so yeah all looks fine to me [16:31:30] so assuming no fallout/complaints over the weekend, next week we can step through the other DCs: ulsfo, eqiad, esams [16:31:58] with codfw having primed the hot objects in the eqiad backends over the weekend, ulsfo+eqiad pool changes can be done relatively-quickly, especially eqiad [16:32:06] as the misses won't commonly go to the applayer [16:32:19] for eqiad they'll fetch over the local network from the be cache [16:32:28] for ulsfo they'll fetch from eqiad be cache, which is relatively-small latency [16:32:40] esams we might want to go slow initially again [16:33:09] because (a) the latency hit pulling from eqiad caches is bigger + (b) the set of wikis/languages active in esams is going to differ more-substantially than the US sites do from each other [16:33:32] oh I wouldn't have thought of (b) [16:33:36] makes sense [16:34:48] bblack: at which point can we remove the backend services in codfw cache_mobile? [16:34:49] <_joe_> ema: confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get all [16:36:00] ema: really yeah but "all" puts it all in one line, vs "re:.*" has linebreaks [16:36:16] err whoops, I started one message and ended with another, that was meant for _joe_ :) [16:36:34] ema: really we're not going to, basically. 
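
For reference, a plausible way to double-check that a depool took effect end to end, built only from commands mentioned in this conversation (assuming confctl's get action accepts a plain node name the way set does; the pybal instrumentation URL comes from _joe_'s ":9090/pools" hint):

    # on palladium: confirm conftool state for both services
    confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get cp2003.codfw.wmnet
    confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-fe --action get cp2003.codfw.wmnet

    # on the relevant lvs200x host: what pybal believes, and what ipvs actually has
    curl -s http://localhost:9090/pools
    ipvsadm -Ln   # the depooled host should no longer appear as a real server
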
[16:36:59] ema: once we're comfortable with the state codfw is in, at all DCs... [16:37:38] the next step is to basically delete the cache_mobile service from pybal/LVS, and move its IP address to be just another one for the cache_text service [16:37:57] and then we can decom the cache_mobile machines, the role::cache::mobile definitions, etc, etc... [16:38:16] at which point we have 16 good cache machines we can reallocate elsewhere and reinstall/re-role ... [16:41:04] in esams in particular, we might want to swap them out for some others in other clusters, I'm not sure, I haven't really looked [16:41:22] but I know esams more than the other sites, we have a blend of hardware generations, and those mobile machines might be better than some of the active ones in other clusters [16:48:04] hmmm peeked at the hardware, guess not [16:49:35] (at least, they're not better than any cache_(text|upload) machines) [16:59:09] in the long, long term, my half-baked plans are that we'd rather have more large clusters and fewer small clusters [16:59:28] the ditching of cache_bits and now soon cache_mobile is part of that [17:00:01] really we could fold cache_misc into cache_text too, but only if we can get configuration sanity/refactoring/generation to a point where it's not a complete mess of conditionals in the VCL [17:00:24] the purpose of cache_misc is still unclear to me [17:00:26] cache_maps and cache_upload could possibly share storage too, but there's a lot to look at there on both sides [17:00:53] if everything just worked out perfectly, maybe we could get down to two large clusters for "upload/maps" + "everything else" [17:01:15] but there's a lot of things to sort out before that's even really a plan [17:01:52] ema: well in a basic sense, cache_text was fronting mediawiki and is the primary cache in that sense, and has a lot of custom hacks related to that [17:02:03] cache_mobile was a separate service much like cache_text just for the mobile sites [17:02:20] cache_upload is upload.wm.o (very different VCL and storage needs and contention, etc) [17:02:26] cache_misc is "everything else" [17:02:35] <_joe_> it's misc services [17:02:48] a bunch of miscellaneous services, not usually mediawiki, and not usually requiring much VCL support [17:03:17] they mostly flow through a cp cluster at all (as opposed to just being separately and directly exposed on the internet) to standardize our edge stuff [17:03:29] same TLS termination, same basic cache protection, same ways to mitigate DoS, etc [17:04:46] at this point cache_text isn't just mediawiki, either. it's also fronting some non-mediawiki services directly, e.g. restbase [17:05:45] back on cache_misc, then there's also several services that seem like they could/should be flowing through cache_misc, but intentionally aren't, because they're ops-critical [17:06:01] as in, we don't want an outage or misconfig of cache_misc to break the tools we monitor such things with, etc [17:06:21] right [17:07:06] looks like some hosts ended up in the "Misc Web caching cluster eqiad" on ganglia and they shouldn't have?
https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Misc+Web+caching+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [17:07:33] yeah, there's some issues with ganglia [17:08:14] templates/varnish/misc-backend.inc.vcl.erb gives an idea all the services flowing through cache_misc [17:09:35] because cache_misc has so many separate unique combinations of "this public domainname maps to this backend service host(s), with this very simple/standard other VCL config", it's the prime candidate for working on: https://phabricator.wikimedia.org/T110717 [17:09:57] basically auto-generate the per-service basics from declarative config [17:10:57] nice [17:11:21] one of many "nice idea" tickets that are low in the priority bin heh [17:17:55] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955932 (10akosiaris) Repeating @faidon's comment from the gerrit change > Why are we not owning this domain? I don't think we sho... [17:20:26] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955936 (10faidon) I discussed this with @JanZerebecki in person during the dev summit. I maintain that we should only be handling d... [18:53:32] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1956423 (10BBlack) p:5High>3Unbreak! This is getting worse now. vhtcpd can't forward messages as fast as... [20:29:25] 7domains, 6operations: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956930 (10Dzahn) p:5Triage>3Low [20:53:20] 7domains, 6operations: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956988 (10BBlack) Those seem amazingly low relative to our overall traffic rates... they might be candidates for parking, IMHO. I suspect in general typos are less-common than they used to be, because most peopl... [21:19:44] 10Traffic, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957100 (10Denniss) Happened again today: https://commons.wikimedia.org/wiki/File:KutlugAtaman.JPG Was overwritten, reverted by... [21:43:09] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957241 (10greg) Anything else needed here? Or is this complete now? [22:06:12] 10Traffic, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957364 (10BBlack) @Denniss - problems today are unrelated, they're from general random purge loss due to: T124418 [22:17:08] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1957410 (10BBlack) p:5Unbreak!>3High @ori cut the rate down a bit with: https://gerrit.wikimedia.org/r/26... 
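
Purely as an illustration of the declarative-config idea behind T110717 mentioned above (every key and hostname here is invented): describe each misc service once, and template the repetitive per-service VCL and backend definitions from that data instead of hand-writing them.

    # hypothetical entry in some future declarative cache_misc services file
    servicename:
      public_hostname: servicename.wikimedia.org
      backends: [backend-host.eqiad.wmnet]
      backend_port: 80
      caching: normal        # or "pass", etc.
      https_only: true
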
[22:29:23] 10Traffic, 6Zero, 6operations: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957475 (10BBlack) 3NEW [22:29:39] 10Traffic, 6Zero, 6operations: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957485 (10BBlack) [22:29:42] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957486 (10BBlack) [22:33:05] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957516 (10BBlack) [22:44:37] what a day! :P [22:57:29] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957632 (10hashar) git clone works for me over v6 :-) There is still one comment that I dont think is formally addressed: >>! In T100519#171061... [23:00:21] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957636 (10BBlack) I don't really understand that quoted comment, but the ferm rules do have destination addresses that work at this time, and th... [23:18:00] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957814 (10chasemp) 5Open>3Resolved that comment is out dated