[00:34:18] 7Varnish, 10MediaWiki-Vagrant: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#1954287 (10Mattflaschen) 3NEW [04:11:32] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1954591 (10Volker_E) +1 WFM too! Awesome, thanks all people involved! > time GIT_SSH_COMMAND="ssh -v" git clone ssh://vcs@git-ssh.wikimedia.org/... [06:09:21] 7HTTPS, 6Analytics-Kanban, 6operations, 5Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1954687 (10leila) @Ottomata I checked couple of tables that I knew and the diversity of hashed IPs looks healthy. I also looked at two of Tilma... [10:09:48] ema: ping [10:10:29] so, 1.27.11 stuck on all the groups, we're good to go for codfw this morning [10:17:02] _joe_ has only switched ulsfo to the new etcd stuff so far, so actually we're still on the exact same file-based method for this initial codfw thing [10:17:49] but since I did make that change to VCL's hashing, I should re-check that it works manually first :) [10:18:04] <_joe_> bblack: if you want, we can migrate codfw as well [10:20:12] re-checking [done] [10:20:22] _joe_: yeah might as well [10:21:41] all at once or 2 by 2? :) [10:21:56] seems like ulsfo went pretty smooth [10:22:35] well 3 by 3 in codfw's case heh [10:22:55] the low-traffic config is a new thing and a little bigger though [10:27:05] _joe_: https://gerrit.wikimedia.org/r/265704 + https://gerrit.wikimedia.org/r/265705 [10:27:33] <_joe_> yeah looking :P [10:27:45] <_joe_> I'm doing 4 things in parallel, my scheduler is almost jammed [10:27:49] :) [10:29:22] <_joe_> bblack: actually I was thinking, we should use conf1002 for backups [10:29:30] <_joe_> and conf1001 for actives [10:29:46] <_joe_> it doesn't make sense to connect to the same host everywhere [10:30:15] <_joe_> (pending me finding the time to write a reconnection logic into pybal's etcd driver) [10:32:10] hmmm [10:32:19] yeah so it's been a while since I've looked at that stuff [10:32:38] <_joe_> bblack: it's ok for now anyways [10:32:51] I mean as far as the implications go [10:33:17] I guess if conf1001 died and couldn't be brought up quickly, and we needed to make a change, having backups on conf1002 would let us stop pybal on the primaries and still make changes? [10:33:38] <_joe_> yes [10:33:57] ok [10:34:06] <_joe_> but well, I'd prefer to make pybal able to know all servers available, and then cycle through them in case of failure [10:34:18] well sure [10:34:53] we could do it other ways, too [10:35:09] but I don't have a clear picutre what our long term multi-dc plan is for etcd [10:35:41] but we could have gdnsd do it even in one DC (it does things other than geoip) [10:35:50] I would say LVS could do it, but that gets circular heh [10:36:23] but gdnsd could be configured to poll http on confd100[12] and serve round-robin pair of IPs, and drop one from the set if 1/2 dies [10:36:34] for confd.svc.eqiad.wmnet or whatever [10:37:02] <_joe_> bblack: we already have a SRV record [10:37:11] <_joe_> we just have to make pybal support that :P [10:37:18] ok :) [10:37:21] <_joe_> as every other python app that connects to etcd does [10:37:25] <_joe_> and confd as well [10:37:48] well with the gdnsd-based thing, we could go beyond SRV though [10:37:58] <_joe_> oh, what do you mean? [10:38:22] <_joe_> having gdnsd do healthchecks? 
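
A minimal sketch of the SRV-based discovery _joe_ describes above, roughly as a Python client such as pybal could implement it (assuming dnspython is available; the record name and the fallback helper are illustrative, not the actual setup):

    # Resolve etcd endpoints from a SRV record and try them in order.
    # The record name below is a placeholder, not the real one.
    import dns.resolver

    def etcd_endpoints(srv_name='_etcd._tcp.eqiad.wmnet'):
        """Return (host, port) pairs ordered by SRV priority, then descending weight."""
        answers = dns.resolver.query(srv_name, 'SRV')  # dnspython >= 2.0 prefers resolve()
        ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))
        return [(str(r.target).rstrip('.'), r.port) for r in ordered]

    def connect_with_fallback(endpoints, connect):
        """Try each endpoint until one works -- the 'cycle through them on failure' idea."""
        last_exc = None
        for host, port in endpoints:
            try:
                return connect(host, port)
            except OSError as exc:  # a real client would catch its own connection errors
                last_exc = exc
        raise last_exc
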
[10:38:41] well yeah the thing I mentioned above is already doing healthchecks [10:39:19] <_joe_> the ideal would be having the srv record just reporting whichever servers are up atm [10:39:21] but I mean, we could have a pair of confd listener IPs at each DC (when we have x-dc etcd sync somehow), and have gdnsd healthcheck them all from each authdns, and serve only the healthiest/closest one [10:39:44] so ulsfo clients get 1-2 healthy ulsfo IPs, or if both ulsfo seem to be offline they get eqiad IPs, etc... [10:40:08] <_joe_> as for multi-dc, I'm still heavily undecided. on one side, I want to try the 2-2-1 cluster in eqiad/codfw/ulsfo [10:40:27] <_joe_> OTOH, I think that's a pretty complex setup [10:41:05] etcd isn't really multi-dc aware right? [10:41:12] <_joe_> no [10:41:24] <_joe_> raft in general doesn't play well with uneven topologies [10:41:27] even just 2x DCs fixes a lot of problems, eqiad+codfw or whatever [10:41:39] <_joe_> I'm thinking of "stealing" the etcd replication agent I wrote when at JOB~1 [10:41:44] and eqiad+codfw are the pair that are best-connected and most-critical anyways, raft could be ok over that link [10:41:46] <_joe_> where I had active/backup [10:42:00] <_joe_> bblack: we need a tiebreaker though [10:42:05] <_joe_> or we risk split-brains [10:42:23] too bad we don't have a misc machine at eqord :) [10:42:31] <_joe_> heh! [10:42:36] <_joe_> yeah that would be ideal [10:44:14] well if you had total contol over etcd's implementation, you could implement naive tiebreakers too [10:44:37] as in "in case of 50%, tiebreak by succeeding at pinging this IP" or whatever [10:45:22] (where the IP would be cr1-eqord) [10:45:31] morning! [10:45:35] hey ema [10:45:35] <_joe_> hi ema [10:45:49] <_joe_> you managed to wake up later than brandon? :P [10:46:12] I set an early alarm so I could figure out if we're ready to go for an early friday codfw mobile switch heh [10:46:15] <_joe_> well, bblack has been horribly early today :) [10:46:20] because I hate doing things like that later friday [10:46:24] haha no, I had stuff to do this morning :) [10:46:55] <_joe_> ema: no one would be upset, we've hired YuviPanda so that anyone's life cycle seems sane in comparison [10:48:28] ok I'm gonna double-check the codfw confd -vs- files and see how lvs200[456] go [10:49:15] _joe_: also there's this on puppet-merge lately: [10:49:15] Now running conftool-merge to sync any changes to conftool data [10:49:15] Running conftool-sync on /etc/conftool/data [10:49:15] WARNING:conftool:Setting datacenters to the default value ['eqiad', 'codfw'] [10:49:18] WARNING:conftool:Setting default_values to the default value {'pooled': 'no', 'weight': 0} [10:49:21] WARNING:conftool:Service citoid not found, skipping [10:49:24] I think someone messed up a config, probably citoid, in puppet? [10:49:40] it happens on every run [10:51:06] are we going to use the file-based mechanism in codfw or conftool? 
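
And a rough sketch of the gdnsd-based variant discussed above, using gdnsd's built-in monitoring plus an address-failover plugin; the syntax is approximate and the port, URL path and addresses are placeholders rather than the real conf100x setup:

    # daemon config: healthcheck the confd/etcd hosts over HTTP
    service_types => {
        confd_http => {
            plugin => http_status,
            port => 8000,         # placeholder
            url_path => /health,  # placeholder
            interval => 10,
            timeout => 3,
        }
    }

    # multifo: answer with all addresses that are currently healthy,
    # i.e. the "round-robin pair, drop one if it dies" behavior
    plugins => {
        multifo => {
            confd-eqiad => {
                service_types => confd_http,
                conf1001 => 192.0.2.11,  # placeholder addresses
                conf1002 => 192.0.2.12,
            }
        }
    }

The zonefile side would then be a DYNA record pointing at that resource, something like "confd 300 DYNA multifo!confd-eqiad"; the cross-DC "healthiest/closest" variant would presumably layer the geoip/metafo plugins on top of per-DC resources like this one.
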
[10:52:34] <_joe_> bblack: that's someone who misconfigured conftool-data [10:52:47] <_joe_> let me look in a few [10:53:31] <_joe_> so I bet they added servers to a citoid service on cluster scb [10:53:33] ema: conftool, but not done switching pybal to it yet [10:53:37] <_joe_> but didn't add the service stanza [10:54:37] codfw backup LVS confirmed the easy way [10:55:08] the easy way being: puppet agent -t; ipvsadm -Ln >x; service pybal restart; sleep 10; ipvsadm -Ln >y; diff -u x y [10:55:40] <_joe_> bblack: well I also did some magic to exclude the connections numbers [10:55:48] <_joe_> or that is hardly helpful [10:55:50] well on backup LVS they're all zero :) [10:55:55] <_joe_> ahha ok [11:10:31] hmm I had a diff on lvs2003, for 10.192.32.61 ( mw2173 ) [11:10:42] I'm guessing commented out on palladium, but conftool not updated [11:11:18] hmm, nope [11:12:19] <_joe_> uhm what is that about? [11:12:23] apparently mw2173 was not in the backends list on lvs2003 before the change, but is now. It was there before+after on lvs2006, and the file isn't recent and has it [11:12:41] <_joe_> so I guess an ipvsadm failure? [11:12:50] <_joe_> every machine has been rebooted yesterday [11:12:55] I picked up a diff line [11:12:56] +-> 10.192.32.61:80 Route 10 [11:12:58] <_joe_> maybe ipvsadm failed to add it back [11:13:03] maybe! [11:16:25] the rest all checked out, all codfw lvs on etd now [11:16:31] *etcd [11:16:53] <_joe_> \o/ [11:17:02] <_joe_> we should probably !log it [11:17:49] <_joe_> so, how are you going to proceed with mobile? [11:18:31] so basically, we're going to add lines to codfw's cache_mobile for all the cache_text machines, but only for services [nginx, varnish-fe], not be and be-rand [11:18:44] and puppet that out, which will add them all with pooled=no and do nothing in practice [11:19:02] then to ramp in a text node, set pooled=yes for those two services for that text hostname [11:19:19] to ramp out a legacy mobile no, set pooled=no for the mobile node only for those two servers (not be/be-rand) [11:19:27] s/no,/node,/ [11:19:50] I'm kinda assuming it's ok for a machine to have it is [11:20:16] <_joe_> sorry, but why just the frontend? [11:20:33] we're remapping this at the LVS layer, but at the varnish level the two clusters are still distinct [11:20:39] <_joe_> oh! [11:20:52] <_joe_> so LVS is shared, the backend varnish is not [11:20:52] the state we're moving towards here is a temporary one [11:20:55] <_joe_> wicked! [11:21:01] it's not shared, it's moving completely [11:21:16] <_joe_> sorry, bbiab [11:21:22] <_joe_> I have to deploy a mediawiki change [11:21:24] at the end of this process, at the LVS layer cache_mobile and cache_text will have the same node lists [11:21:27] ok [11:21:47] (and then when we're comfortable, we can kill cache_mobile and just have its IPs be another set of IPs for cache_text) [11:22:10] "the same node lists" above being just the current text machines [11:22:37] bblack: so is one of the advantages of conftool vs. text file that we can have machines with pooled=no? 
[11:22:42] but cache_mobile at the LVS layer temporarily has both text and mobile machines while doing the switch, to soften the cache-miss blow [11:23:03] ema: no, we can depool from the files too [11:23:18] there's no functional change in that sense [11:23:50] it just gets all of this under one tool with a commandline [11:24:07] in the distant past, we had completely separate tools for pool/depool of cache machines at different levels: [11:24:13] 1) the files on palladium for pybal/LVS [11:24:15] ah, enabled: True/False? [11:24:45] 2) lists of nodes in puppet for varnish<->varnish, which required puppet runs to effectively add/delete/pool/depool/re-weight, etc [11:25:26] so "completely disable cp1065" meant disabling/removing in the file on palladium for LVS, and committing a nodelist change to puppet, and making sure puppet ran on all the relevant machines, etc [11:25:52] in the new world order, both are hooked up to etcd [11:26:14] you can use confctl to make etcd change for pooled/weight for both LVS and varnish purposes (and any others, it's a generic service) [11:26:34] but the varnish half switched over first, now we're finally switching over the LVS/pybal part [11:26:45] aha! [11:27:20] if you touch the pooled status on the services varnish-be or varnish-be-rand, that triggers a templated VCL change + VCL reload on the relevant machines to affect the varnish backend lists, basically. [11:28:21] and just recently, confctl gained a --find option too [11:28:38] so now we can with a single command depool all services for host cp1065.eqiad.wmnet too [11:29:02] next stop is hook that into shutdown scripts for the host, and/or for each service, too [11:29:51] if we integrate with the service init scripts / unit files, we probably want that to be smart-ish [11:29:57] so that it doesn't interfere with local manual hacks [11:29:59] uh so if you eg. stop varnish the machine gets depooled? [11:30:06] yeah [11:30:27] we might want some file in /etc to control that though, for when someone's trying to be smarter than the tools manually [11:30:50] * ema nods [11:30:58] e.g. if you want to manually depool a node and then restart varnish a few times testing something, you don't want it self-repooling [11:31:19] and we probably don't want nodes self-pooling on a fresh boot by default either [11:32:10] but it seems like depool-on-stop for services, and/or depool-all-on-shutdown, are universally good ideas [11:32:31] <_joe_> bblack: that's usually done with piggyback service units [11:32:39] <_joe_> that you can mask/unmask at will [11:32:43] ah [11:32:55] I was just assuming ExecStartPre= and such [11:33:29] anyways, so on with the show! [11:34:36] ema: so, puppet repo, conftool-data/nodes/codfw.yaml [11:34:45] yes!
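
A hypothetical sketch of what the depool-on-stop hook could look like as a systemd drop-in (the unit name and guard-file path are invented, and the confctl --find form is the new option mentioned above; _joe_'s piggyback-unit approach would keep this out of the main unit entirely):

    # e.g. /etc/systemd/system/varnish-frontend.service.d/pool.conf (hypothetical)
    [Service]
    # repool on a clean start, unless an operator left a "hands off" marker file
    ExecStartPost=/bin/sh -c 'test -e /etc/conftool-no-autopool || confctl --find --action set/pooled=yes "$(hostname -f)"'
    # depool whenever the service stops; a real version would probably scope this
    # to just the affected service rather than every service on the host
    ExecStopPost=/bin/sh -c 'test -e /etc/conftool-no-autopool || confctl --find --action set/pooled=no "$(hostname -f)"'
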
[11:34:55] cache_mobile stanza [11:34:57] basically copy all the cache_text nodes to cache_mobile, but reduce the service list in cache_mobile for them to just [nginx, varnish-fe] [11:35:34] merging that through shouldn't affect ipvsadm yet since by default conftool-sync adds new nodes in the depooled state [11:36:02] <_joe_> bblack: that depends on what you wrote in the service defs, though [11:36:08] well true :) [11:36:14] <_joe_> (and yes, it's depooled by default for all caches) [11:36:19] right, and we can pool them one by one afterwards [11:36:33] right [11:36:53] <_joe_> oh jesus, every reboot cycle in codfw renders dead appservers [11:37:35] ema: and on palladium, the confctl command to pool up cp2001 is: [11:37:38] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes cp2001.codfw.wmnet [11:37:46] confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-fe --action set/pooled=yes cp2001.codfw.wmnet [11:37:58] which flips on port 443 and port 80 respectively [11:38:03] <_joe_> also add the weight [11:38:17] <_joe_> confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes:weight=10 [11:38:25] oh right, I forgot we used weights last time [11:38:32] we have default weights, but we mucked with them on codfw last time around [11:38:34] for the first machine we did [11:38:57] anyways, same commands with set/weight=X to do weights [11:39:15] we should probably do like last time and bump them to 10 and ramp in the first one, etc [11:39:28] <_joe_> which machine did you add? [11:39:38] when? [11:39:48] none yet, there's a puppet commit to go yet [11:39:54] <_joe_> ahhh ok [11:40:49] ema: another helpful command: [11:40:50] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get 're:.*' [11:40:59] will list them all, with linebreaks [11:42:37] oh, no notifications from grrrit-wm here? [11:42:59] https://gerrit.wikimedia.org/r/265710 [11:43:16] yeah no grrrit-wm here :) [11:43:28] alright, we'll survive [11:43:43] should I move https://phabricator.wikimedia.org/T109286 to In Progress? [11:44:07] isn't https://phabricator.wikimedia.org/T124165 now done? [11:44:17] yeah just fixed that one :) [11:44:18] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1955124 (10BBlack) [11:44:21] 10Traffic, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 6operations, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1955121 (10BBlack) 5Open>3Resolved a:3BBlack [11:46:22] ema: the workboard columns are a mess :) [11:46:32] they're mostly wishful thinking these days! :) [11:46:42] cool, then I'll move it! :) [11:47:16] Hi bblack, Just saw that blocking tasks for 'mobile cache merge into text' were done ... Will the move happen early necxt weeke? [11:47:33] joal: it's starting today, but it won't finish until sometime next week [11:48:02] Wow, starting on a Friday, you guys don't value your weekends enough ;) [11:48:23] ok bblack [11:48:38] I'll monitor on out side [11:48:42] Thanks for the update [11:48:47] well we're starting early on a friday [11:48:59] you won't see much change, we're only flipping the smallest dc traffic-wise [11:49:05] yup [11:49:07] bblack: do we need to set weights for service=varnish-fe as well? [11:49:17] it's probably enough to set it on nginx alone right? 
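
The conftool-data change being described, sketched out (the exact layout of conftool-data/nodes/codfw.yaml may differ; hosts and service names are the ones from this discussion):

    codfw:
      cache_text:
        cp2001.codfw.wmnet: [nginx, varnish-fe, varnish-be, varnish-be-rand]
        # ... the other text hosts, unchanged ...
      cache_mobile:
        # legacy mobile hosts keep their full service list for now
        cp2003.codfw.wmnet: [nginx, varnish-fe, varnish-be, varnish-be-rand]
        # text hosts are added here with only the LVS-facing services,
        # so the two varnish backend clusters stay distinct
        cp2001.codfw.wmnet: [nginx, varnish-fe]
        # ... the remaining text hosts likewise ...
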
[11:49:20] When I see changes, it means the end of it is soom [11:49:23] ema: yeah basically everything we do today has to be done 2x for varnish-fe + nginx [11:49:35] ema: oh weights, yeah [11:49:44] I guess that doesn't matter since it's mostly redirects [11:50:05] I mean I don't mind setting them everywhere, I'm just trying to understand :) [11:53:45] <_joe_> joal: well, at least if the pain on the weekends is self-inflicted, it's justified [11:54:12] <_joe_> last weekend I worked for the whole saturday morning basically, and was on vacation from friday to monday :P [11:54:12] I can understand _joe_ :) [11:54:26] <_joe_> and... it wasn't our fault, so... [11:54:39] right [11:55:52] _joe_: thanks for updating https://wikitech.wikimedia.org/wiki/Conftool [11:56:32] <_joe_> ema: we should keep adding things, please take the time to do it or ask me if something is not clear [11:56:42] <_joe_> I'm notoriously bad at documentation [11:56:52] <_joe_> basically exposes my lazyness [12:00:06] alright! 265710 merged, can I go ahead and puppet-merge ; conftool-merge? [12:03:44] bblack: ^ Should I also set all weights to 10 in dc=codfw,cluster=cache_text,service=nginx? [12:04:04] not cache_text. nothing we do is in cache_text [12:04:10] but yes [12:06:14] ERROR:conftool:load_node Backend error while loading node: Backend error: The request requires user authentication : Insufficient credentials [12:06:20] when running puppet-merge [12:06:38] bblack: and yes sorry about the cache_text question, I got confused :) [12:07:26] ema: in a root "sudo -i" session on palladium, right? [12:07:33] I've never run it any other way [12:07:47] as root, yes [12:07:57] as in 'sudo puppet-merge' [12:09:12] bblack: https://phabricator.wikimedia.org/P2517 [12:09:14] that's different [12:09:40] I'm running a puppet-merge now in any case [12:09:50] it's syncing your data fine [12:10:00] don't use "sudo cmd", use "sudo -i" to get a real root shell, then run cmd [12:10:26] sounds good [12:18:09] bblack: let me know when I can start pooling [12:22:06] <_joe_> ema: yeah that's a known problem I should fix [12:22:31] <_joe_> I'm evaluating removing automatic run of conftool-merge from puppet-merge [12:22:37] <_joe_> while I fix it [12:25:08] ema: so jsut to confirm, we're at weight=10 on the existing pooled mobile machines, weight=1 on all the new ones, and you're going to pool cp2001 w/ weight=1 [12:25:11] right? [12:25:33] correct [12:29:43] bblack: confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=yes:weight=1 cp2001.codfw.wmnet [12:29:58] and same story for service=varnish-fe [12:30:27] I'll !log and start with cp2001 as soon as you give me your OK [12:30:33] ok! [12:32:06] <_joe_> ema: http://config-master.wikimedia.org/conftool/codfw/mobile-https and http://config-master.wikimedia.org/conftool/codfw/mobile give a decent idea of the situation [12:32:26] <_joe_> (might not be very freshly updated though) [12:32:53] cp2001.codfw.wmnet: pooled changed no => yes [12:33:14] <_joe_> { 'host': 'cp2001.codfw.wmnet', 'weight':1, 'enabled': True } in that file :) [12:33:30] if everything is fine, in 5 minutes (or longer?) 
I'll bump the weight to 5 [12:33:49] yeah I'd say wait 5 minutes go to 5, wait 5 minutes go to 10 [12:33:59] and then once again we'll pause there with just 1 machine for a while and let caches fill [12:34:16] alright [12:34:48] <_joe_> ema: welcome to the WMF: http://45.media.tumblr.com/5aa1c685be2bcaad16ebeb3944e11847/tumblr_n2zkwrWLvQ1qcj7x4o1_500.gif [12:34:52] _joe_: would be nice to sort those by enabled, weight [12:35:10] <_joe_> ema: that's just an horrible go text/template [12:35:29] <_joe_> if you want something better, you can query the http interface of pybal on a relevant LVS host [12:35:37] <_joe_> curl :9090/pools [12:38:53] cp2001.codfw.wmnet: weight changed 1 => 5 [12:40:40] _joe_: confctl output does not mention the dc, might be nice to add. eg: cp2001.codfw.wmnet: weight changed 1 => 5 (dc=codfw,cluster=cache_mobile,service=nginx) [12:43:20] <_joe_> ema: I agree, would you mind opening a ticket? [12:43:37] will do, migration first :) [12:43:53] cp2001.codfw.wmnet: weight changed 5 => 10 [12:45:18] bblack: cp2001 done, let's wait for caches to fill up [12:46:32] ema: yeah plus we should pause anyways on what's going on in -sec [12:48:05] <_joe_> yup [12:48:11] <_joe_> SNAFU [13:05:27] ema: aside from all the general wtf going on, there's a risk they may still roll back from .11 to .10, which would undo the mobile purge fixes as a side-effect [13:05:48] but ori's merging a cherrypick to .10 now, so that should insure us against a rollback [13:20:22] bblack: that's a very effective tl;dr [13:33:49] "all the general wtf"? :) [13:37:03] yeah that too [13:37:42] but no I was refering to: they might rollback from .11 to .10 and we would be screwed. Merging a cherrypick to .10 though so we're good. [13:38:19] so yeah we can continue on our own schedule now basically [13:38:27] we kinda know what the other wtfs are approximately [13:38:50] s/we/bblack/ [13:39:29] as in, I've tried to follow the discussion but some pieces of the puzzle are missing here :) [13:40:09] ema: s/we/you/ can carry on pooling in text servers, etc [13:40:48] sweet! 
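
The ramp-in procedure above (pool at weight 1, wait roughly five minutes, bump to 5, wait, bump to 10) written out as a rough shell helper; this is not an existing script, just the confctl invocations quoted in the discussion, run from a root shell on palladium:

    #!/bin/bash
    # ramp one new cache_mobile host in gradually
    host="$1"   # e.g. cp2001.codfw.wmnet

    for svc in nginx varnish-fe; do
        confctl --tags "dc=codfw,cluster=cache_mobile,service=${svc}" \
            --action set/pooled=yes:weight=1 "$host"
    done

    for weight in 5 10; do
        sleep 300   # the "wait 5 minutes" between bumps
        for svc in nginx varnish-fe; do
            confctl --tags "dc=codfw,cluster=cache_mobile,service=${svc}" \
                --action "set/weight=${weight}" "$host"
        done
    done
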
[13:40:50] the other two ongoing wtfs are the general .11 mess with authentication/sessions, and some kind of jobqueue-driven massive increase in purge traffic, which is unrelated [13:41:34] will proceed with cp2004 then [13:42:57] cp2004.codfw.wmnet: pooled changed no => yes [13:42:58] cp2004.codfw.wmnet: weight changed 1 => 1 [13:43:40] we can probably just pool the rest in at 10 like before, with some timing/spacing [13:44:18] cp2004.codfw.wmnet: weight changed 1 => 10 [13:47:41] cp2007.codfw.wmnet: pooled changed no => yes [13:47:42] cp2007.codfw.wmnet: weight changed 1 => 10 [13:52:24] cp2010.codfw.wmnet: pooled changed no => yes [13:52:24] cp2010.codfw.wmnet: weight changed 1 => 10 [13:57:17] cp2013.codfw.wmnet: pooled changed no => yes [13:57:18] cp2013.codfw.wmnet: weight changed 1 => 10 [14:02:14] cp2016.codfw.wmnet: pooled changed no => yes [14:02:15] cp2016.codfw.wmnet: weight changed 1 => 10 [14:07:20] cp2019.codfw.wmnet: pooled changed no => yes [14:07:21] cp2019.codfw.wmnet: weight changed 1 => 10 [14:13:05] cp2023.codfw.wmnet: pooled changed no => yes [14:13:05] cp2023.codfw.wmnet: weight changed 1 => 10 [14:13:22] and that's it, all nodes added [14:13:55] now I'll grab a bite and after that I'll start removing mobile nodes [14:17:35] \o/ [14:20:37] 10Traffic, 6operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1955351 (10elukey) Something that might be interesting: https://httpd.apache.org/docs/2.4/mod/event.html#how-it-works Disabling mod_deflate could be good if we plan to te... [14:36:16] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955365 (10BBlack) 3NEW [14:39:15] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955373 (10BBlack) From looking at runJob logs, I've initially started to suspect something related to htmlCa... [14:49:58] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955387 (10BBlack) (also, note for posterity: https://gerrit.wikimedia.org/r/#/c/265713/ was related to this:... [15:08:18] ema rocks ;) [15:09:39] quick question from T124195 - would it be worth to test mod-event for our apaches? If we disable mod-deflate it might give us some benefits (also had a chat with ori about this) [15:13:23] maybe, who knows :) [15:13:39] I think in the long term, though, we're trying to get rid of apache on the mediawiki hosts [15:24:44] yeah I know ori told me, but I still believe that we could gain some benefit from our old friend httpd :) [15:29:05] bblack: at your signal! [15:29:59] ema: go for it [15:32:19] cp2003.codfw.wmnet: pooled changed yes => no [15:33:04] after, we can go ahead and commit this to puppet too (as in, remove [nginx, varnish-fe] from the 4x legacy mobile machines in the same file as before) [15:33:07] Just In Case [15:33:12] bblack: depooling both nginx and varnish-fe right? 
[15:33:17] ema: yes [15:38:15] cp2009.codfw.wmnet: pooled changed yes => no [15:40:08] incoming network traffic on depooled nodes does not really go down [15:40:28] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp2003.codfw.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Mobile+caches+codfw [15:41:11] well it was already fairly low [15:42:14] ipvs looks right [15:42:19] right, and the number of established TCP connections definitely went down :) [15:42:33] ~1500 vs ~60 [15:43:44] cp2015.codfw.wmnet: pooled changed yes => no [15:43:45] another confusing aspect of this, is that we're in a natural ramp-up period right now [15:44:00] (as in, normally mobile traffic in codfw would be on a massive upswing due to normal daily cycles) [15:48:49] cp2021.codfw.wmnet: pooled changed yes => no [15:49:55] bblack: https://gerrit.wikimedia.org/r/#/c/265742/ [15:50:08] all nodes done [15:50:31] (codfw) [15:51:10] ema: awesome :) [15:52:30] let me !log this [16:07:40] how are we depooling now? :) [16:07:54] * paravoid is worried will get rusty :) [16:08:06] confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action set/pooled=no $host [16:08:39] and same story for service=varnish-fe [16:08:43] alright [16:08:46] good to know! [16:08:59] * ema finally answered a question :) [16:13:58] paravoid: there's also --find now too, since often a node has many services [16:14:02] e.g.: [16:14:20] confctl --find --action set/pooled=no cp2003.codfw.wmnet [16:14:27] (I think, I haven't run it like that yet myself) [16:14:43] but that should depool it from any cluster/service cp2003 is pooled in [16:17:53] conftool-sync re-removed the nodes, for example: [16:17:56] codfw: Removing node cp2021.codfw.wmnet from cluster cache_mobile/nginx [16:18:01] I assume that's OK [16:19:53] looks like everything is fine with confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get 're:.*' [16:29:27] well the difference is they're removed now instead of just depooled [16:29:40] but also importantly, it didn't change the definitions for the backend services, e.g.: confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-be --action get 're:.*' [16:29:52] so yeah all looks fine to me [16:31:30] so assuming no fallout/complaints over the weekend, next week we can step through the other DCs: ulsfo, eqiad, esams [16:31:58] with codfw having primed the hot objects in the eqiad backends over the weekend, ulsfo+eqiad pool changes can be done relatively-quickly, especially eqiad [16:32:06] as the misses won't commonly go to the applayer [16:32:19] for eqiad they'll fetch over the local network from the be cache [16:32:28] for ulsfo they'll fetch from eqiad be cache, which is relatively-small latency [16:32:40] esams we might want to go slow initially again [16:33:09] because (a) the latency hit pulling from eqiad caches is bigger + (b) the set of wikis/languages active in esams is going to differ more-substantially than the US sites do from each other [16:33:32] oh I wouldn't have thought of (b) [16:33:36] makes sense [16:34:48] bblack: at which point can we remove the backend services in codfw cache_mobile? [16:34:49] <_joe_> ema: confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get all [16:36:00] ema: really yeah but "all" puts it all in one line, vs "re:.*" has linebreaks [16:36:16] err whoops, I started one message and ended with another, that was meant for _joe_ :) [16:36:34] ema: really we're not going to, basically. 
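
For reference, a plausible way to double-check that a depool took effect end to end, built only from commands mentioned in this conversation (assuming confctl's get action accepts a plain node name the way set does; the pybal instrumentation URL comes from _joe_'s ":9090/pools" hint):

    # on palladium: confirm conftool state for both services
    confctl --tags dc=codfw,cluster=cache_mobile,service=nginx --action get cp2003.codfw.wmnet
    confctl --tags dc=codfw,cluster=cache_mobile,service=varnish-fe --action get cp2003.codfw.wmnet

    # on the relevant lvs200x host: what pybal believes, and what ipvs actually has
    curl -s http://localhost:9090/pools
    ipvsadm -Ln   # the depooled host should no longer appear as a real server
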
[16:36:59] ema: once we're comfortable with the state codfw is in, at all DCs... [16:37:38] the next step is to basically delete the cache_mobile service from pybal/LVS, and move its IP address to be just another one for the cache_text service [16:37:57] and then we can decom the cache_mobile machines, the role::cache::mobile definitions, etc, etc... [16:38:16] at which point we have 16 good cache machines we can reallocate elsewhere and reinstall/re-role ... [16:41:04] in esams in particular, we might want to swap them out for some others in other clusters, I'm not sure, I haven't really looked [16:41:22] but I know esams more than the other sites, we have a blend of hardware generations, and those mobile machines might be better than some of the active ones in other clusters [16:48:04] hmmm peeked at the hardware, guess not [16:49:35] (at least, they're not better than any cache_(text|upload) machines) [16:59:09] in the long, long term, my half-baked plans are that we'd rather have more large clusters and fewer small clusters [16:59:28] the ditching of cache_bits and now soon cache_mobile is part of that [17:00:01] really we could fold cache_misc into cache_text too, but only if we can get configuration sanity/refactoring/generation to a point where it's not a complete mess of conditionals in the VCL [17:00:24] the purpose of cache_misc is still unclear to me [17:00:26] cache_maps and cache_upload could possibly share storage too, but there's a lot to look at there on both sides [17:00:53] if everything just worked out perfectly, maybe we could get down to two large clusters for "upload/maps" + "everything else" [17:01:15] but there's a lot of things to sort out before that's even really a plan [17:01:52] ema: well in a basic sense, cache_text was fronting mediawiki and is the primary cache in that sense, and has a lot of custom hacks related to that [17:02:03] cache_mobile was a separate service much like cache_text just for the mobile sites [17:02:20] cache_upload is upload.wm.o (very different VCL and storage needs and contention, etc) [17:02:26] cache_misc is "everything else" [17:02:35] <_joe_> it's misc services [17:02:48] a bunch of miscellaneous services, not usually mediawiki, and not usually requiring much VCL support [17:03:17] they mostly flow through a cp cluster at all (as opposed to just being separately and directly exposed on the internet) to standardize our edge stuff [17:03:29] same TLS termination, same basic cache protection, same ways to mitigate DoS, etc [17:04:46] at this point cache_text isn't just mediawiki, either. it's also fronting some non-mediawiki services directly, e.g. restbase [17:05:45] back on cache_misc, then there's also several services that seem like they could/should be flowing through cache_misc, but intentionally aren't, because they're ops-critical [17:06:01] as in, we don't want an outage or misconfig of cache_misc to break the tools we monitor such things with, etc [17:06:21] right [17:07:06] looks like some hosts ended up in the "Misc Web caching cluster eqiad" on ganglia and they shouldn't have?
https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Misc+Web+caching+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [17:07:33] yeah, there's some issues with ganglia [17:08:14] templates/varnish/misc-backend.inc.vcl.erb gives an idea all the services flowing through cache_misc [17:09:35] because cache_misc has so many separate unique combinations of "this public domainname maps to this backend service host(s), with this very simple/standard other VCL config", it's the prime candidate for working on: https://phabricator.wikimedia.org/T110717 [17:09:57] basically auto-generate the per-service basics from declarative config [17:10:57] nice [17:11:21] one of many "nice idea" tickets that are low in the priority bin heh [17:17:55] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955932 (10akosiaris) Repeating @faidon's comment from the gerrit change > Why are we not owning this domain? I don't think we sho... [17:20:26] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1955936 (10faidon) I discussed this with @JanZerebecki in person during the dev summit. I maintain that we should only be handling d... [18:53:32] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1956423 (10BBlack) p:5High>3Unbreak! This is getting worse now. vhtcpd can't forward messages as fast as... [20:29:25] 7domains, 6operations: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956930 (10Dzahn) p:5Triage>3Low [20:53:20] 7domains, 6operations: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1956988 (10BBlack) Those seem amazingly low relative to our overall traffic rates... they might be candidates for parking, IMHO. I suspect in general typos are less-common than they used to be, because most peopl... [21:19:44] 10Traffic, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957100 (10Denniss) Happened again today: https://commons.wikimedia.org/wiki/File:KutlugAtaman.JPG Was overwritten, reverted by... [21:43:09] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957241 (10greg) Anything else needed here? Or is this complete now? [22:06:12] 10Traffic, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1957364 (10BBlack) @Denniss - problems today are unrelated, they're from general random purge loss due to: T124418 [22:17:08] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 6operations: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1957410 (10BBlack) p:5Unbreak!>3High @ori cut the rate down a bit with: https://gerrit.wikimedia.org/r/26... 
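
Purely as an illustration of the declarative-config idea behind T110717 mentioned above (every key and hostname here is invented): describe each misc service once, and template the repetitive per-service VCL and backend definitions from that data instead of hand-writing them.

    # hypothetical entry in some future declarative cache_misc services file
    servicename:
      public_hostname: servicename.wikimedia.org
      backends: [backend-host.eqiad.wmnet]
      backend_port: 80
      caching: normal        # or "pass", etc.
      https_only: true
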
[22:29:23] 10Traffic, 6Zero, 6operations: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957475 (10BBlack) 3NEW [22:29:39] 10Traffic, 6Zero, 6operations: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#1957485 (10BBlack) [22:29:42] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957486 (10BBlack) [22:33:05] 10Traffic, 6Zero, 6operations, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1957516 (10BBlack) [22:44:37] what a day! :P [22:57:29] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957632 (10hashar) git clone works for me over v6 :-) There is still one comment that I dont think is formally addressed: >>! In T100519#171061... [23:00:21] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957636 (10BBlack) I don't really understand that quoted comment, but the ferm rules do have destination addresses that work at this time, and th... [23:18:00] 10Traffic, 6Phabricator, 6Release-Engineering-Team, 6operations, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1957814 (10chasemp) 5Open>3Resolved that comment is out dated