[11:57:47] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617323 (10Jan_Dittrich) > You can use eventlogging and wikimediaevents code at this time , there are quite > a bit of examples of how to run ab tests on discovery's code. My concern is mainly with...
[11:57:52] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617324 (10Jan_Dittrich) > You can use eventlogging and wikimediaevents code at this time , there are quite > a bit of examples of how to run ab tests on discovery's code. My concern is mainly with...
[12:57:41] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617370 (10Jgreen) >>! In T176175#3616476, @faidon wrote: > Ah! active/backup is safer indeed, but in my experience, I've seen many more iss...
[13:30:51] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617443 (10Jgreen) a:05Jgreen>03Papaul
[14:15:36] so, pybal
[14:15:45] the list of commits in master but not in 1.13 is pretty long
[14:16:22] is it time to cut 1.14.1?
[14:16:36] changelog here https://phabricator.wikimedia.org/P6023
[14:17:29] per branching policy, bugfixes should go in 1.13, while new features in 1.14, and a new version should be cut by branching off master
[14:17:49] is there a 1.14 at all?
[14:17:53] dumb question, why not 1.14.0?
[14:17:56] things like prometheus support, BGP MED and such are clearly new features
[14:18:15] bblack: there's no 1.14 yet, hence my question of whether we want to create it :)
[14:18:25] ok, I'd call it 1.14.0 too :)
[14:18:33] ok
[14:18:49] at some point we have to bite the bullet and get past this backlog of feature changes
[14:18:55] indeed
[14:19:14] the important part isn't so much our strict adherence to a given policy, but that the risk-level of the new deployment makes sense
[14:19:43] in the current situation there's been quite some refactoring going on in master, and cherry-picking changes from there into the 1.13 bugfix branch is becoming harder and harder
[14:19:47] moving from 1.13 -> 1.14.0, there's a lot of changes wrapped up in there. How confident are you about how well-tested/reviewed they are?
[14:20:03] (and how much worse does that get if we put it off longer?) :)
[14:20:41] they should all be fairly well reviewed, but not very much tested
[14:21:19] it certainly does get worse the longer we wait, in particular when it comes to backporting bugfixes!
[14:22:21] yeah, it's all well and good to nitpick our own past behavior (we probably shouldn't have piled up so many features in master between feature releases)
[14:22:39] but present reality is what it is, and we have to keep moving things forward somehow with reasonable effort levels
[14:22:47] yes
[14:23:33] and we can always opt to do some extended testing of 1.14.0 on the backup LVSes before we upgrade the primaries
[14:23:52] and then just do 1x DC's primaries for a couple weeks for comfort before the rest (easier to depool just ulsfo if it craps out)
[14:24:23] so, yeah, time to cut 1.14 I think
[14:24:28] \o/
[14:24:31] do we have any way to test the BGP changes properly? I think I've heard the word quagga before?
[14:25:02] or bird
[14:25:20] http://bird.network.cz/
[14:25:34] (happens to be the one I thought looked interesting for anycast DNS stuff)
[14:25:41] there's a quagga instance on the pybal test boxes
[14:25:43] at least one of them
[14:25:45] oh nice
[14:25:53] i did some testing on it once
[14:26:00] or you know
[14:26:01] you can just, not
[14:26:05] because
[14:26:12] it only affects the initial send on startup
[14:37:00] ema: you ok with me pushing https://gerrit.wikimedia.org/r/#/c/376751/10 today? further reflections on anything related?
[14:38:08] bblack: no objections, we've had another steep spike a couple of hours ago, high time to merge! :)
[14:38:20] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617676 (10Cmjohnson) I checked the BIOS Settings, everything is enabled, the standard boot order is correct 1, CDROM 2. FLOPPY 3. USB 4. HARD DRIVE 5. PCI SLOT 1 ETHERNET 10Gb I...
[14:38:27] ema: on upload?
[14:38:42] bblack: text (cp1055)
[14:38:48] hmmm
[14:39:04] well, it could take some time for yesterday's revert to have full effect there I guess
[14:39:22] we only changed the keep-time on new objects entering after the revert, not the ones already there
[14:39:23] indeed, it did take a few days for the problem to show up in the first place since the keep merge
[14:43:15] I puppeted cp1071 immediately, letting the rest roll out slow, watching 1071 varnish-machine-stats to see any notable diffs
[14:45:55] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1071&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now
[14:46:00] the diffs are pretty sharp all over
[14:47:13] pybal-test2001 upgraded to pybal 1.14.0
[14:48:12] disk utilization and system load trending down, backend nukes way down, backend hitrate way up, allocator failures way down, backend expiry lock operations way down (!!), backend LRU lock ops way down,
[14:48:35] :)
[14:48:45] the ones that might be concerning in trade though: backend transient usage/allocation is spiking up a bit, backend fetch-no-body 304s are up a bit (maybe should be expected)
[14:50:02] i went ahead and puppeted 1073 + 1049 as well, as they're currently showing icinga warnings for mailbox lag
[14:50:10] wonder if this will back them out of that state
[14:50:18] exciting day!
[14:54:20] it's interesting that the backend client request rate ramped up a bit
[14:54:42] that might mean all this locking and storage churn was actually slowing down the incoming (fe and/or remote-be miss/pass) requests
[14:55:59] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1049&var-datasource=eqiad%20prometheus%2Fops
[14:56:09] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1073&var-datasource=eqiad%20prometheus%2Fops
[14:56:22] ^ so yeah, the ramping mailbox lag dropped off like a rock right after puppeting those two
[14:56:23] wow
[15:00:00] ema, bblack: see -tech
[15:00:50] volans, bblack restarting cp1055's backend
[15:01:30] ema: we got the unlucky user?
[15:01:33] :)
[15:05:03] ema: looks like 15:02 it hit the puppet change, was that manual?
[15:05:19] bblack: nope
[15:05:35] trying another manual run for the failure anyways
[15:05:56] re-running puppet now manually though as the last run failed
[15:05:57] oh someone else did
[15:06:01] :)
[15:06:28] ah :)
[15:07:39] all in all the recovery time with manual restart wasn't much different than just waiting for it to get past the drama heh
[15:07:42] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&from=1505825956657&to=1505833623169
[15:07:58] there's a small dip in our global hitrate too, but I think that's an acceptable/unavoidable consequence
[15:08:41] yup
[15:10:52] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617763 (10RobH) @bblack: So to confirm, if we disable the memory share, it won't pxe boot. If we enable it, it will pxe boot?
[15:15:08] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617766 (10BBlack) I don't know, I hadn't tried re-enabling the memory sharing stuff. All I really know is the sequence of events last week was approximately: 1. It was PXE booting...
[15:27:00] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617810 (10Papaul) a:05Papaul>03Jgreen @Jgreen Complete
[15:30:04] general resource usage went down considerably!
[15:30:48] while be hitrate reached frontend levels (~0.8)
[15:32:24] interestingly on a text node (cp1054), cpu usage/system load haven't changed much, while disk usage has decreased
[15:32:57] probably because of the different object size?
[15:33:45] as in the amount of work to do in case of irregular access patterns on text is less than what's required on upload (evictions, serving misses, ...)?
[15:34:40] right
[15:35:24] well there is a small cpu drop on cp1054 that can be seen, mostly in iowait, which makes sense with the disk usage change
[15:35:55] and rates for things like nuked objects, expiry lockops, lru lockops, all dropped on text as well
[15:37:20] allocator requests are down to ~1/4 of the previous value
[15:37:50] basically, in both clusters' cases, the effect is mostly as expected, but even more-pronounced than expected
[15:38:28] because now eqiad (and to a lesser degree codfw) backend storages are only storing the access patterns of their local frontend misses, not remote DCs (which in eqiad's case is all of them)
[15:39:24] I think this evidence points to a more-ideal way to deal with this in the future when our backends are TLS-capable (the ATS future, basically)
[15:39:54] which is probably that we have the local backends contact the applayer directly by default, making each cache DC a bit more independent of the others instead of bottlenecking things down together like we do today
[15:40:26] and in place of the "backend_warming" config we have now, we instead have the option to (rarely, when warranted) re-route remote backends through core backends as part of warming in either direction.
[15:40:41] (or perhaps even automatically when either end of those connections has low uptime or something?)
[15:40:57] (or low levels of storage fill might be a more-reliable indicator)
[15:41:06] nice, yes
[15:44:23] pontificating on other future idealistic ideas:
[15:44:41] if this and/or V5's improvements in these areas gets rid of our need for sharing upload's storage into size classes
[15:44:46] s/sharing/sharding/
[15:45:11] and we take full advantage of V5's VCL-switching on request attributes...
[15:45:55] we could even combine text+upload into a single set, at least at the backend level, where we more-dynamically reallocate resources between the two datasets
[15:46:29] e.g. with a 12-node edge site, the ideal storage split might be more like 4 nodes for text and 8 for upload, instead of 6/6. We do the 6/6 split for redundancy reasons in the face of depools and extended node outages, etc
[15:46:46] but if we have the ability to easily runtime-shift a backend from upload to text when a text node dies, it changes things a bit
[15:48:19] (or eventually putting them together even at the frontend in some way, although managing FE storage split is tricky in that case. but it would be nice for HTTP/2 clients to only make a single connection, perhaps)
[15:48:59] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617887 (10Cmjohnson) @bblack @robh I went through and re-verified all the settings, this generation does not give an option of UEFI or Legacy like the new generations. The bios is v...
[15:52:13] anyways, before this our two existing major hacks for the mailbox/503-related issues that are still in play are the upload storage splitting and the weekly backend restarts
[15:52:47] it might be worth testing (later, after we confirm stability of current stuff) whether we can drop one or both mitigations now
[15:53:41] or reduce them, anyways (monthly restarts? looser storage splitting into fewer/larger bins? even before the latest storage-split work, we always had a 2-way split between a normal and a big-object file)
[15:53:48] which would seem reasonable, given that DCs with more regular patterns such as esams/ulsfo never suffered from the issue I think
[15:54:45] at least a monthly auto-restart schedule is probably reasonable under any scenario, as there's so much potential to cleanse cruft in various data structures and whatnot
[15:56:16] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617926 (10Nuria) @Jan_Dittrich : bucketing is available as part of wikimedia events, see an example of usage as part of serach code: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvent...
[16:08:40] alright, pybal 1.14.0 running fine on pybal-test2001 now
[16:08:45] uploading it to copper
[16:08:59] s/copper/apt.w.o/ :)
[16:15:35] prometheus metrics on -test2001 look good! see http://localhost:9090/metrics
[16:33:38] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618086 (10RobH) It looks like these are different versions though: | <03:00:00> BCM57810 - EC:B1:D7:7B:C6:D8 MBA:v7.10.71 CCM:v7.10.71 | | <03:00:01> BCM57810 - EC:...
[16:38:11] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618104 (10BBlack) I think @Cmjohnson said before that they're at different revs because they're different pieces of hardware (onboard vs card), and those are the latest revs for eac...
[16:47:43] calling it a (great) day! see you tomorrow :)
[16:48:41] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618162 (10RobH) Well, my thought about the multi function mode being set incorrectly doesnt work. I make it match all the other ports and it still doesn't pxe boot on eth0. Settin...
[17:10:58] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618266 (10RobH) So, PXE isn't working now for eth0, mac address ec:b1:d7:7b:c6:d8. This is the eth0 that is also detected in the OS, so it doesn't appear to be an issue where the B...
[18:12:20] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618576 (10RobH) Ok, so there is something wrong with lvs1007 network firmware/settings. lvs1007 boot order is missing the network device option. When selecting it in the one time...
[18:58:05] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618870 (10RobH) Per @bblack's request, I've done a show config script on both lvs1007 and lvs1008 for comparison. P6026 shows both. The only difference is the virtualization is fl...
[19:05:04] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, 10Operations: Make maps active / active - https://phabricator.wikimedia.org/T162362#3618894 (10Gehel) a:03Gehel
[19:21:05] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618988 (10RobH) IRC update: @cmjohnson went ahead and reset bios settings to defaults, and after power cycling the server, it hasn't resolved the network device not showing in the...
[20:09:49] I was thinking, hitrate is `hit / (hit+miss)`, and we're now turning (a lot of?) misses into pass
[20:10:10] no surprise that the hitrate went up then
[20:10:25] yes :)
[20:10:53] also, the small drop observed in global hitrates as seen on https://grafana.wikimedia.org/dashboard/db/varnish-caching is misleading
[20:11:44] it turns out most of that in the global average was driven by just ulsfo's reaction (where there's a more-notable drop), and that in turn is because of the interaction of the new hit-or-pass with the fact that codfw (ulsfo's next backend) had been depooled of frontend traffic since almost 24h ago
[20:11:58] so, repooling codfw should fix up some of that
[20:12:48] ah!
[20:13:23] and then also, I'm now circling back around to depooling the legacy ulsfo caches that got replaced by cp4021-28, so that ulsfo-ops can move on with removing nodes and adding the rest of the new ones
[20:13:58] which incidentally will improve the FE hitrate there anyways, since the new caches have much larger FE mem than the old, and FE cache effectiveness scales with avg(memory) across all the live FEs
[20:14:40] (and a similar argument applies to the backends. the chashing is equally-weighted, but the new nodes have twice the BE storage of the old, so removing the legacy nodes from the pool is going to make things a bit better overall)
[20:15:42] there's a failed fetches spike in ulsfo right now
[20:15:52] if you limit the view on varnish-caching to just eqiad+esams, the results are saner and closer to expectation
[20:15:53] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=text
[20:16:05] hmmm
[20:16:07] yeah 4027?
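(Editor's note: a minimal sketch of the arithmetic behind the 20:09 hitrate remark above. Only the formula `hit / (hit + miss)` and the fact that hit-or-pass reclassifies many former misses as passes come from the log; the request counts below are invented for illustration.)

```python
# Hypothetical numbers only -- shows why reclassifying misses as passes
# raises hit / (hit + miss) even when the number of hits is unchanged.
def hitrate(hits, misses):
    return hits / (hits + misses)

hits, misses = 8000, 6000
print(f"before hit-or-pass: {hitrate(hits, misses):.2f}")        # ~0.57

# Suppose 4500 of those former misses are now classified as passes;
# passes appear in neither term of the ratio.
misses_after = misses - 4500
print(f"after hit-or-pass:  {hitrate(hits, misses_after):.2f}")  # ~0.84
```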
[20:16:20] yes
[20:16:36] maybe we finally hit a transit spike that mattered to the numa config, looking
[20:17:21] is that failed-fetches graph fe, be, or both?
[20:17:32] be
[20:17:43] (I think, checking)
[20:17:55] yes, be only
[20:19:01] yeah be threads skyrocketed
[20:19:08] be fetch concurrency skyrocketed, etc
[20:19:15] I wonder what the driver is
[20:19:24] did you just reroute ulsfo to codfw by chance?
[20:19:48] or was it only depooled from user traffic?
[20:19:53] no, the routing never left codfw yesterday
[20:19:56] ok
[20:20:18] but the fe user traffic was depooled, and once the hit-or-pass code hit, that means no new cache entries there
[20:20:26] (and none getting pulled through by ulsfo be anymore)
[20:21:01] 4028 seems far less effected, they should be pretty much the same here
[20:21:39] err, affected
[20:21:53] -- FetchError no backend connection
[20:22:50] the 503s continue on 4027, perhaps we should depool?
[20:23:18] well it was the depooling of two other text nodes that likely triggered (and/or the repool of codfw users?)
[20:23:35] I worry another depool might move the problem elsewhere and/or exacerbate, when this might be temporary
[20:23:45] right
[20:24:54] are we getting messed by max_connections = 1000; on ulsfo->codfw?
[20:25:00] (per cache)
[20:25:29] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4027&var-datasource=ulsfo%20prometheus%2Fops&panelId=16&fullscreen
[20:25:49] it is hitting max_connections I think. that ~8.2K value is close enough.
[20:26:05] something seems to have happened at 20:00, see eg CPU usage
[20:26:05] but it's also insane, I don't think the lack of a higher max_connections value is the actual cause
[20:26:29] (also backend connections ramped up around that time)
[20:26:36] and mbox lag
[20:26:43] :50 + :58
[20:26:52] are the two bumps in cpu when zooming in more
[20:27:23] :50 being shortly after the first text depool there, but :58 not really correlated
[20:27:46] perhaps, we're still seeing some lingering of the keep-related problems here in ulsfo as well?
[20:28:01] (interaction with the new pressure of all the depooling)
[20:28:21] the codfw repool happened give or take at 20:00 right?
[20:29:15] not that I have an explanation of why it would matter, heh
[20:29:17] yes, maybe more like 20:01 to really see it via authdns
[20:29:26] but dns repool is gradual over ~10m anyways
[20:29:45] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3619277 (10Tbayer) >>! In T135762#3617324, @Jan_Dittrich wrote: >> You can use eventlogging and wikimediaevents code at this time , there are quite >> a bit of examples of how to run ab tests on dis...
[20:29:46] it does add traffic onto the codfw backends that the ulsfo backends are moving requests through
[20:29:48] 503s have stopped on 4027
[20:30:56] right, it does add traffic to codfw but why would a single ulsfo host be affected?
[20:31:04] it's possible this is just a bad response to cache misses on some level
[20:31:40] that what should've just been a spike in ulsfo->codfw connections/fetches turned into escalating mayhem because of some limitation on sockets or throughput or max_conns or whatever
[20:32:13] single ulsfo host part, I donno. but 4028 (in a similar situation in the same cluster) shows some of the same effects as 4027 in stats, but was milder and didn't induce a 503 spike
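(Editor's note: a rough sketch of the max_connections reasoning from 20:24-20:25 and 20:33 above. The 1000-per-backend cap and the ~8.2K observed connection count come from the log; the number of codfw text backends is a hypothetical value used only to show why a per-backend cap yields an aggregate ceiling in roughly that range.)

```python
# Assumption: the ulsfo varnish backend has one backend definition per
# codfw text cache (cp2xxx), each with max_connections = 1000 (per the log).
MAX_CONNECTIONS_PER_BACKEND = 1000
ASSUMED_CODFW_TEXT_BACKENDS = 8       # hypothetical count, not stated in the log

aggregate_cap = MAX_CONNECTIONS_PER_BACKEND * ASSUMED_CODFW_TEXT_BACKENDS
observed_concurrent_conns = 8200      # the "~8.2K value" read off the graph

print(f"aggregate cap across backends: {aggregate_cap}")   # 8000
print(f"observed concurrent conns:     {observed_concurrent_conns}")
# Sitting at or near the cap means new fetches queue or fail
# ("FetchError no backend connection"), consistent with the 503 spike.
```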
[20:32:23] interesting
[20:32:48] ditto 4009
[20:32:57] (legacy host, but still same text cluster)
[20:33:17] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4009&var-datasource=ulsfo%20prometheus%2Fops&panelId=16&fullscreen&from=now-6h&to=now
[20:33:25] ^ 4009 backend connections spike, but not to the point of failure
[20:33:52] whereas the connection count there on 4027 would be failure-inducing just from the max_connections values on the backend defs to cp2xxxx (at 1k/host)
[20:34:20] oh wait I jumped the gun there, that's a frontend spike, but not a backend one like 4027
[20:34:48] 4028 is like 4009
[20:34:55] fe spike, but no be spike
[20:43:10] anyways, no more text depools are due, just uploads, but I'll wait a while longer than I was planning before those
[20:46:37] ok!
[20:48:27] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3619326 (10Dispenser) The account was registered on August 15 and blocked on August 23. According to [[https://meta.wikimedia.org/wiki/CheckUser_policy|CheckUser policy]] this...
[20:56:00] 10Traffic, 10Operations, 10Phabricator, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3619353 (10mmodell) >>! In T112765#2509512, @BBlack wrote: > There's a little bit of refactoring work (already in-progress) to do on the Varnis...
[21:02:40] bblack: cp4009's turn
[21:05:37] bblack: I'd restart varnish-be to reduce user impact
[21:08:08] done
[21:16:56] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619453 (10Tgr)
[21:18:01] ok
[21:18:26] well, hopefully this is some fallout of the lingering keep-problems and/or the depoolings
[21:18:39] in which case, whatever, we'll deal and it will go away
[21:19:19] at least cp4009 involvement rules out the numa stuff
[21:20:16] right
[21:21:25] 10Traffic, 10Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3619481 (10BBlack) Going to repool this today on the assumption it was genuinely part of T175803
[21:21:33] bblack: cp4027 again, it seems
[21:22:28] hmmmm
[21:22:53] I can repool the two other legacy nodes that are currently out?
[21:23:05] it's possible this is unrelated, but that seems like dubious thinking
[21:23:11] let's try, yes
[21:24:04] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619506 (10Tgr)
[21:26:13] it's recovering already
[21:30:33] it's odd
[21:30:40] I wonder if there's something wrong with depooling itself
[21:31:03] (the mechanisms affecting confd/pybal, and/or affecting the go-templated VCL, etc)
[21:31:33] because we've always had 6 legacy nodes there doing fine in ulsfo/text. We added 2x new beefy ones some time ago, and now trying to remove 2x of the legacy ones in exchange
[21:31:37] it should be pretty painless :P
[21:32:10] something is odd anyways, maybe something that will make more sense after sleeping on it, I donno
[21:34:58] the large volumes of backend conns, close_wait states, capping out near max_conns, etc
[21:35:12] it's like it's connecting to bad backend definitions or something crazy like that
[21:35:50] or something else is being exhausted and thus delaying/failing conns/reqs, causing a cascading effect of more conns
[21:36:08] something else being something dumb like file descriptor limits or ephemeral port limits or whatever
[21:58:28] 10netops, 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3619622 (10ayounsi) 05Open>03Resolved > The fix has been committed on version 15.1X49-D110 which currently is expected to be released on 13th of Sept 2017. Firew...
[22:18:50] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619784 (10Anomie)
[22:24:38] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619820 (10Anomie)
[22:34:26] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to svwiki - https://phabricator.wikimedia.org/T176082#3619897 (10Johan) (Removing tag I assume was here simply because it was in the parent task and automatically included; this shouldn't be in Tech...
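(Editor's note: a hedged diagnostic sketch for the 21:34-21:36 hypotheses above — piles of CLOSE_WAIT sockets, file descriptor limits, ephemeral port exhaustion. It is not WMF tooling; it only reads standard Linux /proc files to tally TCP socket states and show the ephemeral port range on a cache host.)

```python
#!/usr/bin/env python3
# Count TCP sockets per state and print the ephemeral port range, to check
# whether a connection pileup (e.g. CLOSE_WAIT) is approaching an obvious limit.
from collections import Counter

# Kernel TCP state codes as they appear (hex) in /proc/net/tcp and /proc/net/tcp6.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def socket_state_counts():
    counts = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    state_hex = line.split()[3]  # 4th column is the state
                    counts[TCP_STATES.get(state_hex, state_hex)] += 1
        except FileNotFoundError:
            pass
    return counts

if __name__ == "__main__":
    for state, n in socket_state_counts().most_common():
        print(f"{state:12s} {n}")
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        lo, hi = map(int, f.read().split())
    print(f"ephemeral port range: {lo}-{hi} "
          f"({hi - lo + 1} usable source ports per destination ip:port)")
```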