[11:38:42] funny bug of the day: https://gerrit.wikimedia.org/r/#/c/299518/
[11:43:32] hehe
[12:09:02] so we're still running with file storage instead of persistent on misc
[12:09:41] I'd say we should either revert the change (CR opened https://gerrit.wikimedia.org/r/#/c/299526/) or perhaps consider making it permanent for other v4 clusters as well
[12:09:49] bblack: ^
[12:14:08] yeah....
[12:15:30] well if we go the file route, we could also drop all our older patches to v4 I think. They're all persistent improvements.
[12:16:09] except for that nuke fail logging thing, which I think was also persistent-related (in the sense that persistent had known issues at some point in the past with nuking)
[12:18:14] I still think I'd rather push towards non-persistent, but a couple of things need to happen first:
[12:18:48] 1) Do at least a slightly more-formal writeup of the tradeoffs and expected fallout in corner cases
[12:19:25] 2) Validate how quickly we can roll through wiping a DC's persistent storage, by doing it a few times and progressively bringing the timing down until it causes pain.
[12:20:01] If we can do it comfortably over the course of, say, half a day, that probably seals the deal.
[12:20:20] lower would be better, but we could live with half a day
[12:21:16] maybe solve the current Age-rollover mystery before experimenting, too
[12:23:20] yeah I agree that the age rollover thing should be solved first
[12:23:56] thing is, I almost forgot about the s/persistent/file/ switch on misc, and started wondering why varnish was failing to start on my test instance on labs
[12:24:06] heh
[12:24:18] and the reason was that I've replaced GB with MB in the file storage size for the generic case
[12:24:28] while we overrode that with G in the misc case
[12:24:38] that's why I was running out of disk space :)
[12:27:30] in other news: the whole cache_upload vcl v4 is merged and everything looks good
[12:30:28] awesome :)
[12:55:07] are you dudes going to merge https://gerrit.wikimedia.org/r/#/c/299532/ today? no objection on my part just curious
[12:55:12] and TIL you can link to bugs in one commit
[13:05:54] chasemp: the plan is tomorrow
[13:06:14] kk
[15:30:07] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2471213 (Joe) So, by combining elukey's patch and the original patch for not...
[15:50:36] hi folks
[15:50:54] I'm looking into upgrading cr1/cr2-eqiad in the next weeks
[15:51:23] while we do that, it's probably safer/better to move as much traffic as possible out of eqiad
[15:52:15] as we'll be running with a reduced amount of transits and reduced redundancy
[15:53:30] so setting eqiad frontends to down in gdnsd is what I'm thinking for now -- would that overlap with any other work you're currently doing?
[15:55:24] probably, but we always have overlapping things. It's doable :)
[15:56:08] The thing that worries me most is the "weeks" part - for that whole window we're in an increased-risk window because it will be harder to tolerate e.g. an esams outage.
[15:56:17] anything we can do to reduce the window would be awesome.
[16:04:13] it won't last for weeks
[16:04:50] it will start (and end) somewhere in the next couple of weeks or so, hopefully :)
[16:05:19] we'll need to upgrade and reboot cr2, then upgrade and reboot cr1, then install new cards (and, unfortunately, reboot) cr2, then install new cards and reboot cr1
[16:05:41] each of these is going to take, say, 1-2hrs
[16:05:53] so 4x2hrs total? something like that
[16:06:08] Traffic, Operations, Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2471349 (BBlack) Open>Resolved a:BBlack
[16:06:54] the cr1 ones are blocked on the level3 wave provisioning, because cr2 currently holds the only link to knams/esams :)
[16:07:03] (and I don't want to complicate this further by having to drain esams in the middle of all this)
[16:10:37] ok sounds good :)
[16:10:46] I'll prepare a plan and warn in advance
[16:11:08] this is the pre-advance warning :)
[16:59:06] Traffic, Operations, Release-Engineering-Team, Security: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (demon)
[23:23:56] Traffic, domains, Operations, Project-Admins: create a project for tasks related to WMF domain names - https://phabricator.wikimedia.org/T87465#2473520 (Danny_B)
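
Note on the 16:05 estimate: a rough sketch of the router maintenance window arithmetic discussed above, assuming the four steps run strictly one after another and each takes the quoted "say, 1-2hrs". The step labels, constant names, and the `window_estimate` helper are illustrative only and do not come from any actual maintenance plan.

```python
# Steps as listed in the 16:05:19 message; durations are the "say, 1-2hrs"
# guess from 16:05:41. All names here are hypothetical, for illustration.
STEPS = [
    "upgrade and reboot cr2",
    "upgrade and reboot cr1",
    "install new cards and reboot cr2",
    "install new cards and reboot cr1",
]
MIN_HOURS_PER_STEP = 1
MAX_HOURS_PER_STEP = 2


def window_estimate(steps, low, high):
    """Return (best_case, worst_case) total hours if the steps run sequentially."""
    return len(steps) * low, len(steps) * high


best, worst = window_estimate(STEPS, MIN_HOURS_PER_STEP, MAX_HOURS_PER_STEP)
print(f"{len(STEPS)} steps: {best}-{worst} hours of reduced redundancy")
```

The upper bound corresponds to the "4x2hrs" figure in the log. It only counts hands-on time and ignores the wait for the level3 wave provisioning that blocks the cr1 steps.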