[00:00:00] Traffic, Analytics, Operations, Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (Krinkle)
[00:00:03] Traffic, Operations, SRE-swift-storage, Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (Krinkle)
[00:00:22] Traffic, Analytics, Operations, Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (Krinkle)
[00:00:25] Traffic, Operations, SRE-swift-storage, Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (Krinkle)
[06:27:55] Krinkle: we've seen some small delays introduced by DNS queries performed by ats-tls
[06:29:01] DNS has been re-enabled on ats-tls as a workaround for a bug that's already been solved
[06:29:10] so we will disable it again
[14:07:00] vgutierrez: what kind of queries was it performing though? Surely not the address to localhost varnish-fe?
[14:07:15] the PTR for localhost :(
[14:07:28] localhost being the 10.x main IP address of cpX
[14:07:41] that's why it can be safely disabled
[14:57:25] in general it's the right thing to do. It's a cache and it's expecting upstreams to be various origins with real hostnames
[14:57:36] and you have to follow dns changes for failover events, etc
[14:57:58] it's just that in this particular case, it's all traffic to the same local machine via IP address.
[14:58:32] so there's a flag to disable dns lookups for cases like this. We had it turned on for this case before, but it triggered a bug in the load balancing over the ports.
[14:59:24] now we have fixes for that, so we'll get back to the no-dns-lookups state eventually here
[20:06:26] Heya traffic folks, any blockers for me resuming my cp firmware updates?
[20:06:44] everything went without incident afaict yesterday
[20:06:59] https://phabricator.wikimedia.org/T243167
[20:09:54] i think everyone is done, vgutierrez was the last one working https://phabricator.wikimedia.org/T242093
[20:10:08] so if no one objects, i'll resume bios updates via the first link in about 30 minutes.
[20:10:38] go ahead
[20:10:54] I'll continue the reimages tomorrow EU morning
[20:10:58] thx :)
[20:11:36] sounds good :)
[20:11:42] cool thx =]
[20:12:13] so i only did eqiad ones yesterday, and i'll finish those up today and start on esams, since they went without issue
[20:12:23] now i'm intrigued to see if they crash.
[20:12:54] the last one logged to the crash task was on the 9th, i started updates on the 10th
[20:13:16] they'll all be done by tomorrow or thursday (tomorrow i go onsite in ulsfo so i may not get bios updates done)
[20:13:52] rephrase: esams/eqiad all done by thursday. i suspect we will want to expand this to all r440 cp systems at all sites shortly.
[20:16:01] esams and eqiad are the only ones that have them
[20:16:09] codfw, ulsfo, eqsin are all previous-config R430s
[20:16:37] codfw I think is (over-?)due for a refresh
[20:16:53] I think we have a recent ticket open for that, which will put it on esams-like hardware for the first time (R440)
[20:17:41] ahh, ok
[20:18:26] yeah T242044 for codfw refresh
[20:18:50] stashbot: hi
[20:18:50] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
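The ats-tls discussion above comes down to one idea: hostname upstreams need (re-)resolution so failovers are followed, but when the upstream is a literal IP (here, the cp host's own 10.x address fronting varnish-fe) there is nothing to resolve, and any forward or PTR lookups on the request path are pure added latency. A minimal sketch of that distinction follows; it is not ATS's actual code or configuration, just an illustration of why the "no DNS lookups" flag is safe in this case.

```python
import ipaddress
import socket

def resolve_upstream(upstream: str, port: int):
    """Return (ip, port) for an upstream, skipping DNS when it's already an IP literal.

    Illustrative sketch only; a real proxy (ats-tls included) handles this via its
    own host database and config flags, not application code like this.
    """
    try:
        # An IP-literal upstream (e.g. the local 10.x address of cpX) needs no
        # lookup at all, neither forward nor PTR, so skip DNS entirely.
        ipaddress.ip_address(upstream)
        return upstream, port
    except ValueError:
        pass

    # Real hostnames (normal remote origins) still get resolved, so that DNS
    # changes for failover events etc. are picked up.
    addrinfo = socket.getaddrinfo(upstream, port, proto=socket.IPPROTO_TCP)
    return addrinfo[0][4][0], port
```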
[20:56:31] Traffic, DC-Ops, Operations, ops-eqiad, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[21:07:28] bblack: FYI for all the pages, i just repooled two cp systems and depooled two
[21:07:45] cp108[45] returned to service, depooling cp108[67] for firmware update via T243167
[21:07:52] I have no idea if it is related.
[21:08:06] well, stop for now please
[21:08:20] don't depool new things if we have pages, till we understand
[21:08:22] indeed, stopped. also note cloud is experiencing issues so i doubt it's me
[21:08:31] yep, already was my plan, no worries
[21:08:49] i did that before the pages.
[21:09:09] how many have been done already?
[21:09:34] today? just 2 returned to service
[21:09:37] ok
[21:09:38] and the two going now
[21:09:41] from my paste above
[21:09:49] but it was right before the issues started
[21:09:58] the return of the 2 to service
[21:10:07] and taking down the other 2 once the first were green in icinga checks
[21:10:31] have you started anything other than a depool on the 67 ones?
[21:10:52] the 67 are already mid firmware update
[21:11:04] but i can try to stop them rebooting into the os or stop the puppet run if you like
[21:11:21] i just cannot interrupt them mid-firmware or bad things happen.
[21:11:46] they were mid firmware update before icinga started alerting =[
[21:12:04] it takes like 6-10 minutes for it to flash.
[21:12:30] bblack: Ok, they are booting back up now. I'm not touching anything, but unless you ask me to, they will reboot into the OS and run puppet.
[21:12:43] that's ok
[21:12:59] ok, i'm not touching anything else other than to admin log my returning them to service
[21:13:10] i'll also note that you are aware so folks don't immediately yell at me ;D
[21:24:07] robh: how many did you do yesterday?
[21:24:39] I guess you're working serially, so 1075-1083 was yesterday, and 108[45] was started today, and now 8[67] are about done?
[21:24:40] cp1075-cp1083
[21:24:51] well, those are done and the OS is back up
[21:24:58] right
[21:24:58] and they show green in icinga
[21:25:05] have you typed "pool" on any of these after completion?
[21:25:38] bah, no, i reread the directions and i see it
[21:25:42] and i caused an outage, fuck.
[21:25:50] i misparsed this as puppet returning them to service
[21:25:52] it's ok, at least this one's easy to understand
[21:26:39] fml, i had like a 5 year streak of no outages
[21:26:43] it happens
[21:26:43] =[
[21:27:48] So this is wholly my fault. I misparsed the directions as icinga checks being green meaning it was ok, and that the repool command was part of the initial puppet run. It isn't, and the directions don't state such; they are clear and I misread them.
[21:27:59] so i have not been repooling things post-firmware.
[21:28:21] it's ok, we can patch this up
[21:28:28] it's actually refreshing to have such a simple outage :)
[21:29:06] robh: did 108[67] both finish up now?
[21:30:42] i didn't repool them
[21:30:44] via repool
[21:30:54] bblack: agreed!
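The root cause identified above is a missing manual step: shutdown depools a host automatically, but nothing repools it afterwards, and Icinga going green does not mean the host is serving traffic again. A hypothetical sketch of the per-host cycle with the repool step made explicit follows; `run_on`, `flash_firmware`, and `wait_for_icinga_green` are illustrative stand-ins, not the real tooling, and only `pool`/`depool` correspond to commands actually mentioned in the log.

```python
import subprocess

def run_on(host: str, *cmd: str) -> None:
    """Run a command on the target host (sketch; real runs may go through other tooling)."""
    subprocess.run(["ssh", host, *cmd], check=True)

def flash_firmware(host: str) -> None:
    """Placeholder for the BIOS/iDRAC flash and reboot; it takes roughly 6-10
    minutes and must never be interrupted mid-flash."""
    raise NotImplementedError("vendor-specific, out of scope for this sketch")

def wait_for_icinga_green(host: str) -> None:
    """Placeholder: block until the host's Icinga checks are green again."""
    raise NotImplementedError("poll your monitoring here")

def update_firmware(host: str) -> None:
    # In the real setup the depool happens implicitly during shutdown, which
    # makes it easy to assume there is a matching implicit repool. There isn't.
    run_on(host, "depool")
    flash_firmware(host)
    wait_for_icinga_green(host)
    # The step that was missed in the incident: repooling is always explicit.
    run_on(host, "pool")
```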
[21:31:19] it's a tooling issue, not a user one
[21:31:44] ^^^
[21:31:46] this
[21:32:13] one of the ways in which our infrastructure is far too unforgiving
[21:32:26] right
[21:32:35] In my too-fast reading of the directions I must have brain-fuzzed myself into thinking that since it depools automatically it must repool automatically
[21:32:42] but the directions were clear on it not being the case
[21:32:53] yeah, we don't repool automatically
[21:33:11] because that can cause unpredictable messy issues too
[21:33:15] and with the next line stating icinga green, i just took what i wanted out of those two sentences and made much easier and incorrect directions =P
[21:33:37] what's missing here is probably two key things in the tooling:
[21:33:43] i assume if i had run depool on these before powering off
[21:33:45] it would have said 'nope'
[21:33:46] ?
[21:33:54] 'depool' should give at least a warning if too many nodes from that service are already depooled
[21:33:58] well
[21:34:13] the depooling is happening automatically on shutdown, so there's no real chance for the user to see what's happening there
[21:34:15] or it should kill the shutdown command if depool cannot run
[21:34:19] (it's triggered during the shutdown sequence)
[21:34:45] otherwise the immediate fix seems to be making the procedure run depool manually, so we'd see the failure to depool?
[21:34:47] bblack: sure, but we've had very similar outages on other services where that would have helped
[21:35:01] or does depool not give that warning when manually run?
[21:35:03] an icinga check for "too many servers depooled" would also be reasonable, I think
[21:35:08] pybal also has a notional minimum pooled percentage
[21:35:12] yeah
[21:35:17] but the logic on that is a known mess
[21:35:26] I'm not even 100% sure it will recover its own pool state from this correctly
[21:40:05] the pybal pool-state issues aside (we do have related alerts, but there's a lot of ??? about the pybal pool-state code....)
[21:40:37] a set of icinga alerts that just poll etcd for any cluster going below a default (or per-cluster configurable) percentage of pooled hosts would've helped here
[21:41:46] +100
[21:41:51] I have that as an AI already :)
[21:50:14] Ok, so we are repooled and back to normal? If so, I'm going to go take a walk cuz I'm still quite frustrated with myself. =P
[21:51:11] robh: don't beat yourself up, it's an easy mistake to make, and there's lots of room for our tooling to have caught this too, which it didn't
[21:51:21] very very yes
[21:51:30] robh: we are back to normal, everything is fine, take a walk, be kind to yourself
[21:51:35] i appreciate you guys being cool about it =]
[21:52:15] s/guys/folks
[22:00:21] anyways, let's leave the final pair of nodes for tomorrow, as they're probably some of the ones with the best/freshest contents right now :)
[22:01:02] hitrate recovery is easiest to see here:
[22:01:03] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&fullscreen&panelId=8&from=now-1h&to=now&var-cluster=All&var-site=eqiad&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[22:01:54] overall was ~90% before things melted, and it's back to ~83% now
[22:02:12] so I'm assuming it will continue recovering hitrate fine at this point
[22:02:26] if the miss rate was going to melt something, it would've been in the first 5-10 minutes after dns repool
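The "too many servers depooled" Icinga check proposed above could look roughly like the following Nagios-style plugin sketch. The `fetch_pool_state()` helper and its hardcoded data are assumptions standing in for however conftool/etcd state is actually read; the thresholding logic and the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL) are the point.

```python
#!/usr/bin/env python3
"""Sketch of an Icinga-style check for clusters dropping below a pooled-host
percentage. fetch_pool_state() is a hypothetical stand-in for querying
conftool/etcd; thresholds could be per-cluster configurable instead."""
import sys

WARN_PCT = 75
CRIT_PCT = 50

def fetch_pool_state() -> dict[str, dict[str, bool]]:
    """Return {cluster: {hostname: pooled?}}. In reality this would come from
    etcd/conftool; hardcoded here only so the sketch runs."""
    return {
        "cache_text_eqiad": {
            "cp1075": True, "cp1077": True, "cp1079": False, "cp1081": False,
        },
    }

def main() -> int:
    problems, worst = [], 0
    for cluster, hosts in fetch_pool_state().items():
        pooled = sum(hosts.values())
        pct = 100 * pooled / len(hosts)
        if pct < CRIT_PCT:
            problems.append(f"CRITICAL: {cluster} only {pct:.0f}% pooled ({pooled}/{len(hosts)})")
            worst = max(worst, 2)
        elif pct < WARN_PCT:
            problems.append(f"WARNING: {cluster} only {pct:.0f}% pooled ({pooled}/{len(hosts)})")
            worst = max(worst, 1)
    print("; ".join(problems) if problems else "OK: all clusters above pooled threshold")
    return worst

if __name__ == "__main__":
    sys.exit(main())
```

The same threshold could also gate `depool` itself, warning or refusing when the remaining pooled fraction of a cluster would fall below the limit, which is the other tooling gap called out in the discussion above.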