[11:54:28] 10netops, 10Monitoring, 06Operations: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#3231238 (10ayounsi) a:03faidon Re-assigning to @faidon as he has a check ready.
[12:26:22] _joe_: I don't see any recent cron email from root@cp, do you have a failure example?
[12:33:01] oh it's literally spam :)
[13:08:56] _joe_: I see that the restarts happened regardless of confctl errors, I take it that confctl exits with 0 in such cases?
[13:09:51] the restart scripts run with set -e, they should have stopped when trying to depool otherwise
[13:13:33] also, the issue seems fixed now, I've tried de/repooling a maps backend by hand and it worked fine
[13:30:34] <_joe_> ema: i fixed it yesterday
[13:30:45] <_joe_> tonight, actually
[14:08:33] 10Traffic, 06Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3231482 (10BBlack) Yeah we should discuss our options a bit here re: minimizing ulsfo downtime, I think we have a few options for how we arrange this. There are some complicating factors with the misc...
[14:11:44] ema: re TCP BBR eval now that we're past 4.9, we should get some solid post-switchback stats for comparison first. All things considered I think that puts us turning it on Monday May 22.
[14:12:20] (and sure we can do a little pre-evaluation on one minor cluster node somewhere to verify it doesn't wreak havoc, but the bottom line is the only real test is going to be navtiming impacts on the real world)
[14:12:51] yeah
[14:13:14] we still have to upgrade the load balancers to 4.9 BTW
[14:13:26] I don't think they impact the BBR thing, but yes in general
[14:13:30] dns too :)
[14:14:00] I'd do stretch for some of them :)
[14:14:07] for some LVSes?
[14:14:14] or DNSes?
[14:15:16] we're due for reinstall of lvs1007-12 once the cabling and switch config looks sane
[14:15:34] (really we can start looking at that anytime after today's MW switch is done I think)
[14:21:52] 10Traffic, 06Operations, 06Performance-Team: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3231583 (10BBlack) @Gilles - FYI the kernel upgrades that were blocking this are done, and we're tentatively looking at turning on BBR on May 22, so that we have a wee...
[14:22:28] either, although LVSes means we'd have to build pybal packages for both jessie and stretch (unless you upgrade all of them)
[14:22:48] we should probably start with recdns
[14:22:56] but yeah, let's go with whatever makes sense
[14:23:00] as they're using jessie-backports to get a stretch-like powerdns recursor now anyways
[14:23:14] and we're going for a stretch kernel
[14:23:17] (4.9)
[14:23:18] right
[14:51:30] 10Traffic, 10ChangeProp, 10ORES, 06Operations, 10Scoring-platform-team-Backlog: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231630 (10Ladsgroup)
[14:54:06] 10Traffic, 10ChangeProp, 10ORES, 06Operations, 10Scoring-platform-team-Backlog: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231644 (10Joe) Yes, my only doubt with this proposal is exactly that we want to be active/active but to being able to serve all the t...
[15:11:02] 10Traffic, 10ChangeProp, 10ORES, 06Operations, 10Scoring-platform-team-Backlog: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231630 (10BBlack) To re-iterate what @Joe is saying a little differently: the point of cross-dc active/active (which is a goal for al...
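
On the confctl/set -e exchange near the top of this log: a minimal sketch of the depool/restart/repool pattern being debugged, assuming conftool's `confctl select ... set/pooled=no|yes` syntax; the hostname, service name, and drain sleep are illustrative, and this is not the actual restart tooling. The point is that `set -e` only aborts when a command exits non-zero, so a depool failure that confctl reports but still exits 0 on would let the restart proceed anyway, which matches the behavior described above.

    #!/bin/bash
    # Sketch only, not the real restart script: depool, restart, repool a
    # varnish backend via conftool. With "set -e" the script stops at the
    # first command that exits non-zero, so a failed depool should block
    # the restart -- unless confctl prints an error but still exits 0.
    set -e

    host="cp4021.ulsfo.wmnet"    # illustrative hostname
    svc="varnish-be"             # illustrative conftool service name

    confctl select "name=${host},service=${svc}" set/pooled=no
    sleep 30                     # illustrative drain time before restarting
    systemctl restart varnish.service
    confctl select "name=${host},service=${svc}" set/pooled=yes
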
[15:30:01] bblack: yo, im going to leave for ulsfo shortly but we can discuss before =]
[15:30:16] my plan today was just to unbox and verify
[15:30:21] and put in top of rack without wiring them
[15:30:45] I didn't think today was ideal for depooling the caching site entirely, and it'll take a bit to actually depool and wipe
[15:30:59] the only time limit right now is moving the servers out of storage
[15:31:02] robh: basically what I was wondering about power (we can't turn on all old+new at once) - if we decom 8x of the old ones first (without downtime), leaving 12x old + 12x new in play, can all of that come up at once?
[15:31:17] well, then i cannot wipe the disks of the ones we decom
[15:31:28] since we're unplugging them immediately
[15:31:29] yeah but we can do those first, without the new ones
[15:31:35] what I mean is like:
[15:31:38] ahh, so decom 8 old today
[15:31:42] wipe them over 24h
[15:31:44] 1) Step 1 - decom 8 and wipe them (no downtime)
[15:31:46] and then the new ones can take their places
[15:31:51] yeah
[15:31:55] 2) Step 2 - can we power on 12x new while 12x old still running?
[15:32:06] we have next to no available power
[15:32:09] so no to 2.
[15:32:11] ok
[15:32:23] we cannot power on anything new without removing old things, there is room for like 1 or 2 systems maybe
[15:32:31] but i plan to use the 1 system overhead to power on and test them one by one =]
[15:32:43] ok yeah so for today just verify they look sane and set bios stuff I guess?
[15:32:58] and i'll also try to get mgmt stuff ready to go on each one as well
[15:33:02] at the end, if you can leave 1x of them alive (e.g. cp4021), can use it to get some software config validation done, too
[15:33:15] i can aim for that yep
[15:33:20] ok thanks
[15:33:21] so you have a test system =]
[15:33:39] cool
[15:34:11] maybe we can split up by-cluster too after the initial 8x decom
[15:34:19] (later)
[15:34:29] so the sequence would be like:
[15:34:40] 1) Decom+wipe 8x machines for good (no new ones come online yet, no downtime)
[15:35:02] 2) Bring up 6x of the new machines, make the software transition for cache_text, then decom and wipe 6x old ones
[15:35:16] 3) Bring up remaining 6x of the new machines, make the software transition for cache_upload, then decom and wipe 6x remaining old ones
[15:35:35] that seems feasible to me with our power overhead.
[15:35:57] ok
[15:36:10] I'm going to label all of them with asset tags, but not bother with their cp sequence numbers until we go to move them into service
[15:36:14] other than cp4021
[15:36:28] that way can keep sequencing sane for our use when we move them into service =]
[15:36:31] I was gonna say on naming, it's probably easiest if they alternate racks
[15:36:54] can do whatever you think will result in the most sane result set
[15:36:54] rackA: 21, 23, 25, 27, 29, 31 rackB: 22, 24, 26, 28, 30, 32
[15:37:10] that's different than everywhere else but we can do if you want
[15:37:15] any reason?
[15:37:26] everywhere else tends to keep the sequence in rack whenever possible.
[15:37:32] then when we split the clusters over the racks, it works out like text: 21-26, upload: 27-32
[15:37:41] ah
[15:37:47] it doesn't really matter either way, it's just convenient
[15:37:49] yeah can do
[15:38:10] that'd be difficult for our larger deployments, but for 2 rack deployments should be feasible
[15:38:24] wilco =]
[15:38:41] yeah I still need to think about how this new scheme works in core dcs with 4x rows
[15:39:01] 6x nodes doesn't map well to them, maybe we go 8 and give them more breathing room in a core DC anyways
[15:39:58] (so it'd be 16 total nodes there instead of 12, 4 per row)
[15:40:45] back on ulsfo numbering: I think I've brought up before wanting to do the exact opposite of course
[15:40:55] (which was to encode the row-number into the NNNN)
[15:41:03] (for clusters like these)
[15:41:48] a scheme like that would leave ulsfo as: rackA: 21-26, rackB: 30-35 ? (the off-by-one sucks since cp4020 already exists)
[15:42:06] and that system doesn't scale as we replace nodes either :)
[15:42:32] you could also do 4100-5 + 4200-5
[15:42:51] but we might have a lot of built-in assumptions about the second digit being zero (in code or in our fingers as we type things)
[15:47:29] ema: so re: maps+misc and cluster-merging and ulsfo
[15:47:50] misc, we're not going to be ready for that anytime soon I think. We can just decom ulsfo-misc, and change the GeoIP resolution for it to codfw's IP @ ulsfo
[15:48:24] maps, I think I can whip up a pretty quick set of changes to migrate it to the upload cluster. neither's VCL is very complicated.
[15:49:13] this also treads into the storage stuff
[15:49:39] I had talked about re-running 60d stats to get a little better modern picture of the storage binning (and making it wipe itself on restart to make changes easier)
[15:50:10] should include maps in the stats, which will raise bin0/1/2 vs 3/4 a bit probably
[15:50:44] (in total object view anyways, probably minimal impact on request-rate stats since maps is so small on reqs)
[16:02:14] bblack: nice!
[16:03:07] re: BBR, instead, we're not currently monitoring any interesting TCP metric. I was now thinking of adding `ss -i` to prometheus (rtt, but also the other metrics might be useful)
[16:03:53] perhaps the per-port average?
[16:28:11] well we do have node_netstat_TcpExt and node_sockstat_TCP metrics in prometheus right now so it's not really fair to say we're not monitoring any TCP metric :)
[16:28:48] but I mean we probably want to focus more on RTT for BBR evaluation
[16:48:06] 10Traffic, 06Operations, 06Performance-Team: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3232221 (10ema)
[16:50:35] (loss rate and end-to-end metrics too of course!)
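
A minimal sketch of the `ss -i`-to-prometheus idea discussed just above, assuming the node_exporter textfile collector is available; the output path, metric name, and the choice of port 443 are all illustrative. It exports the average smoothed RTT that `ss -ti` reports for established connections:

    #!/bin/bash
    # Sketch: publish the mean smoothed RTT (ms) of established TCP
    # connections with local port 443 as a gauge for node_exporter's
    # textfile collector. Path and metric name are illustrative.
    set -e
    out="/var/lib/prometheus/node.d/tcp_rtt.prom"

    ss -nti state established '( sport = :443 )' | awk '
      {
        # "ss -ti" info lines contain a field like rtt:<srtt>/<rttvar> (ms)
        for (i = 1; i <= NF; i++)
          if ($i ~ /^rtt:/) {
            split(substr($i, 5), a, "/")
            sum += a[1]; n++
          }
      }
      END {
        if (n > 0)
          printf "tcp_established_avg_rtt_milliseconds %g\n", sum / n
      }
    ' > "${out}.tmp" && mv "${out}.tmp" "${out}"

The per-port average suggested above would just be this repeated with different sport filters and a port label on the metric.
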
[16:50:38] it might be interesting
[16:50:50] but really we already have the overall metrics that those will influence
[16:50:57] which is navtiming
[16:51:15] right
[16:51:52] I was thinking of per-machine metrics so that we can compare two hosts in the same cluster (w/ and w/o BBR)
[16:51:54] https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&panelId=10&fullscreen&orgId=1&var-metric=loadEventEnd
[16:52:10] that sort of thing
[16:53:08] mostly I tend to think we compare with/without over the time domain, but we could try a half-cluster first or something, too
[16:53:20] but rolling it out slowly also blurs the metric distinction, so it's all a tradeoff
[16:53:30] (the global metrics I mean)
[16:54:36] I donno, we could probably come up with a rational justification for any course of action here
[16:55:25] I tend to lean toward "test just one live higher-traffic host to see if obvious breakage, then flip the sysctl switch for everything and watch navtiming before/after the event vs a week prior"
[16:56:13] but I can see also "flip the switch for half the machines in all the clusters and do some kind of with/without comparison between sets of hosts"
[16:56:50] but until we flip the switch for the rest of the hosts, navtiming is a blur of both (including a single client getting both types for upload+text separately)
[16:59:31] while I'm clicking around navtiming: something interesting seems to have happened in Australia at 16:25
[16:59:37] https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-geolocation?var-metric=loadEventEnd&orgId=1&from=now-3h&to=now
[17:01:55] bblack: another option is a mix of both: we could add avgrtt monitoring, flip the switch on one high traffic machine only, keep an eye on it for a couple of days and then flip the big switch :)
[17:31:43] yeah
[17:34:57] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3232427 (10Smalyshev) 05Open>03Resolved a:03Smalyshev I think everything is fine, we can close this?
[18:00:46] !log restart cp2005 backend (lag)
[18:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:14] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3232646 (10Dzahn) we did not make an RT user, but RT is still used for maint-announce mails which affect networking. we should probably create one (and/or switch...
[19:13:18] 10Traffic, 06Operations, 10ops-ulsfo, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3232726 (10RobH) IRC update: We'll split the numbering of odd and even in the odd and even racks. so rack 1.23 has cp4021, 1.22 has cp4022
| rack | hostname |
| 1.23 | cp20...
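
For reference, the "sysctl switch" for BBR discussed earlier in this stretch amounts to changing the congestion control algorithm; a minimal sketch of flipping it on a single 4.9 host, assuming an /etc/sysctl.d drop-in (the file name is illustrative). On 4.9, BBR relies on the fq qdisc for packet pacing, so the two settings go together:

    # Sketch: enable BBR on one kernel-4.9 host; the drop-in name is illustrative.
    cat <<'EOF' > /etc/sysctl.d/70-tcp-bbr.conf
    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    EOF
    sysctl -p /etc/sysctl.d/70-tcp-bbr.conf

    # Confirm what's active (this also autoloads the tcp_bbr module if needed):
    sysctl net.ipv4.tcp_congestion_control

Note that the default_qdisc setting only applies to qdiscs created after it is set, so existing interfaces need their root qdisc replaced (or a reboot) before new connections are actually paced by fq.
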
[19:33:56] 10Traffic, 10DBA, 06Operations, 06Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3232826 (10Gilles) p:05Triage>03Low
[19:34:17] 10Traffic, 10DBA, 06Operations, 06Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224448 (10Gilles) a:03aaron
[19:36:52] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3232843 (10Dzahn) 05Resolved>03Open
[19:37:33] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn)
[19:37:59] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) p:05Triage>03Low
[19:38:02] 10Traffic, 06Operations, 05Multiple-active-datacenters: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#3232863 (10Krinkle)
[19:38:08] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) a:05BBlack>03None
[19:59:00] 10Traffic, 06Operations, 07Availability (Multiple-active-datacenters): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#3233024 (10Krinkle)
[20:11:03] OMG jenkins and the stupid arrows
[20:25:16] bblack: I've prepped https://gerrit.wikimedia.org/r/#/c/351707/ for BBR, tomorrow I was planning to: 1) stop the expiry thread prio experiment on cp2024 2) finish the 4.1.6 upgrades (text and upload) 3) merge the last part of the ttl cap/keep swap https://gerrit.wikimedia.org/r/#/c/343845/
[20:25:55] let me know if you disagree with any of the above :)
[20:31:13] ema: here you go https://gerrit.wikimedia.org/r/#/c/351225/
[20:59:13] ema: sounds ok to me
[21:04:20] ema: also I have a patch pending to put maps through upload at https://gerrit.wikimedia.org/r/#/c/351663/ , and I'm already re-running oxygen stats on fresher data from maps+upload
[21:05:04] and this if it seems ok to you: https://gerrit.wikimedia.org/r/#/c/351684/
[21:37:56] 10Traffic, 06Operations, 10ops-ulsfo, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233451 (10RobH) cp4021-cp4032 have been racked, but ONLY cp4021 is accessible to mgmt and network. There isnt enough power overhead in the racks to wire up all the new systems...
[21:38:20] bblack: ^ cp4021 is all set for testing
[21:38:26] the rest all check out with the right hardware
[21:38:40] but i havent wired them up permanently, cp4021 isnt permanent either but good enogh for now
[21:38:43] typos =P
[21:38:49] permanent, enough
[21:38:58] bleh
[22:41:08] robh: thanks
[22:55:07] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3233723 (10BBlack) I've re-run the binning analysis, with a few minor changes:
1. Fresher data (61 days ending 2017-05-02)
2. Included maps.wikimedia.org data a...
[23:15:54] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3233742 (10BBlack) Reformatting this a bit for comparison, and using the "new" binning (which splits 0-1K from 1K-16K): **Storage Size Percentages** (how much...
[23:50:05] robh: I'm looking at switch config now... seems like you deleted a description for cp4012 to add cp4022?
[23:51:19] robh: (does that mean 4012 is hooked up to some other unlabeled port?)
[23:56:11] anyways, digging around with bios and interfaces for pxe, etc
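
Back on the storage binning analysis in the two T145661 comments above: a minimal sketch of that kind of size-binning, assuming a plain list of object sizes in bytes on stdin. Only the 0-1K / 1K-16K split is taken from the comment; the larger boundaries here are purely illustrative, and the two percentages correspond to the "how many objects" vs. "how much storage" views being compared:

    # Sketch: bin object sizes (bytes, one per line on stdin) and report each
    # bin's share of total objects and of total storage. Only the 1K and 16K
    # boundaries come from the task comment; the rest are illustrative.
    awk '
      {
        if      ($1 < 1024)    bin = "0-1K"
        else if ($1 < 16384)   bin = "1K-16K"
        else if ($1 < 262144)  bin = "16K-256K"   # illustrative boundary
        else                   bin = "256K+"      # illustrative boundary
        count[bin]++;  bytes[bin] += $1
        total_count++; total_bytes += $1
      }
      END {
        for (b in count)
          printf "%-9s %6.2f%% of objects  %6.2f%% of storage\n",
                 b, 100 * count[b] / total_count, 100 * bytes[b] / total_bytes
      }
    '

Fed with response sizes pulled from request logs, this produces the same two-view comparison (object share vs. storage share per bin) that the task comment tabulates.
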