[04:16:17] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[04:22:49] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[05:36:25] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[05:48:34] Traffic, Operations, ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (BBlack) Open→Resolved Seems to be working fine after replacement!
[06:26:05] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[06:50:56] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[07:24:50] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[08:31:29] bblack, vgutierrez, cp5007 and 5006 status LED is blinking orange (instead of steady blue), any idea what that means?
[08:33:04] hum, icinga is happy and cables are set, so I guess I can still go
[08:57:51] XioNoX: IIRC that usually means that something is wrong with the hardware and you should check the System Event Log
[08:58:09] https://www.dell.com/support/manuals/us/en/04/poweredge-r430/r430ownersmanual/diagnostic-indicators-on-the-front-panel?guid=guid-7c7a87cf-08e8-43d0-8489-a73e2cd220dd&lang=en-us
[09:03:50] I'm leaving the DC but can go back tomorrow if needed
[09:04:56] XioNoX: check what dcops suggest ;)
[09:05:02] that's my 2 cents
[09:36:29] volans: yeah but DCops are all sleeping :)
[09:37:26] eheh, in the meanwhile have a look at hardware logs and see if anything stands our
[09:37:29] *out
[09:40:03] Traffic, Operations, ops-eqsin: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (ayounsi) return shipment ticket 1-185737841426 opened with Equinix, DHL should pick up the defective disk in the next few days.
[09:47:16] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi) Also audited all the spares we have onsite: https://docs.google.com/spreadsheets/d/1FKYVQJePjTQ7nVwYv4oDC6Gszk7RLrkq5ySN0fjvSoY/edit#gid=2057953856 Labelled the f...
[09:58:24] Opened https://phabricator.wikimedia.org/T216691
[09:58:28] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (ayounsi) p:Triage→High
[10:00:59] bblack: I'm keeping eqsin depooled for the LVS work (and to have a 2nd pair of eyes to check that everything looks good), but to me it can be repooled. I didn't do the failover tests, as LVS are not configured for cr2, but they can be done remotely.
[10:25:17] FYI, effect of the depooling on perf https://grafana.wikimedia.org/d/000000143/navigation-timing?refresh=5m&orgId=1&var-source=navtiming2&var-metric=responseStart&var-percentile=p50
[12:24:09] ok, I'll repool tomorrow morning Singapore time if I don't hear anything or nobody does it before me
[13:40:56] XioNoX: yeah, I can take a look this morning here and repool for now
[13:46:21] XioNoX: for the lvses, they're already connected to their top-of-rack switch I think? So it really is just software changes for which router they speak to (we can shift lvs5003 then lvs5002 to cr2 I guess)
[14:08:08] re: cp5006/7, they both have EDAC correctable events in their dmesg output from back in Dec/Jan, so they probably have matching SEL entries, etc
[14:08:27] not critical at the moment
[14:11:17] bblack: yeah LVS should only be software changes, I have a CR ready on the task
[14:11:41] I'm making tasks about cp5006/7, then will try a repool
[14:11:51] oh you have a task above
[14:12:41] eh I'll make separate ones per machine in case things go differently for each and put in data details, etc, and link them to that
[14:21:08] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (BBlack)
[14:21:18] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (BBlack)
[14:22:05] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (BBlack)
[14:22:07] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (BBlack)
[14:22:10] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (BBlack)
[14:24:08] XioNoX: we seem to have lost purges (multicast?) in eqsin, some time after the depool, and they never came back
[14:25:28] 24h view here, you can see the depool traffic drop for regular traffic, but purge still chugging along in the next one down until ~05:53 :
[14:25:33] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[14:28:18] XioNoX: so probably stall repooling on that. for that matter if we've missed several hours of purges now we should probably wipe caches once it's fixed.
[14:31:32] looks like the main transport link didn't move. the tunnel did though, but I don't think purges were going over that anyways.
[14:38:27] restarting vhtcpd on one of the text nodes didn't change anything (just in case it had something to do with some bug about subscriptions, that would maybe force a fresh subscribe)
[15:17:54] bblack: so that graph should not be flat even if the site is depooled? If so I'll investigate it tomorrow
[15:33:27] bblack: I think I fixed multicast, cr2-eqsin's PIM was not configured to use the (new) cr1-cr2 link
[15:37:39] I'm going to sleep, don't hesitate to ping/call me or leave scrollback to read
[15:56:22] XioNoX: ok thanks!
[15:56:44] XioNoX: and confirmed, multicast is showing back up again :)
[15:57:32] so I'll probably wipe the caches since we missed a large window of purge anyways, and then see how it goes with bringing the site back in a little later
[15:57:53] we're about halfway down the daily downslope in traffic rate, so I might wait a few more hours so the missrate isn't so shocky
[16:07:39] have we ever experimented with performing a cache warmup, btw?
[16:08:58] not an artificial one, no
[16:09:10] in theory, we can record varnish and replay to warm, but nobody's ever worked on it
[16:09:26] (there's probably lots of subtle issues there, though)
[16:10:11] in the distant past when warmup from a dead cold state was much more problematic, we've temporarily tweaked the geoip map by hand to only bring in a few countries to a repooled site initially to warm it up without full traffic load
[16:10:40] subtle issues in traffic engineering? :)
[16:10:58] but that's hacky and very manual and ugly
[16:11:07] yeah ofc
[16:11:30] one of the things on the radar for gdnsd (while revamping the geoip and other plugins in other ways too), is to give the geoip stuff some weighting knobs, so it can be automated
[16:12:01] e.g. bring a site back from the depooled state, but at initial X% weight and gradually ramp it in, which basically grows a geographic bubble of clients mapped to it, starting with the closest ones
[16:12:22] sure, makes a lot of sense
[16:13:13] currently it's not generally a huge problem though. we come back from truly dead-cold rarely, and can usually afford to wait for a low-ish point in the region's daily traffic cycle
[16:13:52] yeah, depooling an entire site seems like a pretty rare event. (although if/when we add a bunch more POPs we might have to get better at it)
[16:14:42] and then we have a few mitigating factors that help: varnish is pretty good about coalescing during the initial ramp-in so that we only cold fetch unique objects once, and our transport links are decent, and we're also fetching from core DCs' backend caches, so anything popular enough in eqsin that it was also being hit in codfw or eqiad, will pick up a hit there and only pummel the wan link, but
[16:14:48] not the app services.
[16:15:27] (but they do have different patterns of popular languages/projects/URIs, so that only covers some of the most globally-hot stuff)
[16:20:50] yeah, that all makes sense. and it does seem better and maybe even easier to get the gdnsd doing weighting than it does to figure out which classes of traffic it makes sense to play back
[16:20:52] thanks for continually entertaining my silly questions :)
[16:21:13] I got lost quantifying "rare"
[16:21:40] in the past 12 months, we've depooled a DC 17 times. So not as rare as we'd like :)
[16:22:10] that's more than I expected
[16:22:21] there was both a dc failover to test, and ulsfo physically moved sites as well
[16:22:51] then combine in random outage events and downtimes we do for some kind of hardware work or a planned outage problem from major transits, etc
[16:23:24] but, in almost all cases we didn't come back from cold either
[16:23:55] usually a depool is relatively-short, and we never actually lose connectivity to the core network or purges, and there's still a fair chunk of warm content when we bring it back.
[16:27:28] I don't think I have anything good to track history on for cold bringups, but I suspect they're more like ~1-2/year at worst
[16:28:34] purge loss is a separate issue on its own, and I think once we eventually have reliable purging (e.g. kafka queues), we'll have a reasonable time window over which they can be replayed.
[16:29:19] that and once we've nailed down other purge/ttl -related issues, we'll probably push on the applayer side to reduce purge volume substantially as well.
[16:30:59] Traffic, Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 (Bawolff)
[16:52:36] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) Please note that Dell support typically requires the following steps to be taken for any memory replacement: * Update bios firmware on host to latest revision ** current version is 2.9.1,...
[16:52:41] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) Please note that Dell support typically requires the following steps to be taken for any memory replacement: * Update bios firmware on host to latest revision ** current version is 2.9.1,...
[16:57:17] does anyone know if there is a good way to notify about spikes of network usage?
[16:59:55] do you mean for a given host? or...?
[17:00:14] I am creating those spikes
[17:00:35] Is there someone that would be worried I should contact in advance?
[17:00:48] do you mean on wan transport links, or local in a DC?
[17:01:16] local to a dc
[17:01:22] for transport links if we really think we might saturate them, probably nice to drop a note in here or -netops and maybe have someone keep an eye on the links in librenms
[17:01:49] local to a DC, I think it would be hard to impact more than the specific hosts you're working on by saturating your own switch ports or whatever.
[17:01:50] no, normally we don't touch wan, unless there is an emergency which you would know by other ways :-)
[17:02:10] ok, thanks
[17:02:28] context- I am testing new backups, new backups may generate a lot of internal traffic
[17:03:43] with emergency I mean something like "everything is broken"
[17:04:32] oh, I didn't know about netops, thanks for that too
[17:13:52] it doesn't see a ton of traffic since we only have 1x netopsen + whomever drops in there, but it's at least sometimes where network stuff gets coordinated :)
[17:15:09] I tried #wikimedia-networking first, that was my mistake, so I came here
[17:40:52] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) I've updated the bios to the latest revision, 2.9.1 POST shows no errors, but I'm going to wipe the SEL and run (Dell's) hardware test suite.
[17:41:03] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) ` /admin1-> racadm getsel Record: 1 Date/Time: 07/25/2018 16:19:36 Source: system Severity: Ok Description: Log cleared. ----------------------------------------------------...
[17:52:21] Traffic, Analytics, Analytics-Cluster, Operations: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783 (Milimetric) Open→Declined >>! In T56783#2688311, @BBlack wrote: > Or is this basically now an off-topic ticket going nowhere? My money's on...
[17:53:41] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) Ok, running task comment of steps taken: * updated bios * rebooted into hardware tests * POST shows no memory errors: ` Testing Memory... Testing Memory... 10% Complete Testing Memory......
[17:55:53] netops, Operations: Fix codfw x-connect 65373 - https://phabricator.wikimedia.org/T215193 (Papaul) CyrusOne Checked the reading from the fiber patch panel in A8 same readings. So they are still going to run some test out of the cage.
[18:00:26] Traffic, Operations, ops-eqsin: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (RobH) So I updated the bios on cp5007, and this happened in post: ` UEFI0107: One or more memory errors have occurred on memory slot: A1. Remove input power to the system, reseat the DIMM module...
[18:07:08] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) ` 3 $> ssh root@cp5007.mgmt.eqsin.wmnet root@cp5007.mgmt.eqsin.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 10/31/2017 14:19:03 Source: system Severity: O...
[18:11:40] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) bios update successful. I've cleared the SEL so I can launch Dell hardware testing utility.
[18:14:42] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash...
[18:14:45] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash...
[20:06:47] Traffic, ExternalGuidance, Operations, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (dr0ptp4kt) @BBlack in https://gerrit.wikimedia.org/r/490120 I checked in with @Pginer-WMF to...
[21:43:56] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) So, hardware testing completed, both the quick and in depth testing offered by the Dell utility selected during POST. However, previous SEL entries (posted above) show issues in dimm sl...
[21:45:41] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (RobH) a:ayounsi This passed all in depth Dell hardware test utilities, and issued no further errors since I cleared the log and ran hardware tests. Since our onsite time is limited, I'd rec...
[21:47:00] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (RobH) a:ayounsi Since onsite time is limited, it may be best for Arzhel to swap dimm A1 to A2, and swap dimm a5 to a4. This moves two questionable dimms to two slots that haven't reported e...