[01:44:13] 10Wikimedia-Apache-configuration, 10HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#3623467 (10MaxSem) 05Open>03Resolved Yesterday's stats: ``` > select count(*) from webrequest where webrequest_source='text' and year=2017 and month=9 and day=19 and uri_p... [03:35:53] 10Traffic, 10Operations: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623529 (10BBlack) [03:43:10] 10Traffic, 10Operations: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623545 (10BBlack) Actually, seeing the same on several cp402x. Depooling ulsfo, maybe switch issue? [04:05:22] 10Traffic, 10Operations, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623552 (10BBlack) So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at about the same time. cp4021 was the lone exception. cp402[78] in text@... [04:06:45] 10Traffic, 10Operations, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623553 (10BBlack) asw-ulsfo has some other alerts going on, aside from the expected link loss to various flapping or supposedly-down hosts, e.g.: ``` Sep 21 03:57:51 asw-ulsfo alarmd[1458... [04:13:24] 10Traffic, 10Operations, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623555 (10BBlack) ... and now we've lost the cr1-eqiad <-> cr1-codfw link ... ? ``` cr1-eqiad xe-4/2/0: down -> Core: cr1-codfw:xe-5/2/1 ``` [04:32:04] 10Traffic, 10Operations, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623559 (10BBlack) Recoveries of whatever the hell is happening in ulsfo: ``` 04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms 0... [04:34:36] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623560 (10BBlack) [07:06:08] 10Traffic, 10Operations, 10Pybal: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623639 (10ema) [07:06:21] 10Traffic, 10Operations, 10Pybal: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623652 (10ema) p:05Triage>03High [07:06:49] 10Traffic, 10Operations, 10Pybal: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623639 (10ema) [07:36:42] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623695 (10ema) >>! In T176386#3623552, @BBlack wrote: > Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the n... [07:37:42] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623696 (10ema) p:05Triage>03High [08:29:35] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623763 (10ema) >>! In T176386#3623552, @BBlack wrote: > So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at ab... [08:52:44] text@ulsfo repooled [08:54:00] so [08:54:02] I'm just catching up [08:54:12] anything that I can help with? [08:54:30] paravoid: hey! 
all details here https://phabricator.wikimedia.org/T176386 [08:55:00] yeah I started reading that [08:55:05] I've now repooled text as it's entirely unaffected by the issues described there [08:55:56] the asw-ulsfo PEM 1 + ripe-atlas down issue is UL losing one of the two PDUs [08:56:39] and cp4007 was probably caught in that, probably one of its two PSUs is broken, so it works only with the other one (and died when power failed) [08:57:08] ok [08:57:58] now the interesting part is that all the kernel oops happened at the same time, way before the traffic increase [08:58:42] so one theory could be that the issues were not directly traffic induced, but perhaps caused clients to retry by establishing new connections? [09:01:59] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623529 (10faidon) >>! In T176386#3623559, @BBlack wrote: > Recoveries of whatever the hell is happening in ulsfo: > ``` > 04:26 <+icinga-wm> RECOVERY - Junip... [09:02:39] my theory is that someone went to our rack and moved cables [09:02:48] oh [09:02:50] we got the emails for the PDU failing [09:03:01] Dear WikimediaFoundation, [09:03:01] This is to inform you that your PDUb in cab 1.22 at 200 Paul lost power. We are at your cab and trying to restore the power and will [09:03:04] update you as soon as we do. [09:03:15] "We have restored the power in your cab 1.22 b-side and all of your devises that were connected to that side up and running." [09:03:47] I'll paste them in the task [09:03:53] thank you [09:04:33] that happpened at 03:57 UTC, though? [09:05:18] kernel oops on cp4022,23,25 and 26 at ~03:18, traffic rampup at 03:45ish [09:06:40] so we might well be talking about two different things [09:08:57] pdub-122.sf8 first flapped around 20:56 UTC yesterday for 2 minutes, then down again at 4:03 this morning with the ack-email 2 minutes later if I'm reading correctly the emails [09:09:11] 10Traffic, 10Operations, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623810 (10faidon) Confirmed from UnitedLayer email: ``` Assad Kermanshahi, Sep 20, 21:13 PDT Dear WikimediaFoundation, This is to inform you that your PDUb... [09:10:24] cp4022 has [09:10:24] Sep 21 03:32:25 cp4022 kernel: [1776014.452396] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [09:10:28] Sep 21 03:33:22 cp4022 kernel: [1776071.601750] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [09:10:33] etc. [09:10:47] suggests that it was losing link [09:15:37] paravoid: at 03:33 the system wasn't feeling all that well https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1505957138781&to=1505971639249&var-server=cp4022&var-datasource=ulsfo%20prometheus%2Fops [09:16:58] but yeah the first 'link is up' message on cp4022 was actually earlier, right after the oops [09:17:07] [Thu Sep 21 03:17:56 2017] bnx2x 0000:05:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [09:18:21] can't correlate this graph with atop's [09:19:30] wait 299% [09:19:36] says atop at the 03:40 mark [09:19:40] but not grafana/prom [09:20:08] atop shows prometheus-node-exporter at 34%/23% CPU too though? 
[09:23:13] so the prometheus CPU graph is taken from: [09:23:17] sum by (mode) (irate(node_cpu{mode!="idle",instance=~"$server:.*"}[5m])) / scalar(count(node_cpu{mode="idle",instance=~"$server:.*"})) [09:23:53] heh ok [09:24:21] atop's 299% is also across all processors (so it also says idle 4144%) [09:24:23] it's divided by CPU count, is that where the difference comes from? [09:24:43] 299/48*100 = 6.22%, not too far off grafana's [09:24:48] ok [09:25:26] [ 299/(48*100) of course :) ] [09:25:31] anyhow [09:25:41] looks like something wonky with the connectivity there [09:26:33] yeah, oops at 03:17:55, link flap at 03:17:56 [09:27:04] (why does it only say link is up BTW? shouldn't we get a message for link is down too?) [10:42:42] 10Wikimedia-Apache-configuration, 10Wikidata, 10Wikimedia-Site-requests, 10Patch-For-Review, and 3 others: wikidata.org/entity/Q12345 should do content negotiation immediately, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first - https://phabricator.wikimedia.org/T119536#1828684 (10... [10:45:16] 10Wikimedia-Apache-configuration, 10Wikidata, 10Wikimedia-Site-requests, 10Patch-For-Review, and 4 others: wikidata.org/entity/Q12345 should do content negotiation immediately, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first - https://phabricator.wikimedia.org/T119536#3623924 (10... [10:54:23] bblack: https://gerrit.wikimedia.org/r/#/c/379512/, maybe switching "pass-only" mode helps varnish-be when it starts going nuts? [10:54:42] in any case, it seems interesting to give a try to the exp admission policy :) [10:57:14] as in it's not really pass only, there's still a probability > 0 for objects of say, size 5 to be admitted, but it's 0.006 [11:00:49] (if you set cache::admission_param to 1, that is) [11:05:06] noop, except where cache::admission_param is set (cp1008) https://puppet-compiler.wmflabs.org/compiler02/7961/ [13:39:16] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, and 2 others: Make maps active / active - https://phabricator.wikimedia.org/T162362#3624370 (10Gehel) We are ready to make maps active / active. Patch https://gerrit.wikimedia.org/r/#/c/379530/ is ready to be merged, but I'll let the traffic team (@ema / @BBla... [13:40:14] ema: ^ we are ready to make maps active / active. I'll let you do the merge (you understand the deployment process there much better than I do). But I can be available when you do it, just in case... [13:49:22] oh? [13:49:36] I thought I was just reading some ticket updates recently saying it looked unlikely to happen! :) [13:57:12] so the tiles are consistent now, etc? [13:57:15] gehel: ^ [13:57:48] bblack: yep, we are happy enough with the current situation [13:57:55] ok :) [13:58:30] there might be minor inconsistencies if a user gets tiles from both clusters, but minor enough to not matter [13:59:07] well, most of the time they shouldn't. but there will be corner cases where a user is bouncy between the two sides on some timescale [13:59:30] yeah, so minor inconsistencies for a minority of users, this shoudl be good enough [13:59:49] (could be an hours timescale all the time as they drive around their home city which sits on a US state border. could be a days timescale for travelers with laptops and browser caching of tiles?) [14:00:36] I need to run get my son in a few minutes... 
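(Editor's note, a quick check of the admission numbers quoted at 10:57-11:00 above, assuming the policy under discussion is the exponential size-based admission probability p(size) = exp(-size / admission_param): with cache::admission_param set to 1, an object of size 5 is admitted with p = exp(-5/1) ≈ 0.0067, consistent with the ~0.006 figure mentioned, and anything much larger is almost never admitted. That is why c = 1 behaves close to, but not strictly as, pass-only.)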
[14:00:40] (the worst case, which is in fact known to happen, is some ISPs recdns sucks and proxies the requests n behalf of that user out of distinct exit points hundreds of miles apart, resulting in that user's traffic switching geodns DCs every 10 minutes or so) [14:01:23] the inconsistencies we expect is something like labels not appearing correctly if split over 2 tiles in some cases [14:01:48] so you can semi-rely on the geo split of users for optimization purposes, but there will always be corner cases of users that flip-flop on some short-ish timescale [14:02:04] we really don't even have any great numbers on how prevalent that is for our particular traffic in 2017 [14:02:30] I only guess at it from gathering data on what things looked back, back in 2010 or so with a different set of DCs at a different company [14:02:34] our assessment, is that even if we do have users getting tiles from multiple servers, it's only a minor issue [14:02:39] ok [14:09:01] ema: the exp(-size/c) code existing at all is awesome :) [14:09:39] bblack: shamlessly copied from Danielsberger [14:09:41] :) [14:09:42] but: [14:10:24] 1) The intent was to use that for the frontend with the size-scaling, not the backend. with hit-or-pass we shouldn't have a real need for a pass-knob on the be anyways? [14:10:48] 2) On the FE, it has to replace the existing hacks like 4-hit-wonder and the absolute size limit on upload-fe, etc [14:11:06] 3) On text, lots of objects have no content-length at all, sadly [14:12:53] I think there was a (4) last time I looked at the ticket, but I've lost it or it's too early or something [14:13:11] basically, there's a lot of tricky work around the edges of the exp(-size/c) idea [14:13:16] sure [14:13:47] lack of CL is in general a problem [14:13:57] so many things get easier in caching and proxying if every object has a CL [14:14:15] but on the other hand, chunked streaming transfers are more efficient throughout the stack, which is why we don't demand it [14:14:21] I was thinking yesterday of a way to tackle 503s due to -sfile (to simplify), and I thought that it could be useful to have an emergency knob to turn the backend into "pass mode" so to say [14:14:41] then I've started thinking of probabilistic pass mode [14:14:46] technically, nothing prevents sending a CL ahead of a chunked+streamed response [14:14:57] and daniel's work came to mind [14:15:12] but usually the whole reason the layer beneath is chunked+streamed is because it has no idea, it's generating output dynamically as it goes or whatever [14:16:32] I know at one point I had a theory that if we turned off do_stream at the backend-most caches, it would force varnish there to slurp up the whole thing locally before forwarding, and Varnish would/could tack on the CL, and then we can stream the rest of the way up the stack back to the user [14:17:22] (before varnish4 most of our traffic didn't use do_stream at any layer, so probably the latency hit of dc-local non-streaming fetch from backendmost to applayer is acceptable cost) [14:18:01] we even have some code based on that theory, but only in cache_misc I think, so who knows if it actually worked as intended [14:18:42] yeah the stuff in modules/varnish/templates/misc-backend.inc.vcl.erb [14:20:25] anyways [14:20:29] ------- [14:20:45] the ulsfo mystery last night is still a mystery [14:21:16] even if it was caused by DDoS on port 443 [14:21:23] yep, I've left upload depooled without rebooting the hosts so that we could discuss a course of action [14:21:29] my bet is that 
someone physically fiddled with our equipment [14:21:52] paravoid: robh was at our DC and left the site like 2h before this started [14:21:56] I can't believe it's all one big coincidence all in 30 minutes [14:22:00] it was most likely not caused by a DDoS, traffic went up way after the hosts started misbehaving [14:22:19] ema: right, the traffic increase could've just been the effect of failure retries at various levels, etc [14:22:21] yeah, and UL's PDU failed 30 mins after the oops [14:22:48] unless it was misbehaving before it failed and this affected the servers in some bizzare way [14:23:09] and I guess, we can theorize the oops happening on cache_upload but not _text is just that upload has much higher bandwidth to start with? [14:23:34] that's my theory yes [14:23:44] but it's just a wild theory, can't substantiate that [14:23:59] I'm also not really sure about what the oops really meant. it sort of seems like it could be the case that the NIC was timing out throttling its own tx queues [14:24:16] one second after the oops there's a link flap too [14:24:17] which it might do if link was blipping or there was ethernet-level pausing/pacing happening? I donno [14:24:44] yeah, besides the oops there are "Link up" events multiple times [14:24:46] in dmesg [14:24:51] so yeah, it could just be that the level of xmit traffic on upload makes the blip un-tolerable to the xmit queues, whereas text can survive it [14:25:02] all those hosts have numa_networking isolate [14:25:09] yes [14:25:14] that was my theory too [14:25:22] but I don't think it's actually the problem [14:25:42] nope, just a data point [14:25:52] actually at first I though the 503 issue was recurring, then I blamed NUMA, then eventually it was just "wtf too many possibilities, but probably not those things" [14:26:21] but we could pull the NUMA isolation from half the cluster or whatever, it would be nice for comparison anyways [14:26:37] but, maybe later [14:28:05] OT: the fact that the pybal bgp state machine doesn't go to IDLE from ESTABLISHED is no bueno, I guess? [14:28:07] anyways, when rob gets online he may have more insight about the local situation in ulsfo [14:28:27] (or can get more insight) [14:29:18] there are some strange possibilities of course [14:30:39] e.g. it could be DDoS->upload@ulsfo, NUMA delay on be traffic for that causes xmit queue timeouts under heavy pressure, netdev watchdog flaps ethernet link status, host loadaverages go through the roof from all the chaos, power supply draw goes up from load, PDU fails and/or breaker flips off somewhere as a result [14:33:16] UL promised they'd get back to us with a root cause for the PDU failure [14:38:35] gehel: so, can merge now on the maps a/a thing? [14:41:04] bblack: I'll be back in ~1.5h if you want me around when merging, but feel free to move ahead without me... [14:42:52] gehel: I'll wait :) [14:43:21] ema: re exp(-size/c) - another issue is drand48() probably isn't ideal with heavy threading like varnish [14:43:49] it might "work", but it might screw with the distribution, since drand48() isn't thread-safe and saves internal state to move the rng forward between what it thinks are serialized calls [14:44:09] (serialized single-thread calls, that is) [14:44:31] drand48_r then? [14:44:45] erand48()? 
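(Editor's note: a minimal C sketch of the exp(-size/c) admission check discussed at 14:09-14:15, under the assumption that it boils down to comparing a uniform random draw against exp(-size/c). This is not the code from the gerrit change referenced earlier; the function name and parameters below are illustrative only. The RNG call is left as plain drand48() to mark exactly where the thread-safety concern raised just above applies.)
```
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Probabilistic size-based admission ("exp policy"), sketch only.
 * admission_param is the 'c' in exp(-size/c): a larger c admits large
 * objects more readily; c == 1 is nearly pass-only for anything big.
 */
static bool
admit_object(uint64_t size_bytes, double admission_param)
{
	/* e.g. size 5 with c == 1 gives exp(-5) ~= 0.0067, the ~0.006
	 * figure quoted at 10:57 above. */
	double p = exp(-(double)size_bytes / admission_param);

	/*
	 * drand48() keeps hidden global state and is not thread-safe;
	 * inside varnish worker threads this draw should come from a
	 * per-thread RNG instead (see the discussion around this note).
	 */
	return drand48() < p;
}
```
(A very large admission_param pushes p toward 1 for all sizes, i.e. admit everything, while a very small one approaches pass-only, which is presumably the "emergency knob" idea floated at 14:14.)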
[14:44:57] oh I guess drand48_r is glibc [14:45:22] yup [14:45:40] and I guess technically, even regular erand48() isn't really thread-safe, just thread-safer [14:46:05] but somewhere with any approach, we'll need local per-thread buffer space for the RNG inside of varnish [14:47:02] I'm surprised varnish doesn't already supply a worker-thread-safe rand() of some kind. surely it does? [14:47:10] because it exposes one at the VCL level after all [14:47:28] so there's std.random [14:47:49] perhaps we can use that? [14:47:57] https://github.com/varnishcache/varnish-cache/blob/master/lib/libvmod_std/vmod_std.c#L125 [14:48:13] VRND_RandomTestableDouble()? [14:49:24] hmmm I guess that's master/v5 [14:50:04] lol, libvmod_std's random() in varnish4 is just a drand48() wrapper :P [14:50:23] sweet [14:50:57] technically since it's an RNG, if the inputs are getting perturbed by other threads in parallel all the time, you can kinda hadwave it all away and say "meh the output is still random" [14:51:14] but a lot of math goes into randomness working properly, I wouldn't be surprised if racy use of it introduces biases [14:51:50] plus, all the threads will be hitting those shared memory rng-state buffers, causing mem/cpu-cache-line contention, etc [14:52:01] it's just better if you have a thread-aware RNG [14:52:23] (it's also pretty simple to implement a very fast and high-quality one!) [14:53:39] we could make our own proper rngs in a vmod [14:53:45] basically, steal the code from gdnsd that does the same [14:54:16] https://github.com/gdnsd/gdnsd/blob/master/include/gdnsd/misc.h#L108 [14:54:55] + state init for per-thread stuff @ https://github.com/gdnsd/gdnsd/blob/master/libgdnsd/misc.c#L114 [15:48:30] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3624714 (10Cmjohnson) @robh I want to confirm that I do have 2 spares still in their original packaging in storage. TA3716160376 TA3716160364 [15:57:05] 10Traffic, 10Operations, 10hardware-requests, 10ops-ulsfo: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3624760 (10RobH) [16:07:18] robh: so once you're ready [16:07:23] a bunch of ulsfo stuff to talk about [16:07:37] sure! [16:07:55] we had a strnage outage there last night a couple hours after you left, with weird symptoms and timeline, but eventually involving some kind of PDU outage [16:07:58] https://phabricator.wikimedia.org/T176386 [16:08:08] yeah i saw that, but i didnt actually touch anything int he racks ;] [16:08:09] we don't know what the causes and effects are really [16:08:11] so it wasnt me! 
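(Editor's note, closing out the RNG thread from 14:43-14:54: a minimal sketch of a per-thread generator in the spirit of the gdnsd code linked above, but not copied from it. The xorshift64* step, the lazy seeding, and the GCC/Clang __thread storage are assumptions made for illustration; a real vmod would seed from a proper entropy source at worker-thread init, as gdnsd does.)
```
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* One word of PRNG state per thread: no sharing, so no locking and no
 * cache-line contention between varnish worker threads. */
static __thread uint64_t rng_state;

/* Lazy, low-effort seeding; glibc's pthread_t happens to be an integer,
 * which is good enough for a sketch but not portable. */
static void
rng_seed_if_needed(void)
{
	if (rng_state == 0) {
		uint64_t s = (uint64_t)time(NULL)
		    ^ ((uint64_t)(uintptr_t)pthread_self() << 32);
		rng_state = s ? s : 0x9e3779b97f4a7c15ULL; /* never all-zero */
	}
}

/* xorshift64* step: fast, small state, decent statistical quality. */
static uint64_t
rng_next(void)
{
	uint64_t x = rng_state;
	x ^= x >> 12;
	x ^= x << 25;
	x ^= x >> 27;
	rng_state = x;
	return x * 0x2545F4914F6CDD1DULL;
}

/* Uniform double in [0, 1), usable for the exp(-size/c) coin flip in the
 * earlier admission sketch in place of drand48(). */
static double
rng_double(void)
{
	rng_seed_if_needed();
	return (double)(rng_next() >> 11) / (double)(1ULL << 53);
}
```
(std.random() at the VCL level would be simpler where it suffices, but as noted at 14:50 it is just a drand48() wrapper in varnish 4, so a vmod-level per-thread RNG along these lines avoids both the distribution worry and the shared-state contention.)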
[16:08:26] yeah, hopefully ulsfo will have some answer about the PDU fail [16:08:34] (which came back on its own a bit later, from our perspective) [16:09:02] the cabinet doors do rest against the fiber stack in each rack [16:09:03] if they say the PDU fail was due to excess load, then that may push back on us (that we're running too close to the limit during this transition, and excess real cpu load -> pdufail) [16:09:07] due to how terribly they are layed out [16:09:15] but it came back without replacement of fibers so it likely wasnt a break [16:09:23] the just mailed us [16:09:24] 10Traffic, 10Operations, 10Pybal, 10fundraising-tech-ops: pybal vs firewall failover - BGP session down - https://phabricator.wikimedia.org/T173028#3624822 (10ema) p:05Triage>03Normal [16:09:25] they* [16:09:36] ahhh [16:09:37] The power supply on cp4008 went as we were bringing your b-side power back up. [16:09:40] but instead of saying what happened with their PDU, they're telling us what happened with our PSUs [16:09:47] yeah, old cp systems have been taking down the pdus regularly [16:09:58] when they have a psu short [16:10:00] it seems to kill power [16:10:09] (this is like the 4th or 5th time this has happened with the old cp systems there) [16:10:23] when it happens i have to pull the psu and let the system run with reduced power capabiltiy [16:10:28] s/capabiltiy/redundancy [16:10:42] so this makes sense. [16:10:53] it still seems pretty statistically oddball to me [16:11:12] even though these are older machines, our psu failure rate seems crazy for it to be "natural", seems more like their PDU is killing our PSUs :P [16:11:18] i dunno about the outage part, but a failing psu in a cp system has done this exact same issue with our PDU towers in the past. [16:11:20] or that [16:11:34] but I guess we'll know for sure when we see it happen with a newer system [16:11:34] but its all the systems in the same batch that are very old, so its hard to tell. [16:11:49] anyways, all of those are due for decom sooner rather than later [16:11:49] 10Traffic, 10Operations, 10Pybal, 10fundraising-tech-ops: pybal vs firewall failover - BGP session down - https://phabricator.wikimedia.org/T173028#3624832 (10ema) I've just seen this bug while testing the BGP interactions between pybal-test2001 and quagga, the assertion can be removed as being in ST_IDLE... [16:11:57] and we didnt move to the 1st floor in 200 paul until AFTER the warranty expired in much of ulsfo stuff [16:12:06] so this power issue could just be on the first floor, and killed off our old shit [16:12:26] its still very odd. [16:12:37] if we assume the new PSU-short -> PDUfail was the root cause of all issues last night, it's still a strange sequence of events, as the network card flaps on cp402[2356] happened a while before the PDU-fail [16:13:15] but perhaps we're just not looking at exactly when the PSU died [16:13:48] maybe PSU died (not in our logs in icinga), caused power output of the PDU to be flaky/unreliable, which caused the ethernet switch and/or other hosts to act flakey, then later the PDU failed completely [16:14:11] Record: 12 [16:14:11] Date/Time: 11/21/2016 22:12:28 [16:14:11] Source: system [16:14:13] Severity: Critical [16:14:13] but maye we can get that from drac-level logs/ [16:14:15] Description: The power input for power supply 2 is lost. [16:14:18] yes, that :) [16:14:29] but nov 2016? 
:) [16:14:34] yeah the dates are all wrong on cp4008 [16:14:38] heh [16:14:42] the service event log should have the power loss [16:14:46] well can we calculate the real date from the current one there? [16:14:48] but the most recent one states that date [16:15:12] im not sure why its wrong there [16:15:15] ive not see that happen [16:15:18] basically, it would be interesting to see the SEL's real-world timestamp for the PSU fail, relative to the networking issues [16:15:34] im a bit surprised its the wrong timestamp... at least i assume it is [16:15:43] separately, on other ulsfo issues: [16:16:03] I know you already saw https://phabricator.wikimedia.org/T176366 . We probably won't be ready to pull the plug on those until mid-late next week. [16:16:38] then there's https://phabricator.wikimedia.org/T174891 (cp4024 looks like early hardware fault, needs some kind of debug/replacement of whatever) [16:16:53] so a single PSU failed, and that means that we lose the whole PDU to all servers? [16:16:58] how does that make any sense? [16:17:02] and https://phabricator.wikimedia.org/T175585 (cp4021 has a bad dimm, but it's error-correctable so far) [16:17:26] paravoid: the claimed pattern in the past is when the PSUs fail they short out and break the whole PDU they're attached to, or something like that [16:17:39] we haven't seen that anywhere else [16:17:43] paravoid: I'm just relaying what has been happening at ulsfo over the past 6-8 months. every time we've had a PDU tower actually fail [16:17:46] its had a bad psu [16:17:49] yeah it all sounds shady to me [16:17:52] i made a pretty big deal about this =P [16:17:58] yeah, maybe UL needs to get better PDUs though [16:17:59] because its odd as hell [16:18:05] I really think statistically it's more likely that their PDU is faulty and it's killing our PSUs one by one [16:18:11] yeah [16:18:12] they've also replaced the PDU once [16:18:13] that too [16:18:22] when we killed it outright [16:18:24] they said [16:18:36] unitedlayer doesnt really do proper trouble tickets for everythign like equinix though [16:18:41] so im not sure i ever got it in writing for them [16:18:47] but even if it's random, we can't be losing a PSU on cp4008, and then losing cp4007 [16:18:48] i can dig for it if needed. [16:18:49] and ripe-atlas etc. [16:18:52] that's just insane [16:19:36] I'm not sure how we're going to demonstrate they are causing the issue iwth their PDU over our old PSUs failing. [16:19:48] (not disagreeing with you faidon, seeking how best to resolve ;) [16:20:03] it is insane that we have this amount of failures. [16:20:05] even so, why are our old PSUs always only failing on this one rail/side/whatever? like, all the B-sides or whichever it is [16:20:15] you'd think random failure of aging PSUs would be random between the two sides [16:20:49] well, i think we had one fail on the other side of the psu slot [16:20:53] but the actual cables are a mess [16:21:01] so a psu 1 could be plugged into tower 2 for all i know [16:21:06] they are not labled or organized at all in ulsfo. [16:22:14] which is likely what happened there, iirc we've only ever had power loss on the second tower in 1.22 [16:22:24] but meh, its so messy =P [16:22:54] paravoid: Do we want them to replace the PDU tower due to our suspecting it is an issue? [16:23:59] is it an option to just bring our own? [16:24:12] buy some high quality pdus and plug into their L6-30 and take over this end of the issue? [16:24:24] or they own the racks and pdus and just don't work that way? 
[16:24:38] Not sure, we'd maybe have a tough time getting our servertechs to fit [16:24:46] they seem to barely fit the apc ones and they are thin [16:24:51] i can ask [16:24:58] (plus our own pdus mean more power data) [16:24:58] are they using the tall skinny ones that go up the back side edges of the rack or whatever? [16:25:06] or like 2U rackmount pdus at the bottom? [16:25:11] 0u on sides [16:25:14] ok [16:25:27] though we lose the bottom 4 u of the rack to the way the power plugs are setup [16:25:32] unitedlayer is terrible. [16:25:44] I've used both, just whatever would work with the setup [16:26:01] they dont run any UnitedLayer equipment in the rack other than the PDUs [16:26:49] https://www.amazon.com/APC-AP7960-Switched-24-Outlet-Protector/dp/B0000CNZL0 or whatever appropriate specs-variant, for zero 0 and fully managed :) [16:26:59] would be my (totally outdated and half-informed) opinion [16:27:38] well, we'll try to fit a servertech first i think [16:27:38] anyways, we'll be pulling the plug on most of the legacy hosts pretty soon [16:27:41] since we use them everywhere [16:27:52] if we get PSU/PDU fail with the new hosts, I think that will be pretty damning [16:27:59] so do you guys want me to start with unitedlayer about replacing the pdu with their own, or we put our stuff in? [16:28:05] or do we just want to wait? [16:28:29] well the issue is getting ridiculous, but if they're blaming legacy PSUs failing and we're about to yank all the legacy stuff anyways [16:28:38] it's probably easier to do that first and then make the argument when it fails again [16:28:46] unfortunately, we've not had an issue like this elsewhere in our deployments. at past jobs any dirty power issues we had killed servers a LOT faster than they die in unitedlayer [16:29:04] they die but months apart in UL [16:29:17] which is awfully far apart to generate a distinct reasoning so far. [16:29:29] in the meantime, hopefully we get the 8x decom from that new ticket soon, and that lets us rack the last 4 CPs (and then some?) and move on with this process [16:29:41] but the 4024/4021 hw issues are separate, probably need dell tickets and some form of hw replace [16:30:17] (or does dell refuse to RMA correctable dimm errors until they get uncorrectable?) [16:30:26] so 4021 is a memory issue and I can get a new dimm dispatched today [16:30:58] 4024 is a "wtf kernel crashing with odd hardware-looking issues", I don't know that we know which component [16:31:07] maybe some dell diagnostics or the SEL can say more? [16:31:09] 4024 hw fault throws a cpu error as part of it [16:31:16] but yeah, not sure, ill have to prep and run dell hw tests [16:31:38] bblack: I'm back and ready when you are... [16:31:49] is it ok for me to wait to go onsite for the hw tests until after the new memory arrives or is this more urgent? (do i need to go back down today or can it wait for friday or monday when the new memory arrives?) [16:31:52] sorry for intervening sporadically, in between meetings [16:31:54] gehel: ok, gimme like 5-10 mins and we'll go [16:32:01] sure, take your time... [16:32:28] robh: it can wait for efficiency of ulsfo trips I think [16:32:41] cool, ill get a new dimm dispatched for cp4021 now [16:32:46] self dispatch yay [16:33:13] it'll arrive either tomorrow or monday, and illl go fix cp4021, and run hw tests on cp4024 [16:33:38] ok, thanks! 
[16:33:47] we'll need to depool cp4021 before dimm replace, it's live [16:33:58] maybe best to wait for monday even if it arrives tomorrow [16:34:14] just because so many risk/fail-balls are being juggled with caches and/or ulsfo at present [16:34:20] duly noted [16:34:21] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597148 (10RobH) Please note this host is still pooled and active, and will need to be depooled before it is taken offline for dimm replacement. [16:34:37] bblack: cool with me, lets plan for a monday ulsfo trip post ops meeting. [16:35:29] ok [16:41:06] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10RobH) Dell has preboot diagnostics as an option (epsa) and then the code translation at https://www.dell.com/support/home/us/en/4/pre-boot-analysis When I'm onsite next Monday, I'll reboot this... [16:41:52] ill run the dell epsa testing suite on cp4024 since the actual error doesnt tell us much [16:41:58] other than its likely angry with a cpu or mainboard [16:42:19] so [16:42:26] what's the theory for yesterday's outage? [16:42:39] I don't have a great one yet [16:42:39] a bunch of different servers had issues with their network [16:42:51] then 40 minutes later a power supply in one of them was shot [16:42:56] coincidentally [16:43:11] also, ripe-atlas-codfw had another spate of probe fails around the same time [16:43:15] which brought down the PDU, which brought down ripe-atlas and cp4007? [16:43:20] and we lost an eqiad<->codfw link around the same time [16:43:26] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3624959 (10ayounsi) [16:43:28] was that unrelated? [16:43:35] it's all suspicious in time [16:43:35] was there a maintenance for that? [16:43:43] I think ema mentioned a maint window for that [16:43:51] for the eqiad<->codfw link? [16:44:01] even so, what's up with codfw ripe probe issues recurring around the same window? [16:44:24] it makes me think there was a triggering network event at the start of this, e.g. some traffic spam [16:44:43] https://groups.google.com/a/wikimedia.org/forum/#!msg/ops-maintenance/_vFlzsGqd5A/XM2nHAIoCgAJ [16:44:47] was the one ema mentioned [16:45:03] ok [16:45:11] so that makes sense for the eqiad<->codfw link [16:45:40] but look at the early timelines in the top of https://phabricator.wikimedia.org/T176386 [16:46:01] how does 03:20 <+icinga-wm> PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:46:08] end up "coincidentally" right in the middle of that ulsfo problem? [16:46:08] yeah confirmed that ^ maint above is for cr1-eqiad<->cr1-codfw [16:46:28] is there some problem with our monitoring of that stuff that makes it confusing? 
[16:46:32] that alert is tripping far too often these days [16:46:44] the threshold is 19, I wonder if like 15 are permanently failed [16:46:52] and then we lose another 5 and it trips or something [16:47:04] so not sure if related but we also had a mysterious ripe-atlas issue with IPv6 only a few days ago [16:47:20] ok so maybe let's ignore ripe-atlas issues for the moment [16:47:49] so icinga on irc is reporting a maps (cache_upload) failure at 3:16 as leading indicator [16:48:06] we know the bnx2x NETDEV WATCHDOG thing happened on 4/5 cache_upload@ulsfo all right around 3:17 [16:48:44] which is followed up by storms of link flaps for them, and pybal randomly de/re-pooling them all over for several minutes as pybal monitors randomly fail/recover over and over [16:49:09] but then the PDU event is much later [16:49:11] Sep 21 03:57:51 asw-ulsfo alarmd[1458]: Alarm set: PS JAVA color=RED, class=CHASSIS, reason=FPC 2 PEM 1 is not powered [16:49:13] Sep 21 03:57:51 asw-ulsfo craftd[1459]: Major alarm set, FPC 2 PEM 1 is not powered [16:49:35] (long after the initial bnx2x event, and also long after depooling traffic which reduced network+cpu loads back to minimal) [16:49:41] well "long" as in 40 minutes [16:50:18] ema: ripe-atlas IPv6 was not their fault? https://www.ripe.net/support/service-announcements/ipv6-connectivity-issues-for-multiple-services [16:50:28] I think the netdev watchdog happening later than the maps failure is telling [16:50:37] some queries were already starting to fail before that fault hit [16:52:08] gehel: it's going [16:52:14] sok, looking... [16:52:16] volans: hey, yes! Right day, right time (10:20ish UTC) [16:53:48] bblack: you're not forcing the puppet run, right? So it will take some time for all cache to route traffic to codfw? [16:53:54] I am forcing [16:53:59] it's not done yet though [16:54:10] ok, let me know... [16:54:35] the inter-cache routing doesn't change. just some of the queries that flow via codfw caches would normally forward -> eqiad-caches->eqiad-maps, and now they drop straight in from codfw-caches->codfw-maps [16:54:39] it's done [16:56:24] 10Traffic, 10Operations, 10ops-ulsfo: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3625017 (10RobH) Dell service request: SR954179119. Replacement dimm should arrive either Friday or Monday. I'll be onsite Monday to replace the defective dimm (after depooling the serve... [16:57:17] so if we form a straw-man theory that this is all about the PDU fail (only) [16:58:07] it has to be that the PDU was supplying flaky power during the ~3:15-3:57 window leading up to failure, and that flaky power was causing the other issues by making the switch (ports) flaky or the hosts flaky [16:59:02] why would it make the switch ports flaky though? [16:59:14] the junipers getting fauly input power? 
[17:00:01] but then afaik only those 5/6 upload caches had eth link flapping [17:00:06] not text, not lvs, etc [17:00:28] but then again again, those upload caches transmit significant more packets/bandwidth than anything else there [17:01:46] I don't think the PDU fail is the primary root cause of all of this, but I'm trying to put up my best theories anyways just to shoot it down [17:02:51] if the PDUs aren't the initial cause of the problem, then we have something else going on at the network-traffic and/or host levels that caused the initial problems [17:03:04] and the PDU fail 40 minutes later is just an annoying coincidence [17:03:28] (or perhaps only incidentally related as the power loading shifting up and then back down from the event caused some power component somewhere to kick the bucket or whatever) [17:05:52] if we can find evidence for it, I could believe that a real or unintentional DDoS of traffic towards cache_upload [17:06:09] could trigger the netdev watchdog timeouts and link flaps, either in the general case, or specifically because of the NUMA hacks [17:09:47] bblack: I do see increased traffic on maps/codfw, but all look good. Thanks for the merge! [17:20:38] confirmed that the bnx2x strangeness was only on 4/5 pooled upload@ulsfo nodes (4022,3,5,6, but not 4021) [17:20:53] 4021 does show some of the same stats bump on inbound network and such, just didn't apparently reach the big problem point [17:23:51] https://grafana.wikimedia.org/dashboard/db/ipvs-backend-connections?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-server=2&from=1505962749828&to=1505967528736 [17:23:56] so ipvs backend conns, the timing looks off there [17:24:51] (I mean, timing looks not right for inbound traffic being the initial trigger) [17:25:41] it is all around the time ulsfo is in one of its daily ramps up from lower traffic to higher traffic [17:25:54] but the ~3:15 point where the bnx2x stuff first hits, the ipvs connections look pretty normal [17:26:52] I that that graph showing increasing network conns around 3:40-3:50-ish is just the fallout of crazy pybal de-re-pool flapping in the face of all those cache_upload link flaps [17:26:58] and ends when I dns-depooled ulsfo later [17:27:32] so we're back to the bnx2x netdev watchdog being somewhere near the start of the timeline, without crazy inbound traffic to trigger it [17:31:12] 10netops, 10Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3625182 (10ayounsi) Created the following ROAs via RIPE's website for our two least used prefixes: |AS number|Prefix|Up to| |AS43821|2a02:ec80::/29|48| |AS14907|2a02:ec80::/29|48| |AS43821|18... [17:33:54] we also have logs of actual 503s from upload@ulsfo starting about the same time as the bnx2x kernel entries. the first definitely-suspicious datapoint on those being at 3:15, with 1min resolution (so could've actually started at 3:14:01 or whatever) [17:34:30] zoom of that here: [17:34:32] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=ulsfo&var-cache_type=upload&var-status_type=5&from=1505963547355&to=1505963956581 [17:35:58] is there any evidence that one server (say, the one with the failed PSU, cp4008 I think?) was first? [17:36:26] so maybe that one misbehaved and as a backend caused issues with the other ones somehow? (fuzzy, I know) [17:37:37] also [17:37:56] are we sure about the theory that cp4008's PSU failed first, then brought down the PDU down? 
[17:38:00] "The power supply on cp4008 went as we were bringing your b-side power back up. [17:38:14] failed as they were bringing it back up, so later? [17:38:38] 4007/8 were depooled long before this event [17:38:46] so they're not involved in any of the real traffic here [17:38:51] oh ok [17:39:00] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&from=1505962177698&to=1505966852754 [17:39:07] this is the broken-out fetch fails during that 503 period [17:39:30] it's interesting that of the 5x pooled upload servers there (21-26, but 4024 is dead/depooled at present for unrelated reasons) [17:39:42] 21 is the only one that didn't suffer the bnx2x oops and link flaps [17:39:52] yet it has more fetch-fails there than the others [17:40:35] I guess that's explainable by the link-flapping ones falling out of service completely (at the ethernet level, and also randomly off/on at the pybal-monitoring level) [17:40:46] thus getting less traffic through to a valid 503 about into this data [17:42:47] root@cp4008:~# bmc-device --get-sel-time; date [17:42:47] SEL Time : 09/21/2017 - 17:42:11 [17:42:47] Thu Sep 21 17:42:38 UTC 2017 [17:42:51] did someone fix that now? [17:42:56] or today I mean [17:43:20] don't know, not I [17:43:24] robh: ^ [17:43:38] i did not [17:44:05] i did racadm getsel on it to read the event log [17:44:09] so cp4008's iDRAC timekeeping is fine [17:44:10] but did not correct its time yet [17:44:18] so its fine and it didnt log the power event? [17:44:20] and the last SEL entry is back from Nov 2016 [17:44:21] that seems borked. [17:44:23] seems so [17:44:30] other hosts in the rack logged their loss of power [17:44:47] ie: cp4021 saw the power loss in the SEL normally [17:44:50] that one did too... in Nov 2016 [17:44:50] if it was just 1 server, I'd be quick to blame the NUMA thing as having some kind of random timing/bandwidth-fail that broke the tx queue [17:45:05] so maybe it's just failed since then? [17:45:09] and UL just noticed it today [17:45:13] er, yesterday even [17:45:15] but how did 4/5 pooled uploads get a broken tx queue timeout all in a <1min window together, without a spike in input traffic to drive it? [17:45:24] sorry, cp4024 saw the power event not 4021 [17:45:56] oh, so it may not be cp4008 at all that caused it if it didnt fail just then [17:46:03] see above [17:46:09] 20:38 < paravoid> "The power supply on cp4008 went as we were bringing your b-side power back up. [17:46:17] it didn't sound like UL said that cp4008 was the *cause* [17:46:25] ok, true [17:46:32] they've yet explain to us what happened [17:46:32] I think 4007 was the first host to show DOWN in icinga due to what looked like powerfail [17:46:33] not 4008? 
[17:46:40] bblack: yes, that's unrelated [17:46:51] so the power-related theory thus had been [17:46:54] oh right [17:47:05] that 4008 PSU failed and brought down the PSU that killed 4007 and others [17:47:13] cp4008 PSU failed -> PDU failed -> brought down cp4007 (has only one working PSU), ripe-atlas-ulsfo, caused Juniper yellow alarm [17:47:42] (red) [17:47:43] but both the fact that cp4008 PSU failed last night (and not in Nov 2016) is uncertain [17:48:02] and also UL never stated it was the cause for their PDU issue [17:48:12] so let's ask them what caused the PDU issue [17:48:52] so uptimes info, because I was curious, on 4021-28 (other than dead 4024) [17:49:07] 6/7 (the 2x texts, and 4/5 uploads) have up 21 days, 23:32 [17:49:15] 1x upload has up 21 days, 3:15 [17:49:19] robh: will you ask them? [17:49:22] but the 1 oddball is 4022, not 4021 [17:49:28] I can too, but probably better coming from them since you have a relationship already [17:49:39] I'll reply back and cc you both on my reply [17:49:44] oh, its ccing noc [17:49:44] nm [17:49:45] also ask them if anyone was near our rack [17:49:45] (in case this was some NUMA-net fail that happens X seconds after initial boot or whatever - seems unlikely) [17:49:53] or messing with anything related to our patch panels [17:49:56] circa 03:00 UTC [17:50:44] I'm still thinking that someone was just working on something and moved or messed with both our fibers and the PDU's power [17:51:04] oh also [17:51:21] note that for the very same PDU, we got from UL's icinga a DOWN and then an UP, way before any of that started [17:51:38] 20:56 -> 20:58 [17:52:05] (UTC) [17:52:33] We were advised yesterday by Assad (via this ticket thread) that UnitedLayer would be providing details as to the cause of our power loss event in 1.22 tower B. We've lost power to that tower many times in the past months, which tends to burn out some of our PSUs when they come back. [17:52:33] Are there any power figures/graphs that can demonstrate any kind of short or undue load that may have lead to this? Was any work being done on any adjacent cabinets or power circuits? [17:52:37] seem good? [17:52:57] their initial email said they would provide details, so just gently reminding them ;] [17:53:01] I'd remove the "We've lost power to that tower..." sentence [17:53:13] done [17:53:13] as to not lose focus on the issues of yesterday [17:53:19] we can complain about it in a subsequent email [17:53:33] ok, sending =] [17:53:43] (unless folks think i should add more info?0 [17:53:45] ) [17:54:25] (nope? ok gonna send.) [17:54:32] not from me [17:55:00] keep it short and vague and let them provide details seems best. [17:55:02] setn. [17:55:03] sent [17:55:20] of course it happens 2 hours after i leave said site [17:55:48] wild theory: [17:56:12] the PDU had an issue during the night (03:56 UTC) [17:56:14] cabinet door hits cable stack and wiggles plugs in sockets is my only odd theory. [17:56:22] er [17:56:24] and it fianlly causes short 2 hours later seems odd. [17:56:34] 20:56 UTC, 03:56 (at night) their time [17:56:41] in their morning, someone went there to investigate [17:57:08] tried to plug a cable or something, wiggled the fibers [17:57:31] maybe they started replacing it [17:57:49] and half an hour later they unplugged the power and it warned, or it finally failed [17:58:31] (every time i close the cabinet door i look there i hang out a bit since it compresses so many cables when it closes) [17:58:43] since once it did unplug a system from their pdu. 
[18:03:24] the ipvs conns stats are wrong, I'm not sure exactly how [18:03:27] https://grafana.wikimedia.org/dashboard/db/ipvs-backend-connections?orgId=1&from=now-5m&to=now&panelId=1&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-server=2 [18:03:40] ^ that's showing eqiad upload port 443 currently at "3.880K" [18:04:11] there's actually about 350K or so active conns for that service right now, looking at live ipvsadm [18:05:31] I started wondering because I know I saw numbers closer to 200K in ulsfo during the problem, but the history graphs for that time don't show anything like that [18:58:40] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 12 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#3625335 (10dr0ptp4kt) Note to future selves: we'd probably want the filename to contain the human readable File: name so people and machines... [19:20:31] bblack,ema: if I schedule the esams AS# renumbering on week 41 (like Oct 10 or 11) would one of you be around to help with depolling, reconfiguring pybal with the new AS#, testing, etc... ? [19:23:08] yes [19:23:15] XioNoX: ^ [19:23:25] thx! [19:30:36] yay, finally merged https://github.com/Juniper/ansible-junos-stdlib/pull/109 [19:51:57] awesome :) [20:11:31] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625467 (10RobH) [20:16:58] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625517 (10ayounsi) a:05ayounsi>03Papaul thanks Rob! [20:21:54] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625550 (10ayounsi) [20:26:05] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625559 (10ayounsi) [20:49:11] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3625620 (10Jgreen) @ayounsi ports are connected! When you have some time let's start setting them up? [22:53:55] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3625942 (10Reedy) [22:58:04] 10netops, 10Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3625949 (10ayounsi)