[00:39:24] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack)
[00:39:27] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (BBlack) Open>Resolved Seems to be working fine now, thanks!
[00:40:40] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack)
[00:41:02] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (BBlack)
[00:41:05] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (BBlack) Open>Resolved These are fully in-service. Will file separate ticket(s) about decomming various older cp10xx machines.
[06:24:24] bblack: ah, nice! (re: puppet runs)
[07:08:09] goodbye cache_misc!
[07:08:10] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=All&var-cache_type=misc&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&from=1533816610197&to=1533870979464
[07:10:43] Traffic, Operations, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (ema)
[07:50:34] ema: you still planning on reimages today?
[07:50:49] gehel: nope, just reboots
[07:51:06] volans_ did a fix to wmf-auto-reimage yesterday, but it seems it introduced another bug. I'm having a look
[07:51:18] ema: ok, so you should not be impacted!
[07:52:08] :)
[08:19:04] Traffic, Operations: cp3040: kernel crash in ipsec code shortly after reboot - https://phabricator.wikimedia.org/T201666 (ema)
[08:19:41] Traffic, Operations: cp3040: kernel crash in ipsec code shortly after reboot - https://phabricator.wikimedia.org/T201666 (ema) p:Triage>Normal
[08:21:55] vgutierrez: thoughts on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451654/ ?
[08:23:46] I've set profile::trafficserver::backend::outbound_tls_cipher_suite to the current ATS default for now
[08:23:54] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451654/4/hieradata/role/common/trafficserver/backend.yaml
[08:37:13] ema: taking into account that we control both sides of the connection
[08:37:23] ema: I'd suggest something like the strong settings in ssl_ciphersuite
[08:39:31] ema: -ALL, only those in strong, and I'd avoid DHE-DSS stuff
[08:39:55] being server side and I think that all of our servers support AES-NI I'd get rid of chacha20 as well in this case
[08:40:52] vgutierrez: so maybe we need another compatibility mode in ssl_ciphersuite, something stricter than "strong"
[08:41:16] like "we-control-both-sides" :)
[08:41:34] yup
[08:42:01] we could use strong iff we could do conditional TLS handshaking like google does
[08:42:19] and offer Chacha20 only to clients that aren't able to do AES-NI
[08:42:28] but I suspect that's a boringssl feature
[08:42:56] > offer Chacha20 only to clients that aren't able to do AES-NI
[08:43:00] we are the client
[08:43:13] I know
[08:43:37] but if you use the strong settings from ssl_ciphersuite you won't benefit from AES-NI
[08:44:12] so client-side, right now we should use strong - chacha20
[08:45:12] ema: oh..
and I'd keep it simple and shorter, just 'ECDHE-ECDSA-AES256-GCM-SHA384' + 'ECDHE-RSA-AES256-GCM-SHA384',
[08:45:30] to be able to talk to servers using EC or RSA certificates and that's it
[08:46:09] ema: if we check other usecases where we control both sides, like varnishkafka, we only offer 1 ciphersuite
[08:48:09] aka ECDHE-ECDSA-AES256-GCM-SHA384
[08:49:01] ok, so:
[08:49:02] profile::trafficserver::backend::outbound_tls_cipher_suite: 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384'
[08:49:05] ?
[08:49:21] if it's openssl syntax: '-ALL:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384'
[08:50:18] right!
[10:24:48] "we control both sides" doesn't necessarily mean both sides are completely-optimal. We may well have applayer services ATS needs to contact which are implemented in Java or worse
[10:25:43] also, I think the benefits of AES-NI are overrated. It's premature optimization these days, as both are cheap on a modern fast CPU.
[10:26:42] (and quite likely, way more bogomips were wasted on some other processing of the content than on encrypting it)
[10:27:47] [AES-NI overrated: I meant AES-NI'd AES vs unaccelerated ChaCha, not AES-NI'd AES vs unaccelerated AES]
[10:28:47] ultimately in the most-common cases our servers (that ATS communicates with) will be the deciders, since they're likely to be tlsproxy and use prefer_server_ciphers
[10:30:25] anyways, none of that changes what you've done, it's fine either way.
[10:31:00] maybe shave a few lines of code though and just hardcode tlsv1.[01] to off on the backend side of ATS though
[10:31:17] I can't imagine any world in which we want to support connecting to a <1.2 service
[10:31:19] bblack: do you know how google edge layer picks Chacha20 for non AES-NI clients?
[10:31:35] at least ssllabs it's reporting that behaviour
[10:31:44] vgutierrez: the same way we used to: hack your ssl layer to watch the client's pref list
[10:32:17] the client sends the list of ciphers, generally in a preferred order, and the server picks one (based on its own server-preferred order that the client never actually sees)
[10:32:18] and assume that a client without AES-NI support it's going to list Chacha20 in top of AES?
[10:33:11] right
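A quick way to sanity-check what an OpenSSL-syntax string like the one settled on above actually enables is to feed it to a TLS library and list the result. A minimal sketch using Python's ssl module; the exact expansion depends on the local OpenSSL build:

```python
import ssl

# Cipher string from the 08:49 exchange above: drop everything, then re-add
# only the two ECDHE + AES256-GCM suites (ECDSA and RSA variants), so ATS
# can talk to backends using either kind of certificate.
CIPHER_STRING = "-ALL:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.set_ciphers(CIPHER_STRING)  # raises ssl.SSLError if nothing matches

# Print the suites this string enables on the local OpenSSL build
# (TLS 1.3 suites, if any are listed, are controlled separately from
# this string on OpenSSL 1.1.1+).
for cipher in ctx.get_ciphers():
    print(cipher["name"], cipher["protocol"])
```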
[10:33:58] instead of straight-up server pref, or straight-up client pref, boringssl or patched openssl says "ok we consider the top strong AES/ChaCha options equivalent - we server-prefer them to anything else, but if both are available from the client, let the client's order choose"
[10:34:33] we used to patch that in when we first turned on chapoly as well
[10:35:23] but porting that nonstandard behavior around got annoying, and it's not the direction openssl is going long-term (see my very long-lived github issue w/ openssl on adding equal-preference cipher groupings)
[10:35:43] bblack: btw, I don't know if you had the chance to take a look at https://code.fb.com/networking-traffic/deploying-tls-1-3-at-scale-with-fizz-a-performant-open-source-tls-library/
[10:35:50] I ended up making a judgement call at some point that we should just drop the shenanigans and prefer chapoly to AES outright
[10:37:23] the logic goes something like: (a) If you're super-paranoid, ChaPoly could be considered stronger than AES (b) If you have a crappy CPU, ChaPoly speeds things up for your crappy client (c) If you have a decent enough CPU to have AES built-in, you probably also have enough CPU to do ChaPoly fine anyways
[10:38:27] that's of course end-user-side logic
[10:40:03] but really it's all the same for ATS outbound too. AESNI-v-ChaPoly cpu% diff is not going to keep anyone awake at night, and arguably ChaPoly is stronger in some sense (in the sense that in some paranoid delusional world where the AES design has been backdoored, or was long-ago broken in secret, ChaPoly likely hasn't suffered either fate)
[10:42:59] there are some other more-esoteric reasons to like chapoly more than AES too, about future risks
[10:43:11] and batch risks, and combining the two
[10:44:33] batch attacks effectively reduce the key strength of AES similarly to quantum attacks
[10:44:45] but batch attacks don't reduce chapoly strength
[10:45:23] so in a non-quantum non-batch world: AES256 is overkill at ~256bit strength, ChaPoly is the same, and AES128 is just fine at 128 strength.
[10:45:53] in a batch-attack world: ChaPoly is still ~256bit, AES256 is 128-bit, and AES128 sucks.
[10:46:10] in a post-quantum world: ChaPoly is ~128bit, AES256 is ~128-bit, and AES128 sucks.
[10:46:30] in a batch+post-quantum world: ChaPoly is ~128bit and fine, AES256 sucks, and AES128 sucks harder
[10:47:11] not that post-quantum anything is a factor today, because our key exchange algorithms are all trivially broken by post-quantum attacks.
[10:47:21] but batch attacks might be depending on your paranoia level
[10:47:58] https://blog.cr.yp.to/20151120-batchattacks.html
[10:48:54] yup, I'm aware of that blogpost. Some could say that djb has personal interests in weaking AES in favor of chapoly }:)
[10:49:12] sure :)
[10:49:21]
[10:49:27] *weakening
[10:49:56] in the net of it, I just trust ChaPoly more, and I think even if you have AES-NI on hand ChaPoly doesn't cost much
[10:49:56] according to djb we shouldn't use any nist ec-curve BTW
[10:50:19] we're now what, something like ~70% ChaPoly with our clients, and we didn't see a CPU blowup for preferring it over the AES we were doing with most before that.
[10:50:57] vgutierrez: yeah we don't have a choice on the EC curve limiting at this point, too many clients only support NIST. But at least we do prefer x25519.
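The "equal-preference grouping" behaviour described at 10:33 (the server treats its top AES-GCM/ChaCha20 options as interchangeable and lets the client's own ordering break the tie) is easy to sketch outside of any TLS library. This is an illustrative model only, not BoringSSL's or OpenSSL's actual API, and the suite names and groups are examples:

```python
# Hypothetical sketch of equal-preference cipher selection as described above:
# the server walks its groups in its own order of preference, but within a
# group the client's ordering decides. Not a real TLS library API.

SERVER_PREFS = [
    # group 1: "top strong" AEAD suites, treated as equivalent
    {"ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-ECDSA-AES256-GCM-SHA384",
     "ECDHE-RSA-CHACHA20-POLY1305", "ECDHE-RSA-AES256-GCM-SHA384"},
    # group 2: weaker fallbacks, strictly less preferred
    {"ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256"},
]

def pick_cipher(client_offer):
    """client_offer: cipher names in the client's preferred order."""
    for group in SERVER_PREFS:        # server-preferred group order
        for name in client_offer:     # client order breaks ties within a group
            if name in group:
                return name
    return None  # no shared cipher -> handshake failure

# A client without AES-NI typically lists ChaCha20 first and gets it;
# a client with AES-NI lists AES-GCM first and gets that instead.
print(pick_cipher(["ECDHE-RSA-CHACHA20-POLY1305", "ECDHE-RSA-AES256-GCM-SHA384"]))
print(pick_cipher(["ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-RSA-CHACHA20-POLY1305"]))
```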
[10:51:08] yup
[10:51:17] 75% according to our lovely prometheus/grafana dashboard
[10:54:04] :)
[10:55:44] .006
[10:55:52] the moving average is apparently still moving somewhere
[10:57:16] ema: so, I need to do the needful things for this itwiki/sitemap thing
[10:57:43] ema: which means adding a new text/misc backend that's categorically odd with respect to recent transitions.
[10:58:16] ema: can I just operate on the text stuff and ignore misc at this point, or do I need to keep their configs diffing appropriately or something because of other ongoing something?
[10:59:10] ema: also, https://gerrit.wikimedia.org/r/c/operations/dns/+/451695 ? I haven't seen any complaints yet
[11:02:01] bblack: oh BTW, ShakespeareFan00 reported at 09:35 UTC today slow performance regarding image upload
[11:02:09] bblack: dunno if it could be related to your change
[11:02:35] who knows, that's a very vague report :)
[11:02:39] indeed
[11:03:23] if his definitely of slow is he couldn't break 32MB/s upload on his super fast low latency connection, then yeah it's probably my change the other day.
[11:03:30] err s/definitely/definition/
[11:04:26] maybe if he's uploading from some hosted server in virginia, otherwise most don't have that kind of upload-side (or if they do, they don't have low enough latency and the right BDP tuning of their buffers to take advantage of it)
[11:05:39] the change only affects outbound-side TCP flows from the caches, but uploads to commons involve such a flow, from the cache->mediawiki
[11:07:49] if we later push down to 100Mbps (12.5MBps) such effects will get even more pronounced for fast users. But IMHO it's still the right move. We can't have arbitrary random user actions taking such large fractions of bandwidth shared by thousands or millions.
[11:08:16] if you really need your petabyte collection of images to upload in reasonable time, we do have some sideloading options that are employed for such things :P
[11:53:52] bblack: you can ignore misc at this point and do the needful on text :)
[11:58:40] bblack: I'm proud of the +8 -109 diffstat here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451826/ , does it look ok to you?
[11:59:30] but that /^cp/ hostname regex in a manifest
[11:59:32] heresy!
[12:01:11] ema: so this removes all the ": on" settings. if I want to do another bypass run on the numa_network disables again, I guess I'll need to add back files for just the ones I'm flipping.
[12:01:20] should be ok!
[12:02:40] bblack: there's only 6 caches left for numa_network reboots, it be a relatively quick thing
[12:02:47] oh nice
[12:03:06] s/it be/it should be/
[12:03:48] or you could've gone reggae with s/thing/thing, mon/
[12:05:32] haha
[12:39:58] is 5006 being reimaged?
[12:40:12] I see 3040/3041
[12:40:30] ?
[12:40:45] oh nevermind, I failed at counting
[12:41:39] 3044/3041 are the two hosts left rebooting for numa_networking
[12:41:50] time to go back to: https://www.youtube.com/watch?v=2AoxCkySv34
[12:42:00] :)
[12:42:31] 3040 seems to be mid-reimagine though/
[12:42:42] err no, something else
[12:43:08] 3040 had troubles today https://phabricator.wikimedia.org/T201666
[12:43:37] stale lockfile
[12:43:41] or something
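On the upload rate-cap point from 11:03–11:08: whether a single client can even reach the 32MB/s (or a future 100Mbps) cap depends on its TCP send buffer covering the bandwidth-delay product of the path. A back-of-the-envelope sketch; the RTT values are illustrative guesses, not measurements:

```python
# Rough bandwidth-delay-product arithmetic for the rate-cap discussion above.
# The RTTs here are illustrative, not measured values.

def bdp_bytes(rate_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep `rate` filled over a path with `rtt`."""
    return rate_bits_per_s * rtt_s / 8

for label, rate in [("~32MB/s cap", 32 * 8 * 1e6),     # ~256 Mbps
                    ("100Mbps cap", 100e6)]:
    for rtt_ms in (10, 80, 150):
        bdp = bdp_bytes(rate, rtt_ms / 1000)
        print(f"{label}, RTT {rtt_ms:3d}ms: ~{bdp / 1e6:.2f} MB must be in flight")
```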
[12:44:04] ok, depooled?
[12:44:28] nope, it's pooled
[12:44:28] newp
[12:44:37] ok I'll make puppet work or something, it's borked on some stuck state
[12:45:34] yeah so the puppet issues seem to have nothing to do with the troubles mentioned in the ticket
[12:46:21] I think it just crashed during a puppet run
[12:46:27] and left a stale lock, which I just removed
[12:47:12] ok
[12:47:33] ok past that confusion
[13:12:34] yeah so, apparently fast storage != lack of mailbox lag
[13:12:42] that's the second one to alert since switching over to them
[13:13:47] I'm kinda giving them a pass for all the churn of the pool transition (which churned storage strangely, and also changed cron timings around randomly as we went, resulting in some going longer between reboot than they should), etc, for now and hoping the 3.5-day restarts will patch them over as things stabilize.
[13:15:11] I donno though, cp1088's backend did restart ~2 days ago, it shouldn't be bad yet :/
[13:15:20] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1088&var-datasource=eqiad%20prometheus%2Fops&from=now-7d&to=now&panelId=21&fullscreen
[13:20:22] it's interesting that the expiry lockops rate drops off to a constant value once the ramp kicks in
[13:23:03] also I'm guessing, when looking at "bytes available" graph, the numbers are just-wrong by some factor
[13:23:06] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1088&var-datasource=eqiad%20prometheus%2Fops&from=now-7d&to=now&panelId=36&fullscreen
[13:23:15] even when it tops out right after a restart, those available bytes numbers are too low
[13:23:34] bin2 showing 10.6GiB
[13:24:04] when its actual size for cp1088 is 565G
[13:24:23] probably something weird going on with unit/time conversion
[13:25:48] in any case, across the board on various uploads, it seems like bin4 fills quickly
[13:26:04] I wonder if bin4 is particularly-problematic with constant churn on super-huge files
[13:27:29] bin4 is for objects >64MB, and then we do have a cap where we hit-for-pass any object >=1GB
[13:28:28] bin4 in esams is ~44G, and new eqiad is ~90G
[13:28:59] we could either re-shuffle our size breakdown a bit and give more headroom in bin4 due to its special nature
[13:29:19] or reduce our HFP cap, maybe 512MB?
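A small sketch of the size-based routing just described: bin4 takes objects over 64MB, and anything at or above the hit-for-pass cap (currently 1GB, possibly 512MB) is never cached. Only those two figures come from the discussion; the smaller bin boundaries below are illustrative placeholders:

```python
# Sketch of the upload storage-bin / hit-for-pass split discussed above.
# Only ">64MB -> bin4" and ">=1GB -> hit-for-pass" are from the chat; the
# lower bin boundaries are illustrative 4x-range guesses, not the real config.

MB = 1024 * 1024
BIN_UPPER_BOUNDS = [1 * MB, 4 * MB, 16 * MB, 64 * MB]   # bins 0..3, 4x ranges
HFP_CAP = 1024 * MB                                     # or maybe 512 * MB

def route_object(size_bytes, hfp_cap=HFP_CAP):
    if size_bytes >= hfp_cap:
        return "hit-for-pass (too large to cache)"
    for i, upper in enumerate(BIN_UPPER_BOUNDS):
        if size_bytes <= upper:
            return f"bin{i}"
    return "bin4"   # everything >64MB up to the cap lands here

print(route_object(30 * MB))    # -> bin3 with these illustrative bounds
print(route_object(200 * MB))   # -> bin4
print(route_object(2048 * MB))  # -> hit-for-pass
```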
[13:29:47] I donno, I'm out on like 3 limbs here
[13:30:29] possibly with the larger storage size and a single drive, the new eqiads just need finer-grained binning to avoid stupid varnish problems, too
[13:33:33] sorry I was lost in puppetlualand
[13:33:37] an alternative theory is that the expiry thread has it easier when all it has to do is process natural expiries, and it's got a harder job with more risk of lag once nukes start happening
[13:34:13] we do run out our storage completely in cache_upload before it all naturally expires, I think, which gets us into nuke-based freeing
[13:34:35] we could just reduce our maximum TTLs until the expiry thread can always maintain free space before nuking happens
[13:35:16] it's kinda not a great tradeoff for hot objects though
[13:35:48] which might otherwise last the longer TTL in an LRU situation, but when you control by reducing TTL it's more like LRI (least-recently-inserted)
[13:39:44] yeah only 1d 17h uptime and lagging already :(
[13:40:33] yeah I think with bins/storage being twice as large per node, it just makes the expiry thread's stupid book-keeping overhead much higher
[13:40:51] almost doesn't matter if the storage is fast, it's all about that thread and the mailbox and the locks
[13:41:32] we might try to reduce storage size and see if things change, so sad
[13:41:46] or just split it further
[13:41:52] right, better
[13:42:12] without redoing all the crazy math and estimations: we know the bins were set for a 4x size range (max size = 4*min size)
[13:42:35] we could split each of the current bins into a pair of equal-sized bins with 2x size ranging
[13:43:09] it wouldn't be perfectly optimal, but it might be an interative improvement on present problems
[13:43:18] s/interative/iterative/
[13:45:33] +1
[13:45:50] oh, another thing I wasn't factoring in though
[13:46:06] with eqiad depooled and backend_warming on, they're getting the full global client miss blend
[13:46:13] that alone could be exacerbating mailbox lag there
[13:47:11] oh right
[13:47:21] so try disabling backend_warming?
[13:47:35] well I'm sure that would make the lag go away for now
[13:47:49] because they'd have virtually zero new storage allocations happening with eqiad depooled still
[13:48:03] yes
[13:48:23] but yeah, if we have no plan to repool eqiad before next week (we don't), may as well kill warming for now to avoid the alerts and possible fetchfails
[13:48:35] can turn it on for a day (or half day) before an intended repool to refill
[13:48:44] yup
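On the 13:42 proposal to split each 4x-size-range bin into two 2x-range bins: that amounts to inserting the geometric midpoint into every existing size range. A sketch with illustrative boundaries; only the 4x-vs-2x ranging itself comes from the discussion:

```python
# Sketch of the "split each 4x bin into two 2x bins" idea from 13:42 above.
# The concrete boundaries are illustrative, not the production config.
import math

MB = 1024 * 1024
old_edges = [1 * MB, 4 * MB, 16 * MB, 64 * MB, 256 * MB]  # 4x size ranges

def split_to_2x(edges):
    """Insert the geometric midpoint of every [lo, hi] range -> 2x ranges."""
    new_edges = []
    for lo, hi in zip(edges, edges[1:]):
        new_edges += [lo, int(math.sqrt(lo * hi))]  # sqrt(lo*hi) is the 2x point
    new_edges.append(edges[-1])
    return new_edges

print([e // MB for e in split_to_2x(old_edges)])
# -> [1, 2, 4, 8, 16, 32, 64, 128, 256]  (each bin now spans a 2x range)
```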
[13:49:27] ok to reboot cp1008 now? needs a reboot to pick up the SSBD-enabled microcode (package got installed there after the latest kernel reboot)
[13:50:29] I think so
[13:51:57] k, doing that now
[13:53:19] ema: I guess monday-ish, we can start tearing down cache_misc and all related bits
[13:53:40] maybe check for traffic first, I'm sure someone hardcoded some IP somewhere
[13:53:56] or caches dns results indefinitely heh
[13:54:53] moving the IP would be ideal, but the whole thing is messy at the LVS level
[13:55:17] and hopefully, not really necessary
[13:55:32] bblack: there's almost nothing hitting misc anymore with the exception of very few phab hits
[13:56:03] https://bit.ly/2MwMwJ5
[13:56:16] but yeah we can tear down the lvs layer of it, then spare-out the hosts for decom, then clean up puppet removing all the cache_misc bits there, etc
[13:56:58] phab might be browser tabs open a long time that are still trying to poll the notification thing
[13:57:09] yep
[13:57:49] mbox lag on 1088 is now recovering (at some pace) with warming off
[13:58:06] interesting that it can't insta-recover, I think that speaks to the expiry thread being effectively ratelimited in how fast it can work
[13:58:15] well not the notification thing, they're hitting / :)
[13:58:20] oh
[13:58:28] well, I guess they're in for a surprise!
[14:00:03] haha, they're spiders
[14:00:13] and check_http :)
[14:00:40] 'Sogou web spider/4.0' apparently
[14:01:17] I'll check again on Monday but I think we're good
[14:01:21] so my plans for the itwiki/sitemaps thing evolved a bit
[14:01:52] now the problem I'm down to is this: I've created a situation where a request will come in for a traditional cache_text domain, and we need VCL to rewrite the URI to a domain handled by the alternate_domains ex-cache_misc stuff
[14:02:15] so I'll have to find an appropriate hook that's before the alt-domains-switching hook that we tried to put before almost everything else lol
[14:02:41] what's the itwiki/sitemaps thing?
[14:03:25] it's complicated, you probably don't want to know, but: https://phabricator.wikimedia.org/T199252
[14:03:43] there was the EU copyright protest
[14:04:00] the wiki-level admins protested by editing site-level JS to block the site
[14:04:11] then when they reverted, they found that google was still caching it as all blocked
[14:04:41] then 40 rabbitholes, then "oh we need to be exporting standardized sitemaps URLs to crawlers like google to update them better, but MediaWiki in our installation doesn't have a sane way to do that"
[14:05:17] /o\
[14:05:39] so now we're at: we can generate sitemap xml files and stuff them somewhere in a directory hierarchy by wiki hostname, like "/foo/it.wikipedia.org/", and https://it.wikipedia.org/sitemap.xml has to magically map to produce those files contents without any redirecting
[14:06:22] I've made a sitemaps.wikimedia.org microsite to hold them, which is conveniently already a cache_misc-like backend for cache_text
[14:06:36] so yeah, now I just need to hook the rewrite before the alt-domains switching
[14:08:17] ah, I got confused. the CPU in cp1008 is simply too old, so wasn't covered by the Intel microcode release
[14:13:43] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (BBlack) Status update: * `dumps.wikimedia.org` really didn't work out well. The...
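A hypothetical illustration of the rewrite described at 14:01–14:06: a request for a wiki's /sitemap.xml is internally mapped onto the per-hostname directory on the sitemaps.wikimedia.org microsite, with no redirect. The real hook lives in VCL and the real path layout isn't spelled out above, so treat the path scheme and function below as placeholders:

```python
# Hypothetical illustration of the sitemap rewrite described above.
# The real VCL hook and the real directory layout may differ; the
# "/<wiki-host>/..." path scheme here is a placeholder.

SITEMAP_BACKEND_HOST = "sitemaps.wikimedia.org"

def rewrite_sitemap_request(host, path):
    """Map a wiki's sitemap request onto the microsite, without redirecting."""
    if path == "/sitemap.xml" or path.startswith("/sitemap."):
        # e.g. it.wikipedia.org + /sitemap.xml
        #   -> sitemaps.wikimedia.org + /it.wikipedia.org/sitemap.xml
        return SITEMAP_BACKEND_HOST, f"/{host}{path}"
    return host, path   # everything else is untouched

print(rewrite_sitemap_request("it.wikipedia.org", "/sitemap.xml"))
```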
[14:14:54] rebooting cp2005, one of the last 7 hosts requiring a kernel update
[14:16:52] moritzm: yeah, it's going away Soon. It's just not high enough priority while we have so many other concurrent changes and decoms going on. But I plan to re-purpose cp1099 to replace cp1008.
[14:18:27] I'm surprised ema didn't already throw his keyboard out the window sometime over the past few days with all the reimages and reboots and upgrades and hardware replacements and numa changes and...
[14:19:43] :)
[14:20:02] sure, thanks! so far coverage of microcode updates is quite good, only the really old systems (typically those with warranty expired around 2014) are not supported by the SSBD update
[14:47:04] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) @BBlack to confirm on the third bullet, the current itwiki map sho...
[14:49:39] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (BBlack) @Imarlier - Ok thanks! Before I push the rewrite buttons in https:/...
[15:12:14] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1088&var-datasource=eqiad%20prometheus%2Fops&from=now-24h&to=now&panelId=21&fullscreen
[15:12:22] ^ the mbox lag recovery is pretty interesting
[15:12:34] given how low-rate the misses through eqiad are anyways, and now we're not even storing any new ones
[15:12:47] that expiry thread is so awful
[15:14:52] bblack: could you check https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451607/ ?
[15:15:02] bblack: mainly the new iface naming on the A records
[15:15:21] it's not beautiful
[15:18:44] lol is that really the iface names?
[15:19:27] enp59s0f0 enp59s0f1d1 enp175s0f0 enp175s0f1d1
[15:19:39] that is so much better than eth0 eth1 eth2 eth3 :P
[15:20:16] I wonder why the second ports add a d1? and why 59 and 175 have basically-nothing to do with the physical slot layout?
[15:20:23] 'can you check if enp175s0f1d1 is blinking?'
[15:21:07] don't forget to download our 67MB xml file first that maps all the "physical" slot numbers in the name to the actual physical slot numbers on the back of the machine.
[15:21:13] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) Looks good to me: ``` imarlier@WMF2024 ~/dev/src/mediawiki-docker-...
[15:21:38] the whole point was supposed to be that the name made physical sense
[15:21:53] they never said to whom though
[15:23:30] haha
[15:23:30] bblack: yeah.. those are the actual names :_(
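For the interface-name puzzlement above: enp59s0f0 and friends are systemd "predictable" PCI path names, en + p&lt;PCI bus&gt; + s&lt;slot&gt; + f&lt;function&gt; (plus d&lt;dev_port&gt; for NICs exposing multiple ports on one PCI function), so 59 and 175 are PCI bus numbers (0x3b, 0xaf), not physical slot numbers. A small decoder sketch:

```python
# Decode systemd "predictable" PCI NIC names like the ones pasted above.
# Path-based format: en + p<bus> + s<slot> [+ f<function>] [+ d<dev_port>]
import re

NAME_RE = re.compile(r"^enp(?P<bus>\d+)s(?P<slot>\d+)(?:f(?P<func>\d+))?(?:d(?P<port>\d+))?$")

def decode(ifname):
    m = NAME_RE.match(ifname)
    if not m:
        return f"{ifname}: not a PCI path-style name"
    bus, slot = int(m["bus"]), int(m["slot"])
    func, port = int(m["func"] or 0), int(m["port"] or 0)
    return (f"{ifname}: PCI bus {bus} (0x{bus:02x}), "
            f"device {slot}, function {func}, dev_port {port}")

for name in ["enp59s0f0", "enp59s0f1d1", "enp175s0f0", "enp175s0f1d1"]:
    print(decode(name))
```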
[15:24:08] really they should hash whatever bus/slot/port/fiber and use it to index /usr/dict/words or something, so we can just call the interfaces memorable names like "fish"
[15:24:18] hahahahaha
[15:24:27] genius
[15:24:41] and the manufacturers should put tiny dot-matrix LED displays on the backs of the cards that display whatever word was picked by the OS
[15:25:05] bonus points for swear words
[15:25:11] that reminds me of https://what3words.com/
[15:26:07] bblack: I'm around if you want to do the authdns reboot/etc
[15:27:04] something's wrong
[15:40:11] bblack: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[15:41:00] 1089 had an interesting rampup in lag
[15:41:38] also n_lru_limited is at 3 (?) since a while
[15:42:26] what does n_lru_limited even mean?
[15:42:38] the lag seems more like an effect than a cause probably
[15:43:16] and the cp1089<->appservers connections flattened at 1k between 15:22 and 15:25
[15:43:21] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=6&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=cp1089&var-layer=backend&from=now-1h&to=now
[15:44:00] yeah
[15:44:05] appservers + api
[15:44:12] all spiked up to a plateau there
[15:44:34] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=6&fullscreen&from=1533913816534&to=1533915277140
[15:44:42] there's another spike right now
[15:45:28] this spike is 1085
[15:45:34] same basic thing, different server
[15:45:58] keep in mind this could be client-induced too
[15:46:15] if it's randomized pass-traffic hitting somewhere and stalling at the MW level and soaking up applayer conns
[15:47:58] check logstash hospital/slowlog stuff?
[15:49:31] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) Engineers (Wong Kee Heng & Kelvin Goh Keng Yew) from Unisys (sub-contracted by Dell for Pro support) will be onsite on Monday, August 13th between 1500 and 1700 Singapore lo...
[15:53:10] ema: varnishslowlog during the last burst shows a ton of:
[15:53:12] es.m.wikipedia.org/w/load.php?modules=skins.minerva.icons.images&image=language-switcher&format=original&lang=es&skin=minerva
[15:53:23] as opposed to just random $stuff
[15:55:22] it was back on 1089 during the third (much smaller) spike
[15:56:47] during the ~15:24 one, varnishslowlog had a ton of:
[15:56:48] pt.m.wikipedia.org/w/load.php?debug=false&lang=pt&modules=startup&only=scripts&skin=minerva&target=mobile
[15:56:57] there seems to be some kind of pattern here
[16:25:54] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi) p:Triage>High
[16:26:11] netops, Operations, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi)
[16:26:14] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[16:26:24] bblack, mark, https://phabricator.wikimedia.org/T201694
[16:40:01] XioNoX: asw2-a5-eqiad xe-0/0/18-28 (11 ports) are all on the to-be-decommed list of old CPs
[16:40:04] we could use those
[16:40:40] also the 6 ports xe-0/0/8-13 for lvs1007-12 could be unplugged
[16:40:44] (also asw2-a5)
[16:42:13] thx, will still have to run cross rack uplinks, at least 4 to 5 is not that long
[18:11:34] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (BBlack) Seems to be working now after the fixup above, will need to cleanup...
[18:38:23] bblack: a datapoint about your idea of using 10G VC links for (some) 1G members, asw2-a5 (old 10G switch) never used more than 8G on its uplinks: https://librenms.wikimedia.org/graphs/to=1533926100/id=6086/type=port_bits/from=1502390100/
[18:47:55] nice
[19:06:18] The more I think about it the more I'm wondering if we shouldn't get rid of VC/VCF totally
[19:22:30] bblack: some ideas: https://etherpad.wikimedia.org/p/77SMYoPXek
[21:21:09] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) Confirmed that https://it.wikipedia.org/sitemap.xml is returning....