[04:01:10] so, upload@ulsfo didn't even make it 24h [04:01:33] no bnx2x oops this time, but loadavg went up, pybal was seeing intermittent monitor fails and depooling them semi-randomly [04:01:42] eventually the pattern got bad enough to alert and page [04:02:11] it may be that the transmit timeout we saw before (but not this time) was just fallout of excessive load, but it was caught earlier this time? [04:03:07] ipvsadm at the time I looked was showing 4/5 pooled with ~116K active conns per node [04:03:14] which seems way too high [04:03:38] but probably lots are getting double-counted as random de/re-pools happen and clients reconnect and leave dead tcps behind, etc [04:05:43] the ratio of established:time_wait in tcp stats for those upload@ulsfo seems unusual compared to esams, even before problems became apparent [04:06:29] we may be running into some numerical limit here, where 5 nodes in ulsfo just can't cope with its peak conns, e.g. due to timewait on the nginx<->varnish sockets or some open fds limit at some layer, etc? [04:06:58] (in which case the 6 nodes we've always run with there were already close to the limit, just hadn't quite fallen over?) [04:07:33] the more-obvious theory is that the numa stuff is sending things off the rails. it's the easiest thing to try: turn it back off in hiera and pool it all back up and see what happens. [04:07:42] but for now it's late here and I'm just leaving ulsfo dns-depooled [04:10:14] (turning off numa isolation requires a reboot of the boxes after applying the puppet changes, numa_networking is set for them in hieradata/regex.yaml) [11:12:20] 10Traffic, 10Operations, 10Pybal, 10fundraising-tech-ops, 10Patch-For-Review: pybal vs firewall failover - BGP session down - https://phabricator.wikimedia.org/T173028#3642334 (10ema) 05Open>03Resolved a:03ema Fixed in pybal 1.14.0. 
[12:32:08] disabled numa networking on all boxes except for cp4021 [12:32:27] cp402[2-6] reboot in progress [12:40:58] done [12:41:07] repooling ulsfo [12:48:18] so cp4021 seems to be receiving less traffic than the other 4 [12:48:40] which is interesting, being the only node with numa_networking: isolate [12:49:06] over the past 10s, the average traffic received was ~18M [12:49:17] vs. ~35M of the other nodes [12:50:01] the amount of traffic sent is instead comparable [12:52:12] as dns propagation continues, all hosts are receiving more traffic, but still cp4021 less than others [12:54:48] ~60M vs. ~115M now [12:55:27] oh wait [12:55:42] the other nodes have all been rebooted, hence more cache misses of course [12:56:14] (hence more bytes received) [13:03:33] meanwhile, pybal@ulsfo is doing good [13:16:01] 10netops, 10Operations, 10fundraising-tech-ops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3642627 (10Jgreen) a:03Jgreen [13:22:10] ema: thanks :) [13:22:37] so, if the same thing plays out for 4021 again, we'd expect it will flap in pybal state eventually (when ulsfo's back at high load again) [13:22:45] with just it flapping, might not even notice unless tailing pybal.log [13:23:06] which I'm doing anyways to keep an eye on the new pybal! [13:23:10] :) [13:25:48] ok I'm looking at puppet logs now, looks like removed 27+28 too [13:26:02] I'm not even sure what that does there without reboots [13:26:32] (I know at least some parts of what gets un-done on removing "isolate" require reboot, but probably not all) [13:26:44] oh! [13:26:56] should we reboot them too? [13:28:15] well, they're pooled in text. rebooting them is one thing we could do. [13:28:22] I'm still peeking at some related things, though [13:28:59] yeah, so.... 
not all of the "isolate" stuff actually gets un-done in this case anyways [13:29:14] I think sysctl + sysfs undo themselves by next reboot since they manage dirs [13:29:20] but the grub bootparam doesn't work that way [13:29:28] let me fix that up first [13:29:33] ok! [13:31:18] yeah I think the bootparam for isolcpus is the only faulty bit here [13:31:42] I mean technically, we could try to do some runtime ensure=>absent sort of stuff in other areas, but it's pointless as isolcpus can't be changed without a reboot anyways [13:44:03] for some reason the grub::bootparam ensure=>absent isn't working yet, trying to dig through why [13:56:03] Debug: Augeas[grub2 isolcpus](provider=augeas): sending command 'rm' with params ["/files/etc/default/grub/GRUB_CMDLINE_LINUX/value[. = \"isolcpus\"]"] [13:56:06] Debug: Augeas[grub2 isolcpus](provider=augeas): Skipping because no files were changed [13:56:38] yet /etc/default/grub still contains (before and after): [13:56:40] GRUB_CMDLINE_LINUX="console=ttyS1,115200n8 elevator=deadline tcpmhash_entries=65536 isolcpus=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46" [13:58:49] oh, now I understand [14:00:03] so grub::bootparam operates in two different modes which use different augeas selection methods to match. the "k=v" mode for things like isolcpus and the "just_k" mode for single word boolean options like "quiet" or whatever [14:00:21] it picks a mode of operation depending on whether the $value is set [14:00:40] so if you have a setter that uses $value, and a remover that just has ensure=>absent, they don't really match up [14:00:54] the lack of a $value in the ensure=>absent invocation causes the wrong kind of augeas execution [14:01:08] oh so you need the same $value in ensure=>absent too? [14:01:20] well, you need some $value, doesn't have to be a sane $value [14:01:27] just has to pass the $value != undef check [14:01:45] ah! 
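The mode mismatch described above can be modelled in a few lines. This is only a Python sketch of the matching behaviour as described in the chat, not the actual Puppet define or its Augeas expressions; the function name `remove_bootparam` and the shortened command line are made up for illustration.

```python
def remove_bootparam(cmdline, key, value=None):
    """Return cmdline with `key` removed, mimicking the two match modes.

    Hypothetical model: value=None selects the "just_k" mode, anything
    else selects the "k=v" mode, as described in the chat.
    """
    tokens = cmdline.split()
    if value is None:
        # "just_k" mode: only a bare token exactly equal to key matches,
        # so "isolcpus=0,2,4" is NOT matched by key "isolcpus".
        kept = [t for t in tokens if t != key]
    else:
        # "k=v" mode: any token of the form key=<anything> matches;
        # the value passed in only selects the mode, it is not compared.
        kept = [t for t in tokens if not t.startswith(key + "=")]
    return " ".join(kept)


# Shortened stand-in for the real GRUB_CMDLINE_LINUX shown in the log.
CMDLINE = "console=ttyS1,115200n8 elevator=deadline isolcpus=0,2,4"

# ensure=>absent without a value: wrong mode, nothing is removed.
unchanged = remove_bootparam(CMDLINE, "isolcpus")

# ensure=>absent with any non-undef value: correct mode, key is removed.
fixed = remove_bootparam(CMDLINE, "isolcpus", value="dummy")
```

In this model the first call leaves the line untouched, which corresponds to the "Skipping because no files were changed" debug message, while the second call removes the `isolcpus=...` token.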
[14:02:14] that seems like some kind of interface issue in grub::bootparam, but there's probably not an easy solution except to make the mode of operation more-explicit. [14:02:30] it can't guess/infer which way to operate during ensure=>absent [14:20:19] so as it turns out, all current callers of grub::bootparam are doing k=v style options. the only single-word booleans we set are in grub::defaults and they don't actually use grub::bootparam to set them (e.g. killing "quiet" and "splash") [14:20:55] so I had this whole elaborate change going that split grub::bootparam into two separate defines for the different use-cases, but seems silly [14:21:18] can just make any (future, nonexistent today) single-word callers be explicit about why the value is lacking with a new defaulted param [14:31:00] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3643004 (10Jgreen) [14:49:00] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3643083 (10ema) p:05Triage>03Normal [14:49:22] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3643085 (10ema) p:05Triage>03Normal [14:49:48] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Reading-Infrastructure-Team-Backlog, and 2 others: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3643087 (10ema) p:05Triage>03Normal [14:50:01] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3643088 (10ema) p:05Triage>03Normal [14:50:26] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations: Like nan.wikipedia.org, redirect other nan.*.org to the proper 
zh-min-nan.*.org domains - https://phabricator.wikimedia.org/T173966#3643092 (10ema) p:05Triage>03Normal [14:50:54] 10Traffic, 10Operations, 10hardware-requests, 10ops-ulsfo: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3643093 (10ema) p:05Triage>03Normal [14:51:42] 10Traffic, 10Operations, 10Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3643097 (10ema) p:05Triage>03Normal [14:52:07] 10Traffic, 10Analytics, 10Operations: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3643098 (10ema) p:05Triage>03Normal [14:52:30] 10Traffic, 10Cloud-Services, 10Operations, 10Puppet, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3643099 (10ema) p:05Triage>03Normal [14:52:45] 10Traffic, 10Cloud-Services, 10Operations, 10Puppet, 10Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#3643100 (10ema) p:05Triage>03Normal [14:53:18] 10Traffic, 10DNS, 10Operations, 10User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3643101 (10ema) p:05Triage>03Normal [14:59:23] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-eqiad: connect second interface for each frack to opposite switch for each eqiad host - https://phabricator.wikimedia.org/T176975#3643120 (10Jgreen) [15:42:27] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3643328 (10RobH) This did fail hardware testing. Service Tag : 3NC7KH2 Error Code : 2000-0251 Validation : 127076 I'll open a dispatch for whatever this error code is. [15:54:02] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3643431 (10BBlack) Nice. 
Dell.com has a page on these here: http://www.dell.com/support/manuals/us/en/19/poweredge-vrtx/servers_tsg/psaepsa-diagnostics-error-codes?guid=guid-9afeed67-a47c-4afd-83d8-04301eb... [15:58:46] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3643480 (10Jgreen) [16:00:12] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3643004 (10Jgreen) I did frdb1003 first: /admin1-> racadm setniccfg -s 10.64.40.199 255.255.255.192 10.64.40.193 Static IP configuration enabled and modifie... [16:10:23] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3643517 (10BBlack) Oh sorry, my comment was redundant to your edit :) [16:17:36] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3643556 (10ayounsi) [16:25:56] ema: so, isolcpus issue now fixed via grub::bootparam changes. I'm going to step through some depooled reboots of 4022-8. [16:26:48] ok [16:31:52] * ema stares at pybal logs [17:02:11] * ema stopped holding his breath :) [17:02:46] bblack: it looks like max_connections is not a mandatory setting [17:02:56] the docs aren't clear on what happens if you don't specify it though [17:03:08] does it mean no limit at all or is there an obscure default? [17:05:34] who knows [17:05:40] I've tried to see if I could get via varnishadm the backend settings currently configured, but nope [17:10:09] ok I'm out of here! see you tomorrow :) [17:11:09] cya! [18:16:06] https://mnot.github.io/I-D/variants/ [18:22:19] interesting! 
[18:22:51] (but still, not quite ideal for the session cookie case in particular) [18:24:26] but really, as hard as the "session cookies but not other cookies" problem is to fully solve across our stack, we're probably better off putting the effort into splitting up the un-customized content into separate fetches that are cacheable [18:25:14] the snippet with the user-specific into lives at /w/foo.php?getmystuff and returns no-cache headers, and the article text and such lives at a different URL that's cacheable for all regardless of login. [18:25:23] s/into/info/ [18:25:53] (or in that sort of case, you could even have short cacheability of user-specific stuff with a full Vary:Cookie and not care too much) [18:25:59] hahahaha [18:26:14] parsed article content depends on user preferences [18:26:25] yeah it shouldn't :( [18:26:29] welcome to MediaWiki :( [18:27:42] but Variants is still handy [18:27:50] I could see using it with a custom header for the mobile use-case too [18:28:38] (but we can also just have our own custom stuff for that, regardless of RFCs) [18:28:53] I really want to have time to push for moving past the m-dot mess [18:29:22] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3644090 (10Jgreen) Here's what I ended up doing for the HP boxes: 1) install hponcfg from HP deb repo http://downloads.linux.hpe.com/SDR/repo/mcp/ 2) hponcf... [18:29:30] we could still do the UA detection for mobile in varnish even, and the honoring of a cookie that forces desktop-vs-mobile override. 
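A rough sketch of the selection order mentioned above (cookie override first, then UA detection). It is written in Python purely for illustration; a real version would live in VCL, and the cookie name `site-variant` and the user-agent pattern are hypothetical placeholders, not anything taken from the existing configuration.

```python
import re

# Hypothetical, deliberately crude mobile detection pattern.
MOBILE_UA = re.compile(r"Mobi|Android|iPhone", re.IGNORECASE)


def choose_variant(user_agent, cookies):
    """Pick "mobile" or "desktop" for a request.

    cookies is a dict of cookie name -> value. An explicit preference
    cookie (hypothetical name "site-variant") wins over UA detection.
    """
    override = cookies.get("site-variant")
    if override in ("desktop", "mobile"):
        return override
    return "mobile" if MOBILE_UA.search(user_agent or "") else "desktop"


# A phone that explicitly asked for desktop keeps getting desktop.
forced = choose_variant("Mozilla/5.0 (iPhone) Mobile Safari", {"site-variant": "desktop"})

# No cookie: fall back to UA detection.
detected = choose_variant("Mozilla/5.0 (Linux; Android 7.0) Mobile", {})
plain = choose_variant("Mozilla/5.0 (X11; Linux x86_64)", {})
```

The returned value is what the cache and the application would both need to agree on as the variant key for the same URL.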
[18:29:38] we're much better prepared to do something like this than a couple years ago [18:29:44] but have it all on en.wiki and have varnish+MW honor some shared Vary header to tell the difference [18:30:15] (and then of course support m.dot as a legacy entrypoint for a long time, first directly and then later as a redirect) [18:30:17] unfortunately, as long as the sentiment is "let's dumbify desktop to unify it with mobile" it's not going anywhere [18:31:35] well, you could argue the case that "desktop-vs-mobile" is the wrong split anyways, and call it more like "Simple-vs-Advanced". Simple is simplistic and works well for newbies and anon readers even on desktop. Advanced-mode brings in all the bells and whistles, supports editing and gadgets and such better, might not be as mobile-friendly (but is usable there) [18:32:11] and just default to simple for everyone who doesn't click a button to set a preference cookie for Advanced [18:32:24] (or has an account with such a setting. maybe logged-in defaults to advanced even) [18:32:36] unfortunately, "simple" cuts off so much stuff that users are highly unlikely to go past newbies [18:33:34] maybe the whole model of trying to upconvert newbie readers into editors is busted? (I have no idea, I'm just throwing out wild random thoughts) [18:33:56] but perhaps most readers will never be the type to be a good editor. at most a minor-correction-suggester for an editor to look at. [18:34:38] maybe you capture the 1% of readers who are potential good editors some other way that targets them (how?), moving them over to the better UI in the process. 
[18:35:27] if we captured 1% of readers I would be on the moon because that's so much better than things right now [18:36:11] :) [18:37:11] honestly, Timeless is so much better at being responsive without cutting functionality [18:37:40] https://test.wikipedia.org/wiki/Main_Page?useskin=timeless [18:39:19] in any case, in the really-actionable world of today: we could have the same basic mobile-v-desktop split we have now, without using the m-dot subdomains (just cache variants on the same URL, negotiated privately between MW+Varnish, with Varnish doing UA detecting and cookie-preference honoring) [18:40:12] Timeless does look really nice :) [18:46:44] 10netops, 10Operations, 10fundraising-tech-ops: reconfigure networking on frack-eqiad management interfaces - https://phabricator.wikimedia.org/T176972#3644180 (10Jgreen) 05Open>03Resolved This is done, passwords are updated too. [19:19:30] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-eqiad: connect second interface for each frack to opposite switch for each eqiad host - https://phabricator.wikimedia.org/T176975#3644371 (10Jgreen) a:05Jgreen>03None [20:19:53] bblack: a REST API user reports that he recently started to get 429s at relatively moderate request rates [20:20:17] looking at our metrics, it seems that RB itself has not sent 429s since the 23rd [20:20:29] were there any changes in how Varnish rate limits the REST API? [20:22:09] example HTML response looks like it was likely emitted by Varnish: https://pastebin.com/ZwSj2DuC [20:22:49] RB would send a JSON error [20:24:31] gwicke: are you sure they aren't actually running afoul of the rates? 
[20:24:38] that IP belongs to Apple [20:24:38] I don't see anything obviously related in SAL, but the sudden disappearance of RB 429s on the 23rd is suspicious [20:25:45] he said that he stayed below 200/s [20:25:51] 100-140/s [20:27:23] the relevant ratelimit code hasn't changed since ~June in git blame [20:27:24] RB always used to send the occasional 429 (avg 3/s over the last 90 days), but as I said that completely stopped on the 23rd [20:27:48] also, 100-140/s would be over the long-term limit, which is 100 [20:27:52] the code comment says: [20:27:54] * RB and MW API, Wikidata: 1000/10s (100/s long term, with 1000 burst) [20:28:27] hmm, I thought we had set this to 200/s [20:28:36] to match the long-standing documentation [20:28:36] but keep in mind that's a limit on miss+pass traffic, it doesn't count hits [20:28:55] so a shift in how cacheable the user's requests are could make a big difference, too [20:28:56] yeah [20:29:37] we don't have many clients sailing that close to the limit for extended periods [20:34:39] for now, they will throttle further to less than 100/s [20:35:28] do you see any issue with bumping the Varnish limit a bit to perhaps 150/s? [20:36:10] that would make it really unlikely that anybody doing less than 200/s would hit the Varnish limit [20:52:03] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3644703 (10ayounsi) > To be escalated to JTAC JTAC noticed that the control link went down as the same time as the data/fabric link because of missed heartbeats, which shouldn't happen a... [21:24:22] 10netops, 10Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3644811 (10ayounsi) It seems like Junos' `local-as` feature isn't working as expected. Global AS of 43821, remote side with `peer-as 43821`, and the local side with: `local-as 14907` -> BGP session doesn't establish... 
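For the "1000/10s (100/s long term, with 1000 burst)" limit quoted at [20:27:54], a generic token bucket shows why a client at 100-140/s eventually gets 429s while one below 100/s never does. This is a minimal Python sketch of a textbook token bucket, not necessarily how the Varnish limiter is implemented; `TokenBucket` and `count_rejected` are illustrative names, and every request is assumed to be a miss or pass (hits do not count against the limit).

```python
class TokenBucket:
    """Textbook token bucket: refill at `rate` tokens/s, hold at most `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the previous request.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # this request would get a 429


def count_rejected(request_rate, seconds, rate=100, burst=1000):
    """Send evenly spaced requests and count how many are rejected."""
    bucket = TokenBucket(rate, burst)
    rejected = 0
    for i in range(request_rate * seconds):
        if not bucket.allow(i / request_rate):
            rejected += 1
    return rejected


under_limit = count_rejected(90, 60)    # steady 90/s for one minute
over_limit = count_rejected(140, 60)    # steady 140/s for one minute
```

At 140/s the bucket drains by roughly 40 tokens per second, so the 1000-token burst is used up after about 25 seconds and rejections start; a change in how many of the client's requests are cache hits would shift that point.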
[21:28:05] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3644816 (10Jgreen) >>! In T174218#3644703, @ayounsi wrote: >> To be escalated to JTAC > JTAC noticed that the control link went down as the same time as the data/fabric link because of m... [21:35:36] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3644832 (10RobH) a:03BBlack Followup testing passed without further errors: All tests passed. @bblack: Perhaps it was a transient error solved by the bios firmware updates? It went from 2.4.2 to 2.5.4 i... [21:41:49] https://blog.filippo.io/we-need-to-talk-about-session-tickets/ [21:53:12] 10netops, 10Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3644900 (10ayounsi)