[08:46:57] Traffic, Operations, Wikidata, wikiba.se, and 2 others: Create wikibase/wikiba.se-deploy repo - https://phabricator.wikimedia.org/T176841#3638715 (Ladsgroup)
[09:53:50] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3638883 (daniel) p:High>Unbreak! I see the fix got me...
[09:55:06] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3638887 (Ladsgroup) It takes around half an hour (to an hour)...
[09:56:03] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3638889 (Lucas_Werkmeister_WMDE) It seems to be live on some...
[09:57:42] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3638891 (Ladsgroup) Also I think some are behind varnish, e.g...
[10:35:44] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (elukey) All the appservers are now returning the goo...
[10:47:14] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3638995 (elukey) p:Unbreak!>Normal
[10:48:04] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (elukey) I believe this is only a matter of cleaning...
[13:19:52] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3639504 (Lucas_Werkmeister_WMDE) Is there any way to find out...
[14:10:11] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3639766 (daniel) @elukey, @Lucas_Werkmeister_WMDE, @Ladsgroup...
[14:22:29] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3639786 (elukey) >>! In T163922#3639504, @Lucas_Werkmeister_W...
[14:30:16] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3639831 (Lucas_Werkmeister_WMDE) If the TTL isn’t too long (I...
[14:52:52] bblack: ok to upgrade lvs1007 to pybal 1.14?
[14:56:02] and then in case of no explosions we could carry on with lvs400[24] (upload@ulsfo is still depooled)
[14:56:19] ema: I guess so, yeah. assuming it's even booted and functional at present
[14:56:28] it seems to be
[14:56:35] yeah it's currently up
[14:56:41] we should probably repool upload@ulsfo too heh
[14:57:27] but we should probably restart them first at least
[14:57:33] agreed
[14:57:37] upgrading lvs1007 meanwhile
[14:57:54] I'm ok with going with the theory that this was somehow induced by some local power/physical issue at ulsfo, till it happens again
[14:58:32] (although I think it's still possible that heavy traffic + NUMA tuning is a catalyst for that particular ethernet oops, but we'll just have to see in the absence of other mitigating factors)
[15:01:02] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3639940 (akosiaris) >>! In T163922#3639831, @Lucas_Werkmeiste...
[15:10:50] sad_trombone.wav, as go.dog would say: https://gerrit.wikimedia.org/r/#/c/381003/
[15:11:28] lol
[15:13:24] Traffic, Operations, Wikidata, wikiba.se, and 2 others: Create wikibase/wikiba.se-deploy repo - https://phabricator.wikimedia.org/T176841#3639955 (Dzahn) New Gerrit repos (projects) might have to be requested on wiki instead, afaict. https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
[15:27:00] Traffic, Operations, ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3639998 (BBlack) @RobH any updates here on diags?
[15:27:45] Traffic, Operations, ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3640003 (RobH) I stupidly forgot this machine with the rest of the repairs I was doing! I'll go back onsite to work on this soon!
[15:36:50] bblack: ok, so pybal 1.14.0 seems to be doing its thing. Tested extensively on pybal-test200[123] including the BGP part. Should we update lvs400[24] before repooling?
[15:40:20] oh, right, I wasn't really reading properly earlier, I assumed you were saying lvs400[34]
[15:41:06] I guess lvs400[34] doesn't really buy us a lot of testing over lvs1007 or pybal-test?
[15:42:11] not very much, no
[15:42:47] well, I guess lvs400[34] for a period would give us a chance to observe some dumb failure that icinga picks up on a live backup
[15:42:53] but still
[15:43:34] ema: ok, +1
[15:48:00] bblack: nice, I'll start with lvs4004
[15:51:21] looking good, proceeding with lvs4002
[15:53:21] bblack: looks good, all upload@ulsfo hosts are individually depooled except for cp4021. OK to pool one of the others?
[15:54:18] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3640078 (Lucas_Werkmeister_WMDE) Okay, in that case we can cl...
[15:55:52] ema: should pool them all (2356), before doing DNS
[15:56:09] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3640079 (Ladsgroup) Nope :D We need this for all clients not...
[15:58:57] cp4022 repooled, all good on the LVSs
[15:59:23] proceeding with 3, 5, and 6
[16:01:48] ok, only pybal logs are about cp4024 which is failing the checks (as we know)
[16:02:57] how can I check the router BGP status?
[16:03:21] I'm a quagga master by now, but still
[16:05:07] bblack@cr1-ulsfo> show bgp neighbor 10.128.0.12
[16:05:09] yeah `show route` is a bit too verbose :)
[16:05:13] ^ where that IP is lvs4002's
[16:05:52] also:
[16:05:53] bblack@cr1-ulsfo> show route 198.35.26.112
[16:05:53] quagga emulates cisco
[16:06:00] bblack@cr1-ulsfo> show route 2620:0:863:ed1a::2:b
[16:06:05] show bgp summary
[16:06:10] will show you the bgp session status
[16:06:18] Wikimedia-Apache-configuration, Wikimedia-Site-requests: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3640108 (Lucas_Werkmeister_WMDE)
[16:06:59] show bgp summary | match 64600
[16:07:21] ah nice
[16:08:05] is "validation-state: unverified" normal in show route's output?
[16:08:30] and same on cr2-ulsfo of course (lvs400[12] go to cr1, 34 to cr2)
[16:08:50] we have a feature ticket somewhere to make pybal talk to multiple peers, so both routers talk to all pybals
[16:11:40] XioNoX: watch out man! Your job is at risk, I'm proficiently running show commands
[16:13:04] lol
[16:13:33] ema: don't tell people that's 90% of the job
[16:13:59] ema: "validation-state: unverified" yeah, that's normal, you can ignore
[16:15:31] XioNoX: cool. This morning I've spent some time wrestling with quagga.conf. The final gotcha is that if you want to advertise v6 routes over v4 peering you need to specify the v4 IP of the peer under address-family: ipv6
[16:15:50] good catch!
[16:15:54] see https://wikitech.wikimedia.org/wiki/PyBal#Testing
[16:20:35] uh
[16:20:48] cr2-ulsfo seems to have MED info about 198.35.26.112/32?
[16:21:03] [BGP/170] 00:29:07, MED 10, localpref 100
[16:21:03] AS path: 64600 I, validation-state: unverified
[16:21:03] > to 10.128.0.14 via ae1.1211
[16:21:20] same for 2620:0:863:ed1a::2:b
[16:21:39] that's because they're the backup LVSes, and the MED is being set from the router config instead of sent by pybal, I think
[16:21:47] ah!
[16:22:14] so before flipping bgp-med = 42 in pybal.conf we need to update the backup router's config?
[16:22:22] `set policy-options policy-statement LVS_import term secondary then metric add 10`
[16:23:26] ema: this pybal version has med, but not per-route med, right?
[16:23:45] only global I think
[16:23:48] right
[16:23:59] I don't think there's any requirement to mess with the router config immediately
[16:24:13] but we only want to add our own pybal-global med to the config on the backups
[16:24:38] oh then I have to amend https://gerrit.wikimedia.org/r/#/c/380516/
[16:25:11] oh
[16:25:14] you can just set it to 0
[16:25:19] well, you can add different numbers to both, that's fine too I guess
[16:25:28] yeah
[16:25:49] hieradata/role/common/lvs/balancer.yaml: profile::pybal::primary: true ?
[16:25:54] (seems to set all as primary?)
[16:26:08] oh the hieradata regex, I see it now
[16:26:10] yeah and then regex.yaml for the others
[16:26:28] yeah later that will get refactored in funnier ways anyways, whatever works for the present day
[16:27:34] (eventually, we'll lose the concept of whole pybal instances being pri/sec. instead certain routes (e.g.
upload-lb) will be primary in lvsX and secondary in lvsY via per-route MED, and all of the live LVSes will be the MED-derived primary for some address)
[16:28:13] (or could be anyways, we could and might choose to still have one extra LVS which is not primary for anything, to make changes/upgrades simpler)
[16:28:27] meanwhile: tested BGP failover by stopping pybal on lvs4002, works as expected
[16:29:09] so to recap the MED config patch: https://gerrit.wikimedia.org/r/#/c/380516/3/modules/profile/manifests/pybal.pp <- here I should set med to 0 for the primary, right?
[16:29:38] ema: yes, I think so.
[16:30:07] maybe something smaller than 100 for secondary too? I don't know (but maybe others do) how a larger value might interact with some other values like the "localpref 100" over on the routers
[16:30:19] it doesn't
[16:30:23] oh ok
[16:30:35] it's purely to find a winner when all else is equal
[16:30:45] med = Multi Exit Discriminator
[16:31:10] to indicate a preference when you have multiple otherwise same routes
[16:31:35] whereas localpref happens before you get to that stage?
[16:31:53] * ema likes to live dangerously and edited the patch through gerrit's UI
[16:32:17] bblack: yes
[16:32:18] bblack: https://www.juniper.net/documentation/en_US/junos12.3/topics/reference/general/routing-ptotocols-address-representation.html
[16:32:46] old junos since I had it bookmarked, but probably hasn't changed much :)
[16:33:10] bblack: yes
[16:33:17] so simple! :)
[16:33:46] some transits use the IGP cost for MED
[16:33:46] IIRC, NTT does that
[16:33:51] so at the very end, all else equal, it will just pick the one with the router with the lowest ip ;)
[16:34:05] so in theory, we peer with NTT in Ashburn, Chicago, Dallas
[16:34:22] but yeah, this will only compare to other pybal routes for the same lvs ips
[16:34:42] ok cr[12]-ulsfo and lvs400[24] look fine to me with the new pybal! Time to go now o/
[16:34:44] and NTT would send us their cost to reach e.g.
a customer in Houston from each of those sites
[16:34:55] ema: grafana dashboard up yet? ;p
[16:35:03] * mark impatient
[17:13:34] Wikimedia-Apache-configuration, Wikimedia-Site-requests: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3640374 (daniel) Actually, as per the RFC, this is for all Wikimedia wikis. It's independent of Wikibase...
[19:47:00] Traffic, Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3640874 (Dzahn)
[19:48:08] Traffic, Operations: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3640894 (Dzahn) p:Triage>Low
[23:50:57] Traffic, Operations, ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3641515 (RobH) Sometimes entering the serial console registers a keystroke, so I have this running via a screen session on iron. That way reconnecting won't input a keystroke and cancel testing. I'll che...
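
Note on the MED discussion (16:29 to 16:34): the ordering described there (localpref is compared first, MED only breaks ties between otherwise equal routes, and the lowest peer IP decides at the very end) can be illustrated with a rough Python sketch. This is a deliberately simplified model, not the actual Junos or pybal logic: real BGP best-path selection has more steps, and the `Route` class, the `best_route` helper, and the MED 0 value for 10.128.0.12 are hypothetical illustrations. Only the MED 10 / localpref 100 route via 10.128.0.14 is taken from the log.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Route:
    """Hypothetical, heavily reduced view of one BGP route to an LVS service IP."""
    next_hop: str
    local_pref: int = 100  # higher is preferred, compared first
    as_path_len: int = 1   # shorter is preferred
    med: int = 0           # lower is preferred, only a tie-breaker


def best_route(routes):
    """Pick a winner using a simplified subset of the BGP decision process.

    Order used here (an assumption that mirrors the chat, not the full algorithm):
    highest localpref, then shortest AS path, then lowest MED,
    and finally the numerically lowest peer/next-hop IP.
    """
    return min(
        routes,
        key=lambda r: (-r.local_pref, r.as_path_len, r.med, ip_address(r.next_hop)),
    )


# Route seen in the log: MED 10, localpref 100, via 10.128.0.14 (backup LVS).
backup = Route("10.128.0.14", med=10)
# Assumed primary route with MED 0 via 10.128.0.12 (illustrative value).
primary = Route("10.128.0.12", med=0)

winner = best_route([backup, primary])
print(winner.next_hop)
```

With equal localpref and AS path length the lower MED wins, so the size of the secondary's MED does not matter as long as it is larger than the primary's; a higher localpref on any route would override MED entirely.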