[02:24:32] netops, Operations, ops-eqsin: return faulty MX104 to Juniper - https://phabricator.wikimedia.org/T189060#4034060 (Papaul)
[02:38:46] netops, Operations, ops-eqsin: return faulty MX104 to Juniper - https://phabricator.wikimedia.org/T189060#4034064 (Papaul) Open>Resolved {F14673832}
[03:28:05] Traffic, Incident-20150423-Commons, Operations, RESTBase, and 4 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#4034097 (Krinkle)
[05:16:06] Traffic, DNS, Mail, Operations: SPF for Greenhouse - https://phabricator.wikimedia.org/T189065#4034192 (tstarling) How about I change the task title so that it can stay open? Because the real problem here is that outbound email is broken, I don't care whether SPF or a subdomain is used to fix it....
[05:16:29] Traffic, DNS, Mail, Operations: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4034194 (tstarling)
[08:59:15] https://phabricator.wikimedia.org/P6821 per-service-MED crash on pybal 1.15.1 @ pybal-test2001
[09:40:18] not a particularly lucky release :)
[09:43:20] *sigh*
[09:43:39] newbie release :(
[09:52:41] https://github.com/wikimedia/PyBal/blob/master/pybal/ipvs.py#L198
[09:52:50] that is the offending line
[09:53:01] the obvious fix is replace .get per .getint()
[09:53:27] the issue is that .getint doesn't allow None as a default value
[09:53:51] https://github.com/wikimedia/PyBal/blob/master/pybal/util.py#L26-L33
[09:54:46] this could be addressed in several ways..
[09:55:09] catch the KeyError and set med to None
[09:56:09] cast med to an int in bgpfailover.associateService (https://github.com/wikimedia/PyBal/blob/master/pybal/bgpfailover.py#L141)...
[09:58:14] or even get in the LVSService __init__() the global config and set the default value as the global BGP MED value
[10:04:02] catching the KeyError seems reasonable and explicit to me
[10:04:55] yep
[10:05:39] the one thing I don't like is that our BGP attributes classes are lazy regarding value validation
[10:05:58] this is pretty similar to the crash that we had on lvs5003 on Tuesday
[10:06:29] BGP Update generation crashes because self.value on some RandomBGPAttribute class is not what it should be
[10:39:54] as discussed: https://gerrit.wikimedia.org/r/#/c/417228/
[10:47:20] and now per-service-MED works
[10:47:20] B>* 208.80.154.224/32 [20/10] via 10.192.16.139, eth0, 00:01:30
[10:47:20] B>* 208.80.154.254/32 [20/50] via 10.192.16.139, eth0, 00:01:30
[10:47:23] \o/
[11:07:20] quick question, is 0 a valid value for a MED?
[11:08:22] yes
[11:09:50] (if you're curious about MED, this juniper doc pretty good: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/bgp-med.html )
[11:10:29] * XioNoX goes back to vacations
[11:11:34] lol
[11:12:05] thx, I was curious if maybe 0 (or the famous -1) could have been used instead of None for the 'no med' state
[11:17:38] that would require further refactoring
[11:21:31] sigh.. I was trying to validate multi BGP peering using pybal-test2002 with IPv6 and IPv6 but I hit another issue
[11:21:34] File "/usr/lib/python2.7/dist-packages/pybal/bgp/bgp.py", line 1478, in connectionMade
[11:21:37] self.factory.bgpId = IPv4IP(self.transport.getHost().host).ipToInt() # FIXME: IPv6
[11:21:40] :_(
[11:22:31] pybal_bgp_session_established{local_asn="64496",peer="2620:0:860:102:10:192:16:140"} 0.0
[11:22:34] pybal_bgp_session_established{local_asn="64496",peer="10.192.16.140"} 1.0
[11:22:39] looks good, but not enough :)
[11:32:08] vgutierrez: is the KeyError case also covered by unit tests?
[11:32:59] ema: sure
[11:33:33] https://gerrit.wikimedia.org/r/c/417228/2/pybal/test/test_ipvs.py#139
[11:33:34] that one :D
[11:33:38] very nice
[11:34:49] I'll validate bgp multi peering before releasing 1.15.2
[11:35:07] I'd rather not see 1.15.3 this week O:)
[11:36:56] yeah :)
[11:49:18] pybal_bgp_session_established{local_asn="64496",peer="10.192.16.141"} 1.0
[11:49:21] pybal_bgp_session_established{local_asn="64496",peer="10.192.16.140"} 1.0
[11:49:28] gobgpd is increadibily easy to setup
[11:49:45] vgutierrez@pybal-test2003:~$ bin/gobgp neighbor 10.192.16.139 adj-in
[11:49:46] ID Network Next Hop AS_PATH Age Attrs
[11:49:48] 0 208.80.154.224/32 10.192.16.139 64496 00:00:37 [{Origin: i} {Med: 10}]
[11:49:51] 0 208.80.154.254/32 10.192.16.139 64496 00:00:37 [{Origin: i} {Med: 50}]
[11:50:13] and pybal-test2002
[11:50:14] B>* 208.80.154.224/32 [20/10] via 10.192.16.139, eth0, 00:01:04
[11:50:14] B>* 208.80.154.254/32 [20/50] via 10.192.16.139, eth0, 00:01:04
[11:50:25] looks like BGP multi peering works as expected <3
[12:36:31] yes, but ipv6 peerings are not supported
[12:38:32] fixing that is not really urgent, as all our infrastructure has ipv4 too and (unlike routers) pybal does multi-protocol on a single session
[12:45:38] yesterday I had two hours for tech work btw and I started splitting out attributes/exceptions from bgp.py, you may have already seen
[12:48:52] that's nice
[13:05:06] i feel like a broken record
[13:05:13] but when i wrote that code, twisted didn't support ipv6
[13:05:18] so I didn't even bother and left a fixme!
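
The "catch the KeyError and set med to None" option from the 09:52-10:04 discussion can be sketched roughly as below. This is a minimal illustration, not the code from the Gerrit change: `ConfigDict` here is a toy stand-in written for this example, and the `bgp-med` key name and the `read_med` helper are hypothetical.

```python
class ConfigDict(dict):
    """Toy stand-in for a config dict with a typed getter (assumption,
    written for this sketch; not copied from pybal/util.py)."""

    def getint(self, key, default=None):
        try:
            return int(self[key])
        except KeyError:
            if default is not None:
                return default
            # None means "no default given", so it cannot be returned
            # as a fallback value: the KeyError propagates instead.
            raise


def read_med(config):
    """Return the per-service MED as an int, or None if it is not set.

    'bgp-med' is a hypothetical key name used only for illustration.
    """
    try:
        return config.getint('bgp-med')
    except KeyError:
        return None
```

With this shape, a configured MED of 0 still comes back as the integer 0, which stays distinct from the "no MED" state represented by None.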
[13:08:11] i think i will add unit tests for the code I split off in bgp.py (so attributes, exceptions)
[13:08:26] because moving that code has high risk of breakage, some constant/variable not moved along/defined, etc
[13:08:34] and we should add coverage anyway
[13:09:57] oh I wasn't blaming you, I just tried that as a short path to validate BGP multi peering on pybal
[13:11:18] yeah
[13:11:45] so I spawned in 2 secs one gobgp instance: https://github.com/osrg/gobgp
[13:12:36] heresy!
[13:13:04] oh.. I was about to send a CR to replace pybal bgp implementation with gobgpd
[13:13:07] :P
[13:13:17] careful now
[13:14:27] O:)
[14:28:07] BTW, is somebody has op privileges here, feel free to update the channel logs link: https://bit.ly/2oSKhFv
[14:30:38] s/is/if/g
[14:41:31] Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4035164 (BBlack) Ping monitoring for this anchor merged in with: https://gerrit.wikimedia.org/r/#/c/417267/1/modules/netops/manifests/monitoring.pp What we're missing in configuratio...
[14:43:13] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#4035174 (BBlack) Reminder: after hardware level is fixed and the host is installed, we'll need to uncomment its entry in `hieradata/common/cache/upload.yaml` before it will successfully puppetize and join...
[14:45:15] Traffic, Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4035182 (BBlack) a:ayounsi We're still missing rancid definitions in puppet's `modules/rancid/files/core/router.db`, ping @ayounsi
[14:49:08] Traffic, Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4035206 (BBlack)
[14:49:11] Traffic, Operations, Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#4035199 (BBlack) Open>Resolved a:BBlack With the last merges above, all the known issues that actually belong here are resolved other than 3 cases from the previous lis...
[15:14:40] sigh
[15:14:40] [bgp.FSM@0x7fca29986950 peer 10.192.16.141:179] INFO: State is now: OPENSENT
[15:14:43] [bgp.BGP@0x7fca29a33a70 peer 10.192.16.141:179] INFO: Connection lost: Connection was closed cleanly.
[15:14:47] this doesn't make any sense
[15:15:55] 9 1.377766 10.192.16.139 10.192.16.141 BGP 103 OPEN Message
[15:16:06] 10 1.378084 10.192.16.139 10.192.16.141 TCP 66 59596 → 179 [FIN, ACK] Seq=38 Ack=2 Win=29696 Len=0 TSval=43797547 TSecr=43728293
[15:16:30] pybal sends the OPEN message and after that closes the connection
[15:19:29] do we have any crazy TCP tuning in pybal instances?
[15:19:36] (including pybal-test*)
[15:21:02] vgutierrez: is this with the latest 1.15.x version? 1.14.x on pybal-test was performing the whole BGP dance properly with quagga, wasn't it?
[15:21:36] sure
[15:21:38] same for 1.15.2
[15:22:06] this behavior is pretty similar to the one exposed in T188085
[15:22:07] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[15:22:24] I can trigger it with gobgpd configured in passive mode
[15:22:34] meaning only pybal is going to initiate the TCP connection
[15:22:45] so if I do a fast pybal restart, this would happen
[15:24:08] does a "slow" pybal restart work? (stop; sleep; start)
[15:24:17] works as expected yes
[15:24:21] interesting
[15:24:23] yup
[15:25:04] I'm focused on the "fast" pybal restart because it's the same that happened in T188085
[15:25:15] a package update triggers a restart
[15:25:29] right, do "fast" restarts work reliably on 1.14.x or can you easily reproduce T188085?
[15:25:43] same happens with 1.14.4
[15:25:47] ha!
[15:25:59] the thing is that in pybal-test2001 with a fast restart
[15:26:12] quagga connects to pybal and that BGP session gets established
[15:26:25] but with gobgpd configured in passive mode what I see is this
[15:26:39] and honestly doesn't make any sense
[15:26:54] are our routers behaving like quagga or like gobgpd?
[15:27:19] according to your log in https://phabricator.wikimedia.org/T188085 more like gobgpd
[15:27:39] as no incoming connection from re0.cr2-equid was seen
[15:28:52] ok
[15:29:38] great findings :)
[15:29:53] so pcap wise I cannot know who triggers the [FIN,ACK] TCP packet closing the BGP connection
[15:30:03] that's why I was asking about our TCP stack tuning
[15:30:44] but an easy test is reproducing this scenario on my laptop
[15:30:51] and see what happens
[15:48:50] same behaviour
[15:48:59] even on different OS
[15:49:36] what's the difference between a "slow" and a "fast" restart, tcpdump-wise?
[15:52:58] hmm
[15:53:06] one pretty obvious that I didn't see til now
[15:53:12] thx for the question
[15:53:12] 3 0.000516 10.192.16.139 → 10.192.16.141 TCP 66 42822 → 179 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=44681475 TSecr=44612222
[15:53:15] 4 0.001441 10.192.16.141 → 10.192.16.139 BGP 111 OPEN Message
[15:53:55] on the slow restart, pybal open the connections but is gobgpd the one who sends the OPEN message
[15:54:45] on the fast restart, (see packet #9 that I pasted at 16:15 CET) is pybal who sends the OPEN message
[15:54:57] so we have two state-machine races at different layers? who opens the winning TCP, and who sends OPEN first over the winning TCP? :)
[15:55:18] gotta love design-by-committee protocols :P
[15:55:49] I need to read carefully the BGPv4 FSM definition
[15:59:20] copies of that should come with a free shipment of migraine medicine
[15:59:47] indeed
[16:13:38] haha
[16:56:45] hmmm
[16:56:57] I see how gobgpd solves this issue
[16:57:16] basically pybal is way more aggresive regarding establishing the BGP session :)
[17:09:36] oh...
[17:09:46] https://tools.ietf.org/html/rfc4724 --> Graceful Restart Mechanism for BGP
[17:10:09] * vgutierrez flips table
[17:12:44] actually is not related.. *sigh*
[17:42:00] no that's not related
[17:42:05] yup
[17:42:05] however something I would really like to add to pybal
[17:42:10] got it though
[17:42:37] in a "fast restart" pybal doesn't wait for the other peer to go from IDLE to ACTIVE
[17:42:53] so the other peer closes the connection
[17:43:49] maybe if we slow down a little bit the pybal IDLE --> CONNECT we'd solve these issues
[17:58:54] https://github.com/wikimedia/PyBal/blob/master/pybal/bgpfailover.py#L81 --> enabling here the IdleHold looks like it would mitigate/prevent T188085
[17:58:54] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[17:59:59] basically because it would allow cr2-equiad FSM go from ESTABLISHED --> IDLE --> ACTIVE
[18:00:36] mark: thoughts?
[18:25:26] actually I'd go for playing safe by default, https://github.com/wikimedia/PyBal/blob/master/pybal/bgp/bgp.py#L945 setting there idleHold=True and lets the caller decide if he wants the aggresive bevaviour
[18:26:12] *behaviour
[20:01:22] Traffic, DNS, Mail, Operations, Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4029896 (herron) Here's a patch to get the ball rolling on a subdomain for this. It's WIP since we will need an admin on the greenhouse account to supply the...
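
The idle-hold hypothesis from the 17:42-18:25 messages can be illustrated with a toy timing model. This is only a sketch of the reasoning in the chat, not PyBal's FSM: the function name, its parameters, and all the timings are hypothetical, and it assumes the remote peer closes any connection that arrives before it has returned to ACTIVE.

```python
def connect_outcome(peer_active_at, idle_hold, restart_at=0.0):
    """Toy model of a restart against a passive peer (all values hypothetical).

    The local speaker leaves IDLE and connects at restart_at + idle_hold.
    The remote peer is assumed to accept a session only once it has gone
    ESTABLISHED -> IDLE -> ACTIVE, which happens at peer_active_at;
    an earlier attempt is assumed to be closed by the peer.
    """
    connect_at = restart_at + idle_hold
    if connect_at >= peer_active_at:
        return "ESTABLISHED"
    return "CLOSED_BY_PEER"


# "fast" restart: no idle hold, peer needs 2 s to get back to ACTIVE
fast = connect_outcome(peer_active_at=2.0, idle_hold=0.0)
# same restart with a 5 s idle hold before IDLE --> CONNECT
held = connect_outcome(peer_active_at=2.0, idle_hold=5.0)
```

Under these assumptions the first attempt is closed by the peer and the second one establishes, which matches the "slow restart works, fast restart does not" observation; the real cause would still need to be confirmed against the BGP FSM.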
[21:12:27] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#4036493 (BBlack)
[21:12:40] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#4036496 (BBlack)
[21:12:45] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (BBlack) Open>Resolved a:BBlack
[21:13:13] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#4036499 (BBlack)
[21:13:16] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (BBlack)
[21:13:22] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3793986 (BBlack) Open>Resolved a:BBlack
[21:13:55] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#4036504 (BBlack)
[21:14:07] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3815107 (BBlack) Open>Resolved a:BBlack
[21:16:03] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#4036509 (BBlack)
[21:16:40] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3794022 (BBlack) Open>Resolved a:BBlack (other than DOA cp5006, tracked separately for repair in T187157
[21:17:15] Traffic, Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4036517 (BBlack)
[21:17:18] Traffic, Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#4036515 (BBlack) Open>Resolved a:BBlack
[21:26:15] Traffic, Operations: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#4036575 (BBlack)
[21:26:18] Traffic, Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#4036573 (BBlack) Open>Resolved a:BBlack
[21:26:35] Traffic, Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4036582 (BBlack)
[21:26:38] Traffic, Operations: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#2962057 (BBlack) Open>Resolved a:BBlack
[21:27:27] Traffic, Operations, ops-eqsin, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#4036593 (BBlack)
[21:27:30] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#4036592 (BBlack)
[21:28:37] Traffic, Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4036599 (BBlack)
[21:28:39] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (BBlack)
[21:31:24] Traffic, Operations: WP Zero workarounds for eqsin - https://phabricator.wikimedia.org/T189250#4036605 (BBlack) p:Triage>Normal
[21:41:20] Traffic, Operations: Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (BBlack) p:Triage>Normal
[22:20:04] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4036787 (ayounsi)
[22:20:21] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (ayounsi)
[22:21:51] bblack: XioNoX: kudos for Asia :]
[22:37:37] Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4036874 (ayounsi) The IPv6 issue seems to be on the Atlas, most likely not configured yet. From the router interface I can't ping its global IP: `ping 2001:df2:e500:201:103:102:166:2...
[23:01:07] FYI, one of the esams-eqiad links is down, a car hit an aerial power and fiber line, techs are on site
[23:12:38] was it an aerial car? :)