[00:01:30] netops, Operations: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (ayounsi) a:ayounsi→faidon Over to Faidon for feedback.
[06:40:58] netops, Operations: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (crusnov)
[07:06:48] netops, Operations: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (Peachey88)
[09:48:55] netops, Operations: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (elukey) @crusnov hi! I think this is the same issue as T212697
[11:11:38] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (Vgutierrez) firmware upgrade completed for all the affected systems.
[13:00:08] Certcentral, Patch-For-Review: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 (Vgutierrez) >>! In T213820#4886317, @Krenair wrote: > @Vgutierrez: this is done now right? yes, I wanted to merge https://gerrit.wiki...
[13:11:53] Certcentral, Patch-For-Review: certcentral is incompatible with the current python3-acme version shipped in stretch-backports - https://phabricator.wikimedia.org/T213820 (Krenair) Open→Resolved
[13:43:32] Traffic, Operations, Pybal: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (Vgutierrez)
[14:08:24] Traffic, Operations, Pybal: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (Vgutierrez) After removing a service in pybal, a restart is not enough to get rid of the service at IPVS level, it should be removed manually with `ipvsadm -D -t ip:por...
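The T214041 note above describes services that linger in IPVS after being removed from the pybal configuration. A minimal sketch of how such leftovers could be spotted: it diffs a set of configured services against the service lines of an `ipvsadm -Ln` style listing. The sample listing, the addresses, and the helper names `ipvs_services` and `stale_services` are all hypothetical and not taken from pybal or the ticket.

```python
import re

# Hypothetical sample in the general shape of `ipvsadm -Ln` output.
# The addresses are documentation ranges, not real services.
SAMPLE_IPVS = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:80 wrr
  -> 198.51.100.1:80              Route   10     0          0
TCP  192.0.2.11:443 wrr
  -> 198.51.100.2:443             Route   10     0          0
"""

# A virtual service line starts with the protocol; real-server lines start with "->".
SERVICE_RE = re.compile(r"^(TCP|UDP)\s+(\S+)\s")


def ipvs_services(text):
    """Return the set of (protocol, address:port) virtual services in the listing."""
    services = set()
    for line in text.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            services.add((match.group(1), match.group(2)))
    return services


def stale_services(configured, ipvs_text):
    """Services present in IPVS but absent from the configured set."""
    return sorted(ipvs_services(ipvs_text) - set(configured))


if __name__ == "__main__":
    configured = {("TCP", "192.0.2.10:80")}
    for proto, addr in stale_services(configured, SAMPLE_IPVS):
        print(f"stale IPVS service: {proto} {addr}")
```

Anything reported this way would still need the manual removal mentioned in the ticket; the sketch only identifies candidates.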
[14:26:39] netops, Analytics, Analytics-Kanban, Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Nuria) Open→Resolved
[14:30:57] Traffic, Operations, Pybal: inconsistencies between pybal configuration and IPVS status - https://phabricator.wikimedia.org/T214041 (Vgutierrez) p:Triage→Normal
[16:45:18] netops, Operations: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (crusnov) >>! In T214008#4887389, @elukey wrote: > @crusnov hi! I think this is the same issue as T212697 Ah this is undoubtedly correct.
[16:46:26] netops, Operations: Netbox Dies Mysteriously Sometimes - https://phabricator.wikimedia.org/T214008 (elukey)
[16:56:12] netops, Operations, Patch-For-Review: IGMP snooping breaks IPv6 ND on Junos 14.1X53-D46 - https://phabricator.wikimedia.org/T201039 (ayounsi) p:Normal→Low Discussed it with Brandon, it's still something we want to fix but is now low priority. We will probably have to wait for the next DC fail...
[17:29:05] do we have any stats anywhere (grafana, or extractable from logstash somehow?) on cache hits/misses for restbase? or on fraction of responses from restbase that were setting no-cache?
[17:29:50] we don't have cache stats that break down per service, only total for the cluster (text vs upload)
[17:30:35] ah
[17:31:28] https://grafana.wikimedia.org/d/000000500/varnish-caching?refresh=15m&orgId=1
[17:31:53] and someone made a -1week comparator for the "true hitrate"
[17:31:58] https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?refresh=15m&orgId=1
[17:34:02] I'd guess part of the dive today was the rebooting of a bunch of eqiad caches for the bnxt_en debugging
[18:18:19] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) a:ayounsi→faidon Discussed it with Brandon and we think that option 3 is the best path forward. Over to @faidon for thoughts/review.
[18:48:49] Traffic, Operations, Pybal: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (Vgutierrez)
[18:54:50] Traffic, Operations, Pybal, monitoring: prometheus metrics apparently are missing some ipvs entries - https://phabricator.wikimedia.org/T214072 (CDanis) p:Triage→Normal
[19:13:51] Traffic, netops, Operations, Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (ayounsi) a:ayounsi In addition to the above DNS change, the following needs to change on the routers: `lang=diff,name=cr1/2-esams - shrink /28 to /29 [edit routing-options aggre...
[19:17:59] netops, Operations, ops-esams: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (ayounsi)
[19:18:02] Traffic, netops, Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi)
[19:18:05] Traffic, netops, Operations, ops-ulsfo, Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (ayounsi)
[19:24:42] netops, Operations: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (ayounsi)
[19:25:45] netops, Operations: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (ayounsi) p:Normal→Low a:ayounsi→faidon Over to @faidon for review/feedback.
[20:58:12] https://phabricator.wikimedia.org/P8003
[20:58:27] it's correct to interpret the above as meaning that parsoid is active/active and a backend in either DC can receive traffic, yes?
[21:12:13] yea, i would say so. 2xxx hosts being pooled should mean that
[22:02:40] netops, Operations: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ayounsi) Open→Resolved My guess based on those clues, is that this link flap caused at least some traffic from eqiad to codfw to be blackholed. Most likely the time protocol...
[22:03:40] netops, Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) p:Normal→Low Low priority, over to @faidon for feedback.
[22:03:51] netops, Operations: OSPF metrics - https://phabricator.wikimedia.org/T200277 (ayounsi) a:ayounsi→faidon
[22:08:37] bblack: is there a way to know to which DC a specific IP would be GeoDNS redirected?
[22:09:02] yes
[22:09:07] :)
[22:09:56] https://www.maxmind.com/fr/geoip2-precision-demo ? And compare the country code to the DNS config?
[22:10:04] (only thought about that now)
[22:10:21] no
[22:10:30] bblack@authdns1001:~$ gdnsd_geoip_test
[22:10:44] [starts an interactive shell, then input mapname followed by IP like:]
[22:10:50] > generic-map 1.2.3.4
[22:10:50] generic-map => 1.2.3.4/24 => eqiad, codfw, ulsfo, esams, eqsin
[22:10:58] "generic-map" is the one our public DNS uses
[22:11:20] > generic-map 2620:0:863::1
[22:11:20] generic-map => 2620:0:863::1/48 => ulsfo, codfw, eqiad, esams, eqsin
[22:11:38] generic-map => 91.193.176.142/17 => esams, eqiad, codfw, ulsfo, eqsin
[22:11:40] yay
[22:11:54] if you just want to do a single lookup you can put the mapname and IP on the commandline too, but if you're doing a bunch the shell way doesn't have to expensively reload maps every time
[22:11:58] it's about this russian ISP emailing noc@
[22:12:22] thanks!
[22:12:46] https://wikitech.wikimedia.org/w/index.php?search=gdnsd_geoip_test&title=Special%3ASearch
[22:12:54] "There were no results matching the query." :)
[22:13:00] add some :)
[22:13:33] maybe somewhere in https://wikitech.wikimedia.org/wiki/DNS
[22:13:58] yeah, on it :)
[22:15:58] my interesting observation from trying the above is that gdnsd_geoip_test seems to be outputting debug messages when it shouldn't heh
[22:16:03] there is another method to look up data in maxmind, btw:
[22:16:04] To look up data by hand, log in to mwlog1001 or mwmaint1002 and run mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip
[22:16:14] once tried that. from https://wikitech.wikimedia.org/wiki/Geolocation
[22:16:28] right, that will give you all the raw data maxmind has on a given IP
[22:16:37] (e.g. country, city, coordinates, etc)
[22:17:30] but then gdnsd has a layer of remapping it does based on our DNS config map, which can itself have either code bugs or errors in the map data (e.g. we typo'd a country code, or put it in the wrong continent group, I think both have happened before)
[22:17:42] bblack: shameless copy paste to https://wikitech.wikimedia.org/wiki/DNS#Know_to_which_DC_a_specific_IP_is_redirected
[22:17:52] gdnsd_geoip_test runs the same code and config as the prod DNS servers, so you know the dclist it outputs is what's actually happening for us with that netblock
[22:18:07] ack, makes sense
[22:19:34] (it actually is the same code as the DNS daemon itself, just linked into a separate CLI)
[22:23:08] Great, next Tuesday, Telia and Zayo have a maintenance starting at the same time, for 2 of the 3 links between codfw and eqiad
[22:24:18] I hope they're not sharing part of the infrastructure...
[22:24:29] it might be useful to look into whether they are heh
[22:28:22] Telia: Location of work: Charlotte, NC, US
[22:28:22] Zayo: Location of Maintenance: San Antonio, TX
[22:28:38] I guess it's unrelated and only bad timing
[22:32:39] you'd think redundant circuit maint overlapping from two vendors would be rarer than it has tended to be in practice
[22:33:08] I guess that probably reflects that maintenance outages on any given circuit are a lot less rare than I expect them to be
[22:50:04] bblack: and there is more, see email I sent you :)
[22:50:45] friday has 3 unrelated but overlapping maintenances
[22:51:44] fun!
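The `gdnsd_geoip_test` result lines quoted earlier in this log have the shape `mapname => network => dclist`. A minimal sketch, assuming that shape holds for every result line, of pulling such a line apart when checking several IPs. The function name `parse_geoip_line` is hypothetical, and reading the first dclist entry as the preferred DC is an assumption rather than something stated in the log.

```python
def parse_geoip_line(line):
    """Split 'map => network => dc1, dc2, ...' into its three parts."""
    parts = [part.strip() for part in line.split("=>")]
    if len(parts) != 3:
        raise ValueError(f"unexpected result line: {line!r}")
    map_name, network, dclist = parts
    return {
        "map": map_name,
        # Kept as a string: the tool echoes the queried address with the
        # matching prefix length, not a normalized network address.
        "network": network,
        "dclist": [dc.strip() for dc in dclist.split(",")],
    }


if __name__ == "__main__":
    # Result line copied from the log above.
    result = parse_geoip_line(
        "generic-map => 91.193.176.142/17 => esams, eqiad, codfw, ulsfo, eqsin"
    )
    # Assumption: the first entry in the dclist is the preferred DC.
    print(result["network"], "->", result["dclist"][0])
```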