[11:35:49] Traffic, Data-Engineering-Radar, SRE: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (jbond) p:Triage→Medium
[11:54:57] (EdgeTrafficDrop) firing: 62% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[11:59:57] (EdgeTrafficDrop) resolved: 62% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[12:00:18] <_joe_> uhhh what was that?
[12:00:44] <_joe_> ah we actually had a spike of requests
[12:54:52] in general, drmrs's traffic volume is low enough to make that EdgeTrafficDrop thing unreliable/flaky (known issues)
[14:57:52] laptop:~$ host en.wikipedia.org
[14:57:52] en.wikipedia.org is an alias for dyna.wikimedia.org.
[14:57:52] dyna.wikimedia.org has address 91.198.174.192
[14:57:58] bblack: ^
[14:58:06] from France of course
[14:58:09] :)
[14:59:00] host reflect.wikimedia.org
[14:59:00] reflect.wikimedia.org has address 145.100.185.15
[14:59:06] XioNoX: if you do "host reflect.wikimedia.org" it will give some insight on what IP address our geoip is seeing (likely a recursor exit)
[14:59:11] doh, you typed faster :)
[14:59:22] looks like something in NL
[14:59:35] that's from the coworking space I'm in
[15:00:29] yeah
[15:00:41] the set DNS to 8.8.8.8
[15:00:45] they*
[15:00:50] this can confirm the other part of it too:
[15:00:56] bblack@dns1002:~$ gdnsd_geoip_test generic-map 145.100.185.15 2>/dev/null
[15:00:59] generic-map => 145.100.185.15/10 => esams, eqiad, codfw, ulsfo, eqsin
[15:01:30] https://wikitech.wikimedia.org/wiki/DNS#Know_which_IP_the_AuthDNS_is_seeing_a_query_from
[15:01:48] and the block above
[15:02:04] ah yeah, nice
[15:02:10] I'm a bit surprised that 8.8.8.8 in France exits in the NL
[15:02:25] TIL reflect.w.o :)
[15:02:32] probably depends a bit on whatever ISP is supplying the co-working space
[15:03:10] maybe they're regional and they get all their upstream access out of NL or something
[15:04:01] laptop:~$ host reflect.wikimedia.org 8.8.8.8 -> reflect.wikimedia.org has address 217.128.133.0
[15:05:15] yeah that maps to drmrs
[15:05:43] I wonder why you get a different answer when specifying 8.8.8.8 directly?
[15:05:50] yeah, no idea :)
[15:46:01] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (ayounsi)
[16:07:04] Quick question: When you depool a cp-* server you currently either use `confctl` on a puppetmaster or `depool` on the host itself, is that right? No cookbook at the moment.
[16:10:02] I'm asking because I've drafted a new Alertmanager check for varnishkafka throughput (https://gerrit.wikimedia.org/r/c/operations/alerts/+/773801) T300246 - but I think the alert will trigger when hosts are intentionally depooled. I was wondering about integrating a 'create silence' in Alertmanager or some other way of preventing this from firing.
[16:10:03] T300246: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246
[16:19:06] IIRC pooled status (as seen by etcd/conftool) isn't in prometheus yet as a metric (it is from pybal though IIRC), the former should be simple enough to add these days when/if needed
[16:19:37] +1 to putting pooledness in prometheus
[16:19:45] +1 would be really cool
[16:20:34] is it confd that maintains what's on https://config-master.wikimedia.org/ ?
[16:20:44] could write a template in node_exporter textfile format ;)
[16:23:05] Oh yeah, that would be a very neat solution.
[16:23:09] IIRC yeah that's it, also yes it'd be textfile indeed
[16:23:09] Traffic, DC-Ops, SRE, ops-eqiad: cp1090.mgmt ssh port not accessible - https://phabricator.wikimedia.org/T304589 (Cmjohnson) Open→Resolved a:Cmjohnson re-seated the mgmt cable. no issues logging into mgmt interface root@cp1090.mgmt.eqiad.wmnet's password: /admin1->
[16:23:38] a slight variation on the theme for "reasons" but I did the work already with "mini textfile exporter" for the network probes
[16:23:52] basically because we need to be able to write an arbitrary "instance" label
[16:24:03] anyways that's a detail, point being that it should be easy
[16:37:03] godog: Thanks. I have tagged you on the ticket and the patch. Feel free to let me know if I can help implement the pooled/depooled metric.
[16:37:08] godog: nice
[16:37:11] which network probes are those?
[16:38:30] topranks: for now the work I did at https://phabricator.wikimedia.org/T291946 though possibly any network level check
[16:38:52] btullis: for sure, I don't have the bandwidth to implement the metric but happy to assist/brainstorm
[16:39:25] btullis: my understanding is that it should be a variation of what confd does on config-master as cdanis was pointing out
[16:39:50] godog: super I'll dig in and check it out :)
[17:34:05] even with just a few countries mapped, you can see the reduction in peak esams traffic, nice view here:
[17:34:08] https://w.wiki/4zGW
[19:29:51] already more traffic than ulsfo at peak :)
[21:06:24] Traffic, Data-Engineering, SRE, Trust-and-Safety, and 2 others: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (Dzahn) a:Dzahn
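
Note on the 16:19-16:39 thread (pooled state in Prometheus via a node_exporter textfile): the idea could look roughly like the sketch below. It is not the implementation discussed in the channel. The metric name `conftool_pooled`, the label names `hostname` and `service`, the function names, and the sample hosts are all hypothetical, and in practice the input would come from whatever confd renders rather than a hard-coded dict. The only things taken as given are the Prometheus text exposition format and the usual textfile-collector practice of writing the file atomically.

```python
import os
import tempfile


def render_pooled_metrics(states):
    """Render {(hostname, service): pooled_bool} in Prometheus text format.

    Metric and label names are illustrative assumptions, not an existing schema.
    """
    lines = [
        "# HELP conftool_pooled 1 if the host is pooled for the service, else 0.",
        "# TYPE conftool_pooled gauge",
    ]
    # Sort so the output is stable between runs.
    for (host, service), pooled in sorted(states.items()):
        value = 1 if pooled else 0
        lines.append(
            f'conftool_pooled{{hostname="{host}",service="{service}"}} {value}'
        )
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write content to path atomically (temp file in the same dir, then rename).

    This avoids the collector scraping a half-written file.
    """
    directory = os.path.dirname(path) or "."
    tmp = tempfile.NamedTemporaryFile(
        mode="w", dir=directory, suffix=".tmp", delete=False
    )
    try:
        with tmp:
            tmp.write(content)
        os.replace(tmp.name, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)
        raise


if __name__ == "__main__":
    # Made-up sample data; real state would come from etcd/conftool.
    sample = {
        ("cp-example1", "varnish-fe"): True,
        ("cp-example2", "varnish-fe"): False,
    }
    print(render_pooled_metrics(sample), end="")
```

With a metric like this available, the varnishkafka alert expression could be joined against it so that depooled hosts do not fire, instead of creating silences by hand. That join is likewise only an assumption about how the alert might be written.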