[06:54:03] Traffic, DNS, Operations: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (Volans) p:Triage→Unbreak!
[06:57:55] Traffic, DNS, Operations: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (Volans) The errors seem to be caused by the lack of the entry related to `releases` in the `discovery-states` file in gdnsd configuration. This in turn seems to be related to the fact that in...
[07:03:27] Traffic, DNS, Operations: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (Volans) Open→Resolved I've merged https://gerrit.wikimedia.org/r/c/operations/dns/+/628995 and now `authdns-update` runs without errors and the DNS is unblocked.
[07:21:40] o/ I'm going to remove some LVS services again (just heads up)
[07:22:49] jayme: is that freeing IPs?
[07:22:58] volans: na, sorry
[07:23:23] mentioning in case it should be reflected in Netbox
[07:24:06] just removing HTTP services, will keep the HTTPS ones (with same IP). So no change there
[07:24:42] ack jayme
[07:27:04] ema: hey, as per yesterday chat here it was deemed safer to depool ulsfo for the deploy of the netbox changes
[07:27:31] is there anything ongoing that I should avoid to overlap with? would now be a good time?
[07:27:34] I've sent https://gerrit.wikimedia.org/r/c/operations/dns/+/629055
[07:28:36] volans: now is a good time!
[07:30:26] ema: ack, thx, proceeding then
[07:30:49] anyone of you care a review of the depool patch?
[07:31:03] always good to have a second pair of eyes for those :)
[07:31:41] looking
[07:32:06] volans: lgtm
[07:32:15] thx!
[07:34:31] ema: what's the best grafana dashboard to keep an eye on those days to see the depool?
[07:35:16] volans: I like https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=ulsfo&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[07:35:55] thx
[07:36:13] but others are cool too, like: https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&orgId=1&var-cluster=text&var-cluster=upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[07:36:39] maybe the latter is even better as you can see the raw request rate comparison with last week
[07:37:55] nice
[07:39:37] ema: unless there is an easy way to wipe-cache in the resolvers for a broad range of records I might need to keep ulsfo depooled for a bit more than 1H to ensure all TTLs are expired
[07:40:09] I didn't poke at rec_control yet, the wikitech page just mentions single recorss
[07:40:12] *records
[07:42:31] mmmh it mentions:
[07:42:35] DOMAIN can be suffixed with a '$'. to delete the whole tree from the cache. i.e. 'powerdns.com$' will remove all cached entries under and including the powerdns.com name.
[07:43:00] doesn't seem to allow to do that easily for PTRs too, might be easier to just wait
[07:43:11] volans: it's fine to keep it depooled for the next few hours, ulsfo's traffic isn't much at this time of day
[07:43:48] the whole DC, text+upload, has less rps than a single text@esams node
[07:44:01] eheheh yeah I knew this was a good time
[07:49:19] ema: unrelated, do you know if donate-lb.ulsfo (and related for the other DCs) are in use?
[07:49:46] they have A/AAAA/MX records, and the A/AAAA are the same IPs of text-lb
[07:50:15] volans: no idea, sorry
[07:50:41] np, thx anyway
[07:50:45] A/AAAA/AAAAAA/AAAAAAAA
[07:50:51] (sorry) :)
[07:51:21] lol
[07:51:48] ema: sorry to keep bothering, is the increase of 502s expected?
[07:51:48] https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=14&refresh=1m&orgId=1&var-site=ulsfo&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=now-30m&to=now
[07:53:12] could be also related to the temporary failure of the network link
[07:53:38] volans: very likely unrelated to the traffic depool
[07:53:54] I see a burst of errors like:
[07:53:55] 20200922.07h48m11s CONNECT:[0] could not connect [CONNECTION_CLOSED] to 10.2.1.17 for 'https://restbase.discovery.wmnet:7443/en.wikipedia.org/v1/page/media-list/Camping_food/961670220'
[07:54:35] but they've stopped, and affected both text and upload at the same time, so very likely network link issues
[07:56:01] k
[07:56:52] ok, traffic has shifted, going to proceed
[07:57:06] ack
[07:58:24] Traffic, Analytics-Radar, Operations, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema)
[07:58:53] Traffic, Operations, Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (ema) Open→Resolved Most patches dropped! We're left with: - 0005-stats-shortlived.patch - 0006-bump-api-soname.patch Closing.
[09:31:26] FYI repooling ulsfo
[09:31:34] 1h has passed since the last dns merge and all still looks good
[09:47:41] I'll do some more LVS restarting :-)
[10:04:57] ulsfo is back to normal traffic levels, nothing so far out of the ordinary as a consequence of the migrated DNS
[10:05:03] lmk if you see anything strange
[10:14:04] netops, Operations, ops-eqiad: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (fgiunchedi) Checking the full list of servers, we have 10 ms-be hosts in there. Since we're deploying Swift with row-availability in mind I'm ok not to depool Swift out of eqiad for this. I...
[10:50:21] Acme-chief, Traffic, Operations, Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (BBlack) For the record, the renewal of unified on the 18th did successfully fetch two different chains, but they're currently set up with the default chain to...
[12:25:31] netops, Operations, Patch-For-Review: Prioritize underdog IXP - https://phabricator.wikimedia.org/T262517 (ayounsi) Open→Resolved All done in ulsfo, other sites to follow when they are turned on.
[12:28:10] Traffic, Operations, observability: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (fgiunchedi)
[12:53:15] bblack: ulsfo DNS was migrated this morning, nothing to report, we'll send patches for the other PoPs in the next days. Do you think it's better to depool them too when deploying?
[13:18:06] volans: I think if ulsfo went fine, and we can pre-verify that we expect no wire-visible changes to DNS records, there's no need to depool further.
[13:18:33] I assume it's a one-step change from the dns server pov (the records don't disappear for a short time from the pov of querying traffic)
[13:19:09] which makes sense so long as the netbox files are already present and the dns-level change is just swapping the manual records for the $INCLUDE
[13:21:02] yes it's exactly that
[13:21:04] Traffic, DNS, Operations, Patch-For-Review: 'skip_first' feature flag for gdnsd GeoIP plugin - https://phabricator.wikimedia.org/T261340 (CDanis) Open→Resolved a:CDanis Deployed at 13:20 UTC. The original TTL of the intake-logging CNAME was 1 day, so it will take that long for all cl...
[13:21:29] volans: yeah so I think we're fine. We've tested the process carefully, and we can pre-verify the records match for each switch, we're good to go without depool I think
[13:22:02] bblack: confirmed from a few VPSes of mine that text-next works as expected -- thanks!
[13:22:04] ack, I was thinking the same but no problem to do that if deemed safer :)
[13:22:31] cdanis: awesome :)
[13:33:54] netops, Operations: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (CDanis) +1 for leaving the default `speed_calculation_delay = 1` +1 for setting `average_calculation_time_for_subnets` to the same as our overall `average_calculation_time` Just to make sure I understand, you w...
[13:52:11] Traffic, Analytics-Radar, Operations, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema)
[14:01:54] netops, Operations, Patch-For-Review: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (ayounsi) > Just to make sure I understand, you were thinking we don't enable_connection_tracking? I agree with that, it rather expensive on CPU. Correct. > Surprisingly though one new CLI...
[14:19:46] netops, Operations, Patch-For-Review: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (ayounsi) Open→Resolved a:ayounsi All done! Thanks a lot and we can revisit when we need the API client.
[14:35:54] Traffic, Analytics-Radar, Operations, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema)
[15:28:11] Traffic, Operations, Patch-For-Review: Varnish 6.0 needs a SONAME version bump - https://phabricator.wikimedia.org/T261487 (ema) Open→Resolved a:ema This is now done, our Varnish package version 6.0.6-1wm1 includes 0006-bump-api-soname.patch taking care of the SONAME bump. All dependencie...
[15:28:16] Traffic, Operations, Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (ema)
[15:28:25] Traffic, Analytics-Radar, Operations, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) Open→Resolved a:ema All packages ready for prime time!
[15:32:15] nice, that's a lot less patches :)
[15:37:17] oh yes :)
[15:38:46] Traffic, Operations: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema)
[15:39:00] Traffic, Operations: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (ema) p:Triage→Medium
[15:39:05] tomorrow ^
[15:39:46] time to leave o/
[16:04:02] Traffic, DC-Ops, Operations, ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (BBlack) a:BBlack→RobH @RobH please do when able
[16:12:18] Traffic, Operations, serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (BBlack) Open→Declined I don't think anyone's had any ideas in these months, and the operational context and grafana data is startin...
[16:13:04] netops, Cloud-VPS, Operations, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (BBlack)
[16:16:35] netops, Operations, IPv6, Patch-For-Review, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (BBlack) a:jbond
[16:46:25] Traffic, Operations, serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (CDanis) Wow, I had completely forgotten about this ticket -- but, I'm plugging T263277 here as something that would've helped diagnose, if...
[17:28:03] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (RobH) Open→Resolved ` robh@re0.cr2-eqiad> show interfaces diagnostics optics xe-3/3/7 Physical interface: xe-3/3/7 Laser bias current : 38.708 mA Lase...
[17:28:23] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (RobH) Resolved→Open
[17:33:14] netops, DC-Ops, Operations, ops-eqiad: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (RobH) Open→Resolved a:Cmjohnson→None
[18:58:16] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (Gehel) p:Medium→Low
[19:26:57] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (Gehel) Open→Resolved a:Gehel Looks like everything is done, please re-open if I've missed something.