[01:11:16] 10netops, 10Operations: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) Feb. 18th - 13:00UTC - 2h - cr2/3-esams [05:11:18] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2004.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130511_vgutie... [05:13:46] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2011.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130513_vgutie... [05:32:44] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2004.codfw.wmnet'] ` and were **ALL** successful. [05:36:01] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130535_vgutie... [05:36:03] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2011.codfw.wmnet'] ` and were **ALL** successful. [05:39:06] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2008.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130538_vgutie... 
[05:39:27] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [05:44:25] bblack: I got a fix for the KA issue :D [05:58:30] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2001.codfw.wmnet'] ` and were **ALL** successful. [06:05:24] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2002.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [06:05:59] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [06:08:42] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2005.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [06:12:55] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1089.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130612_vgutie... [06:17:03] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1090.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130616_vgutie... 
[06:23:21] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2002.codfw.wmnet'] ` [06:29:47] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2005.codfw.wmnet'] ` and were **ALL** successful. [06:33:28] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1089.eqiad.wmnet'] ` [06:37:24] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1090.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1090.eqiad.wmnet'] ` [07:03:01] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [07:03:54] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1087.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130703_vgutie... [07:09:15] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1088.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130708_vgutie... 
[07:13:43] 10Traffic, 10Operations, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) 05Stalled→03Open After some investigations, it looks like [[ https://github.com/apache/trafficserver/pull/5811| PR 5811 ]] from up... [07:13:45] 10Traffic, 10Operations: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [07:14:11] 10Traffic, 10Operations, 10serviceops: Add x-request-id to httpd (apache) logs - https://phabricator.wikimedia.org/T244545 (10Joe) 05Open→03Resolved a:03Joe [07:26:51] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1087.eqiad.wmnet'] ` [07:29:26] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1088.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1088.eqiad.wmnet'] ` [07:53:52] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [08:20:38] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1085.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1085.eqiad.wmnet'] ` [08:27:34] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1086.eqiad.wmnet'] ` and were **ALL** successful. 
[09:26:37] 10netops, 10Operations: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10jijiki) [09:46:24] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1083.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130946_vgutie... [09:46:49] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [09:48:29] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1084.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130948_vgutie... [10:09:27] after yesterday's authdns2001 reinstall it needs another reboot; the initial puppet run installs the microcode updates, but these only get loaded in the initramfs on next reboot. typically the reimage script handles it at the end, but maybe there was some error [10:10:04] ack [10:11:16] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1084.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1084.eqiad.wmnet'] ` [10:14:37] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1083.eqiad.wmnet'] ` and were **ALL** successful. 
[10:40:38] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1081.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1081.eqiad.wmnet'] ` [10:42:56] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1082.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1082.eqiad.wmnet'] ` [10:44:14] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) Thanks all for your feedback. Since the change we perform didn't have the expected re... [11:07:02] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:15:23] 10Traffic, 10Operations, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) The issue with timeouts and KeepAlive can be easily understood with a small environment using curl + ATS + httpbin. 1. curl requests /... [11:17:33] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1079.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131117_vgutie... [11:20:18] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1080.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131119_vgutie... 
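The keep-alive issue sketched on T244464 above (curl + ATS + httpbin) boils down to a race between an origin's idle timeout and a client reusing a pooled connection. A minimal self-contained illustration of that race, with made-up timeouts and a toy HTTP exchange rather than the real ats-tls/varnish-fe setup:

```python
import socket
import threading
import time

# Toy origin that closes idle keep-alive connections after IDLE_TIMEOUT
# seconds (hypothetical value, not the actual varnish-fe setting), and a
# client that only notices once it comes back to the connection.
IDLE_TIMEOUT = 0.5

def origin(listener):
    conn, _ = listener.accept()
    conn.settimeout(IDLE_TIMEOUT)
    try:
        while conn.recv(4096):  # serve requests until idle or EOF
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
                         b"Connection: keep-alive\r\n\r\nok")
    except socket.timeout:
        pass  # client went idle too long: origin hangs up unilaterally
    finally:
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=origin, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")
first = client.recv(4096)      # fresh connection: request succeeds

time.sleep(IDLE_TIMEOUT * 3)   # idle longer than the origin allows
second = client.recv(4096)     # b"": the origin already closed it
print(first.split(b"\r\n")[0], second)
```

The same failure hits a proxy that reuses a pooled origin connection just as the origin times it out; the usual mitigation is keeping the client-side idle timeout strictly shorter than the server-side one (or retrying idempotent requests on a fresh connection).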
[11:44:57] 10Traffic, 10Android-app-Bugs, 10Operations, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10jbond) p:05Triage→03Medium [11:46:48] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1079.eqiad.wmnet'] ` and were **ALL** successful. [11:48:18] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1080.eqiad.wmnet'] ` and were **ALL** successful. [11:48:20] 10netops, 10Operations: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10jbond) p:05Triage→03Medium [11:53:19] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:54:09] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1077.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131154_vgutie... [11:56:19] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131156_vgutie... [12:23:25] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1077.eqiad.wmnet'] ` and were **ALL** successful. 
[12:25:21] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1078.eqiad.wmnet'] ` and were **ALL** successful. [12:28:53] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:29:38] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1075.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131229_vgutie... [12:31:26] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1076.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131231_vgutie... [12:58:07] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1075.eqiad.wmnet'] ` and were **ALL** successful. [13:00:17] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1076.eqiad.wmnet'] ` and were **ALL** successful. [13:01:29] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [13:01:43] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez `vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'cat /etc/debian_version' 78 hosts will be targeted: cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-20... 
[13:43:10] 10Traffic, 10Operations, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) the issue described above should be fixed almost everywhere: `===== NODE GROUP ===== (76) cp[2001-2002,2004-2008,2010-2014,2016-202... [14:04:50] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) >>! In T170567#5880790, @gerritbot wrote: > Change 571976 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez): > [operations/puppet@pr... [14:35:41] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) The only way to resolve the issue is to increase the rate of edits Query Updater can handle. [14:36:24] bblack: have we ever thought about defining an optimize-for-equal-load-between-eqiad-and-codfw GeoDNS map, in addition to the optimize-for-latency one we have now? [14:38:23] yes, we have thought :) [14:39:59] it's something I'd like to try out for Maps, if we're open to the possibility. eqiad peak CPU util is about 50%, codfw peak CPU is about 18%. if we were to re-enable either tile generation or access by external clients, eqiad would be much much hotter, and codfw still underused [14:42:12] right now, maps and upload share everything from the IP address down, so we can't change them independently without first splitting it at least at the IP layer again. [14:42:33] we couldn't just split into a different dyna? [14:42:43] but, we've been talking about it and I've been thinking about it anyways, the idea of re-balancing the US to utilize codfw more, for the general case.
[14:42:46] I had imagined this as just a DNS-level change [14:43:02] yes, I guess it could be a redundant dyna with the same IP [14:43:25] there's just some historical inertia to get over mentally here [14:43:47] historically we've always said we'll geomap as best we can for latency and build resources to handle it. [14:44:09] mm sure, sure [14:44:20] I think the pragmatic answer, for this particular case at this particular juncture in history, is probably to rebalance just the US based on load concerns [14:44:56] there's South America too, which is all mapped to eqiad IIRC [14:45:14] because (a) the US users have great latency anyways, being close to the core DCs + (b) the diff between codfw/eqiad is small for most users we would shift + (c) it will equalize the load and make the esams-fail situation less-impactful and make better use of codfw resources, etc, etc [14:45:48] but even with all this obvious alignment and rationale, it is a kind of sea change in policy :) [14:47:27] bblack: BTW last night's KA issue has been solved with 8.0.5-1wm16 [14:47:54] bblack: thanks for the explanation, all makes sense [14:48:05] ema: yeah that would help too, but as they're higher-latency, if codfw's actually further on the network I'd rather leave them wherever they get best results [14:48:40] SA only accounts for something like 15% of the traffic NA does.
[14:49:15] Ecuador is closer to eqiad than ulsfo/codfw in terms of latency [14:49:44] most of the submarine cables from SA (at least the eastern part) terminate in Florida or even Virginia [14:49:53] yep [14:50:10] right, unfortunately not much crosses the gulf or mexico itself coming up into TX [14:50:40] I'd be surprised if countries like Chile had better latency to eqiad than codfw/ulsfo :) [14:50:53] also the westernmost point of SA is south of florida, not of california [14:50:58] we can find out easily if there's atlas probes there [14:51:37] Equador had better latency to ulsfo judging from our good old openstreetmap latency picture [14:51:42] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=ipv4&var-country_code=EC&var-asn=All [14:51:54] Ecuador even [14:51:59] only one datapoint from ecuador in what we're already scraping [14:52:10] but eqiad is the winner, with codfw 7ms behind [14:52:20] I suspect this is because of SAm-1 https://www.submarinecablemap.com/#/submarine-cable/south-america-1-sam-1 [14:52:43] and physical distance [14:53:01] (for west vs. east coast) [14:53:11] yeah [14:53:20] interesting!
Then things have changed compared to the latency map I show in all presentations :) [14:54:00] we could/should have a more-automated way of comparing our manual map config to ripe data anyway [14:54:05] 4 datapoints in Chile, eqiad also winner, codfw 5ms behind [14:54:20] it sounds like we have all the bits to do that, just missing a script to spit out deviations, at least for country-level outside of NA [14:54:33] yeah that's a chat I have to have with cdanis :) (latency maps) [14:55:09] bblack: note that the data we have in prometheus for RIPE Atlas, the ones provided by their stock measurements, are mostly anchor-to-anchor [14:55:30] I believe there's another built-in measurement that is probes-to-our-anchors, but we don't scrape it right now (have been meaning to add them) [14:55:43] we'd probably want to use that, and gather some more data anyway, to do user latency mapping [14:55:59] sure [14:56:09] we probably don't even want to source from it or trust it blindly either [14:56:14] yeah :) [14:56:18] and use our millions of credits to get data from all the probes [14:56:24] 1 datapoint in Paraguay and eqiad is also the winner (codfw adds 12ms) [14:56:24] it'd just be nice to run a script and spit out where our config and the data differ, as a source of cases to investigate/fix manually [14:57:16] +1 [14:57:21] lots of things we can do on that front :) [14:57:52] since it's annual planning season, I'll point out that this is obviously MTP-aligned work ;) [15:00:27] anyway back to where we started -- bblack, how do you want to proceed on the codfw mapping changes?
if we wanted to do an experiment with a new dyna, starting with maps I could send you patches for that, but certainly open to plenty of other options there as well [15:02:06] I'll get back to you in a little bit, have a 30m mtg [15:02:40] np [15:13:01] (something that would be interesting/fun to do: a setup that lets you "use wikipedia from X country", simulating the network RTT; probably easiest to do on a laptop we then passed around to people [c.f. https://github.com/tylertreat/comcast] but would still be a fun demo for allhands) [15:23:25] cdanis: tc? [15:55:37] cdanis: yeah basically, I'd like to propose a new base map that balances the US more (shift more stuff to the left, a little more into ulsfo but not too much, focus on codfw/eqiad load balance without killing latencies more than necessary - stick to US state shifts, leave further-flung things alone) [15:56:10] we can experiment with that on maps first if you want, but maps loading is probably different than upload+text loading (it may have different demographics for whatever reasons) [15:56:15] still, it can be a starting point [15:58:28] (so yeah, maybe make a couple patches - one to split maps dyna using the same IPs (a new redundant-looking entry in geo-resources), and another with some ballpark guess at the rebalance as a separate map from generic-map) [15:59:18] keep it structurally similar though, with the intent we'll eventually make it the new generic-map and want to keep being able to diff them for now easily.
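The "script to spit out deviations" wished for in the latency discussion above could look roughly like this sketch; the country-to-datacenter map and the RTT numbers below are entirely made up for illustration, not real GeoDNS config or RIPE Atlas data:

```python
# Hypothetical inputs: a manual GeoDNS country->datacenter assignment,
# and per-country median RTTs to each DC (e.g. scraped from RIPE Atlas
# probe measurements into Prometheus). All values are invented.
geo_map = {"EC": "eqiad", "CL": "codfw", "PY": "eqiad"}
measured_rtt_ms = {
    "EC": {"eqiad": 78, "codfw": 85, "ulsfo": 110},
    "CL": {"eqiad": 120, "codfw": 128, "ulsfo": 150},
    "PY": {"eqiad": 140, "codfw": 152},
}

def deviations(geo_map, rtts, slack_ms=5):
    """Yield (country, mapped_dc, best_dc, delta_ms) for every country
    whose mapped DC is more than slack_ms slower than the measured best,
    i.e. a case worth investigating/fixing manually."""
    for cc, mapped in geo_map.items():
        best = min(rtts[cc], key=rtts[cc].get)
        delta = rtts[cc][mapped] - rtts[cc][best]
        if delta > slack_ms:
            yield cc, mapped, best, delta

for cc, mapped, best, delta in deviations(geo_map, measured_rtt_ms):
    print(f"{cc}: mapped to {mapped}, but {best} is {delta}ms faster")
```

With the invented numbers above, only CL is flagged; the `slack_ms` threshold is there so small diffs (like the 5-7ms eqiad/codfw gaps mentioned in the log) don't generate noise.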
[16:00:10] (we'll have to sort out some minor diffs on the esams-fail map eventually too, probably) [16:18:35] 10Traffic, 10Analytics, 10Operations: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) [16:24:28] 10Traffic, 10Analytics, 10Operations: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) We can also change the alarm to hourly and see events caught. Now, this might mean a super large number of false positives so we nee... [16:25:05] 10Traffic, 10Analytics, 10Operations, 10Research: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) [16:28:55] vgutierrez: on the tlsv1.3 outbound patch - I'm assuming the thinking is that we have AES accel everywhere. Maybe still stick chapoly in there as second-choice, before aes128, though? [16:29:14] so I've assumed the same as with outbound TLSv1.2 [16:29:42] but yeah I'm open to suggestions :) [16:29:57] ah [16:30:02] yeah I never noticed it there either [16:30:38] you could make some arguments about whether aes256 or chapoly should be the first choice. for public we choose chapoly, for internal for now I think aes256 does make more sense. 
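For reference, the outbound TLSv1.3 preference being discussed (AES-256 first, ChaCha20-Poly1305 as the second choice, AES-128 left out) corresponds to this OpenSSL-style ciphersuite string; the exact ATS/puppet setting it would be applied to isn't shown in the log, so only the string itself is given:

```
TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
```

Leaving `TLS_AES_128_GCM_SHA256` off the list mirrors the aes256-only choice already made for outbound TLSv1.2 in this controlled environment; any future origin would then have to support one of the two listed suites.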
[16:30:52] I'll put up a patch to inject chapoly in the middle though, it's still a better choice than aes128 [16:31:04] ack [16:31:13] I'll change that before enabling TLSv1.3 fleet wide [16:31:18] (for outbound traffic) [16:32:55] oh now I really see, we had aes256-only for 1.2 [16:33:25] hmmm [16:33:43] for a controlled environment I don't see it as a problem :) [16:33:53] yeah, if we decide to switch we always can [16:34:03] in which case, we could leave 128 out of the 1.3 list too [16:34:19] yeah [16:35:09] the only potential problem, is if some future origin doesn't configure aes256 for some reason, but I guess if we stick to a single choice, we'll find out quickly when someone tries to configure such an origin :) [16:35:16] I've read somewhere that kernel 5.6 ships HW acceleration for POLY1305 [16:35:21] https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=d7d7b853566254648df59f7ea27ea05952a6cfa8 [16:35:31] I wonder if that's already being used in OpenSSL [16:35:36] yeah if poly becomes faster, it's definitely the better choice [16:35:50] but for now for the inside, I'm assuming aes256 still has the speed advantage [16:35:54] yup [16:36:13] I'm gonna leave soon.. [16:36:17] current status of experiments [16:36:29] cp4031 running with ats-tls using KA against varnish-fe [16:36:41] cp4026 and cp4032 running ATS 8.0.6-rc0 [16:36:59] 8.0.6 still has some known issues? [16:36:59] 8.0.6-rc0 today is behaving well after reverting 1 commit [16:37:03] oh ok [16:37:05] I've already talked to upstream [16:37:23] I saw the ka fixups you put into 8.0.5 [16:37:36] is that "everywhere" now?
[16:37:44] yes [16:37:49] awesome [16:37:55] every cp node runs buster and 8.0.6-1wm16 [16:37:59] *8.0.5-1wm16 [16:38:00] sorry :) [16:38:20] also [16:38:21] on cp4031 [16:38:23] see https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=now-3h&to=now&var-site=ulsfo%20prometheus%2Fops&var-instance=cp4031&var-layer=tls&fullscreen&panelId=56 [16:38:50] this is the halfopen setting? [16:38:53] yes [16:39:08] it's the proper fix for T236458 apparently [16:39:08] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 [16:39:17] nice! [16:39:19] I'll leave that running on cp4031 till tomorrow morning [16:39:35] and if it behaves properly like it's apparently doing [16:39:39] I'll apply it fleet wide [16:40:06] the shape looks doubtful, but we'll see, maybe it levels off [16:40:17] if you zoom out, it looks like the start of the same ramp so far though [16:40:35] traffic is also getting higher on ulsfo :) [16:40:37] next hour or two will tell I think! :) [17:06:29] 10netops, 10Operations: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10ayounsi) I think this means that the query to that URL times out. As it completes properly from codfw I'm wondering if it's not an issue with the webproxies (overloaded or similar). Any idea who can help looking into it? [18:26:45] sigh [18:26:55] it looks like it isn't that [18:50:59] XioNoX: yeah, the 'comcast' tool just generates `tc` and `iptables` commands (the latter specifying drop probabilities) [18:51:44] bblack: okay, thanks!
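Since 'comcast' just wraps `tc`/`iptables`, the latency-emulation part for a "use wikipedia from X country" laptop demo reduces to roughly the following netem rules (interface name and numbers are illustrative, requires root, shown as a sketch only):

```shell
# Emulate ~150ms RTT (75ms each way, 10ms jitter) plus 1% loss on eth0;
# approximately the qdisc a wrapper like 'comcast' would install.
tc qdisc add dev eth0 root netem delay 75ms 10ms loss 1%

# Check what is active, then delete it to restore normal networking.
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```

Note this shapes egress only; emulating the far country's asymmetric paths or drop probabilities on ingress is where the `iptables` side (or an ifb device) comes in.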
I'll get around to poking at the maps-map soon :) [19:06:26] 10netops, 10Operations, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) p:05Triage→03Medium [19:09:23] TIL: http://icmpcheck.popcount.org/ [19:09:55] and v6 http://icmpcheckv6.popcount.org [20:06:14] hey that's pretty nice [21:20:42] 10Traffic, 10Operations, 10Release, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), 10Train Deployments: Varnish 500 spike following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (10greg) [22:59:35] 10Traffic, 10Android-app-Bugs, 10Operations, 10RESTBase, and 6 others: Varnish 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (10Jdforrester-WMF) [23:01:48] 10Traffic, 10Android-app-Bugs, 10Operations, 10RESTBase, and 6 others: Varnish 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (10Jdforrester-WMF) We dug into the Web request logs. All the 500s were coming from `/api/rest_v1/page/...