[01:34:09] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (ArthurPSmith) >>! In T243701#5855439, @ArielGlenn wrote: >>>! In T243701#5855352, @Lea_Lacroix_WMDE wro...
[10:15:51] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4022.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
[10:46:59] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4022.ulsfo.wmnet'] ` and were **ALL** successful.
[11:26:31] Traffic, Operations: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (Vgutierrez)
[11:26:34] Traffic, Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (Vgutierrez)
[12:42:22] Traffic, Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4021.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002071242_vgutie...
[13:14:30] Traffic, Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4021.ulsfo.wmnet'] ` and were **ALL** successful.
[13:31:24] netops, Operations: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (ayounsi) a:ayounsi→faidon As it matches the reboot of cr3-knams I'd say the optic on that side needs to be replaced. (maybe the power fluctuation damaged it?). But as the optic looks fine on the CLI, maybe i...
[13:42:16] netops, Operations: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (faidon) Please file a #procurement task for Willy/Rob to execute on :)
[13:49:13] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (ema) >>! In T241145#5856750, @Gilles wrote:> > We will keep an eye on the trend in coming days to check how much of a dent it...
[14:02:26] Traffic, Operations, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (Vgutierrez) So I had a beautiful CR ready to log this data, but ATS's ability to filter logs based on IPs is currently broken, so I just applied the CR manually on cp1075...
[14:05:30] netops, Operations: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (ayounsi) Opened T244574.
[14:09:51] netops, Operations: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (ayounsi) cr3-knams got upgraded to 18 yesterday. Waiting to see if the issue happens again.
[14:12:47] bblack: https://phabricator.wikimedia.org/T244538#5859612
[14:27:17] we have (at least) two kinds of delays there
[14:28:37] Date:2020-02-07 Time:13:57:00 Client-IP:2620:0:861:ed1a::1 ReqMethod:GET RespStatus:200 OriginStatus:200 ReqHeader:Host:en.wikipedia.org ReqURL:https://en.wikipedia.org/favicon.ico UABegin:0 UAFirstRead:0 UAReadHeaderDone:0 CacheOpenReadBegin:0 CacheOpenReadEnd:0 DNSLookupBegin:0 DNSLookupEnd:20 ServerConnect:20 ServerConnectEnd:20 UAFirstRead:0 ServerFirstRead:20 ServerReadHeaderDone:20 ServerClose:20 UAWrite:20 UAClose:20
[14:28:37] SMFinish:20 PluginActive:0 PluginTotal:0
[14:28:49] like this where ats-tls loses 20ms doing a DNS lookup
[14:29:16] or like this
[14:29:17] Date:2020-02-07 Time:13:57:01 Client-IP:2620:0:861:ed1a::1 ReqMethod:GET RespStatus:200 OriginStatus:200 ReqHeader:Host:en.wikipedia.org ReqURL:https://en.wikipedia.org/favicon.ico UABegin:0 UAFirstRead:0 UAReadHeaderDone:0 CacheOpenReadBegin:0 CacheOpenReadEnd:0 DNSLookupBegin:79 DNSLookupEnd:79 ServerConnect:80 ServerConnectEnd:80 UAFirstRead:0 ServerFirstRead:80 ServerReadHeaderDone:80 ServerClose:80 UAWrite:80 UAClose:80
[14:29:17] SMFinish:80 PluginActive:0 PluginTotal:0
[14:29:27] where ats-tls loses 79ms without doing anything at all, apparently
[14:30:24] the DNS lookup was reenabled here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/564929, now that the underlying issue has been fixed by https://github.com/apache/trafficserver/pull/6332
[14:30:36] I'd say we can backport it and re-disable DNS lookup on ats-tls
[14:31:17] regarding the second kind of delay... no idea (yet) :)
[14:32:55] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4031.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
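Note on reading the two milestone log lines quoted at 14:28:37 and 14:29:17: each milestone value appears to be a cumulative offset in milliseconds, so the delay shows up as the jump between two consecutive milestones. The sketch below is only an illustration, not the tooling used in the channel. The function names `parse_milestones` and `largest_gap` are hypothetical, and the milestone order is assumed to be the order in which the fields are printed in the quoted lines.

```python
# Assumption: milestones are listed in the order they appear in the quoted
# log lines; this is not taken from ATS documentation.
MILESTONE_ORDER = [
    "UABegin", "UAFirstRead", "UAReadHeaderDone",
    "CacheOpenReadBegin", "CacheOpenReadEnd",
    "DNSLookupBegin", "DNSLookupEnd",
    "ServerConnect", "ServerConnectEnd",
    "ServerFirstRead", "ServerReadHeaderDone", "ServerClose",
    "UAWrite", "UAClose", "SMFinish",
]


def parse_milestones(line):
    """Return {milestone: offset_ms} for known milestones found in a log line."""
    found = {}
    for token in line.split():
        name, sep, value = token.partition(":")
        # Keep the first occurrence only: UAFirstRead is printed twice per line.
        if sep and name in MILESTONE_ORDER and name not in found and value.isdigit():
            found[name] = int(value)
    return found


def largest_gap(milestones):
    """Return (previous, next, gap_ms) for the biggest jump between consecutive milestones."""
    ordered = [(n, milestones[n]) for n in MILESTONE_ORDER if n in milestones]
    best = None
    for (prev_name, prev_ms), (next_name, next_ms) in zip(ordered, ordered[1:]):
        gap = next_ms - prev_ms
        if best is None or gap > best[2]:
            best = (prev_name, next_name, gap)
    return best


if __name__ == "__main__":
    sample = (
        "CacheOpenReadBegin:0 CacheOpenReadEnd:0 DNSLookupBegin:79 "
        "DNSLookupEnd:79 ServerConnect:80 ServerConnectEnd:80"
    )
    print(largest_gap(parse_milestones(sample)))
```

Under these assumptions, the first quoted line puts its largest gap between DNSLookupBegin and DNSLookupEnd (the DNS lookup), while the second puts it between CacheOpenReadEnd and DNSLookupBegin, which is the unexplained 79ms.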
[14:54:46] Traffic, Operations, MW-1.35-notes (1.35.0-wmf.18; 2020-02-04), Performance Issue, and 2 others: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (Addshore) Open→Resolved a:Addshore I...
[15:00:16] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4031.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4031.ulsfo.wmnet'] `
[15:26:46] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4030.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
[15:29:36] Traffic, Operations: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (Vgutierrez)
[15:53:36] netops, Operations: Upgrade routers - https://phabricator.wikimedia.org/T243080 (ayounsi) Current dates are: Feb. 11th - 21:00UTC - 1h - cr1-eqsin - eqsin will be depooled (this is when eqsin sees the least traffic) Feb. 12th - 13:00UTC - 2h - cr2/3-esams
[15:53:53] Traffic, Operations, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (Vgutierrez) To mitigate the DNS based delay that can be seen on the milestone log, I've backported https://github.com/apache/trafficserver/pull/6332 in https://gerrit.w...
[15:55:30] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4030.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4030.ulsfo.wmnet'] `
[16:05:24] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (Vgutierrez)
[19:27:01] hmm seeing the delays in our milestone log
[19:27:29] it looks like our build is missing https://github.com/apache/trafficserver/pull/6103
[19:27:42] I'll backport it on Monday
[21:55:10] netops, Operations: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (Jgreen)
[21:55:27] netops, Operations: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (Jgreen) p:Triage→Unbreak!
[21:57:41] XioNoX: uh oh, are you around?
[21:58:31] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Pchelolo) RESTBase itself seems to be working correctly. Something is wrong with routing before RESTBase. If you look at https://en.wikipedia.beta.wmfl...
[22:36:24] netops, Operations: some outbound is TCP failing from fundraising cluster as of approx 2020-02-07 16:15UTC - https://phabricator.wikimedia.org/T244610 (Dwisehaupt) On our end, I fixed the typo that would have alerted us to this much earlier in the day. [frack::puppet] 6247eb80 Fix typo in critical alter...
[23:16:31] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) >>! In T244586#5860963, @Mholloway wrote: > To help isolate possible culprits, when was this last working? Yesterday, but not sure w...
[23:20:50] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Krenair) Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace some nginx/varnish stuff with ATS. May be related?
[23:21:26] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) >>! In T244586#5861250, @Krenair wrote: > Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace > some nginx/var...
[23:22:14] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Mholloway) It looks like routing for /api/rest_v1/ in Beta is set up in the prefix puppet settings for deployment-cache-text (as seen [[ https://gerri...
[23:23:06] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (MarcoAurelio) Not sure T243226 might be related.
[23:24:21] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mon Jan 20 10:54:08 UTC 2020 (26669 minutes a...
[23:25:17] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Pchelolo) >>! In T244586#5861273, @Jdforrester-WMF wrote: > deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mo...
[23:26:03] Traffic, Beta-Cluster-Infrastructure, Operations, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) Ah, right, that's the please-upgrade-puppet task that @MarcoAurelio linked above.
[23:43:44] Traffic, Operations, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (Krinkle)
[23:43:48] Traffic, Operations, Performance-Team, Patch-For-Review, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (Krinkle)