[01:09:46] Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993#11127068 (BCornwall) Open→Resolved a:BCornwall
[01:10:10] Traffic, SRE: Recompile fifo-log-demux with hardening options - https://phabricator.wikimedia.org/T342900#11127071 (BCornwall) p:Triage→Low
[01:12:12] Traffic: Migrate MarkMonitor redirection services over to ncredir - https://phabricator.wikimedia.org/T400731#11127072 (BCornwall) In progress→Resolved
[01:45:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[01:50:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[02:15:16] Traffic, Commons, Infrastructure-Foundations, WMF-General-or-Unknown: Upload to Commons fails with basic 2M/64K ADSL connections - https://phabricator.wikimedia.org/T205619#11127182 (Jidanni)
[02:18:52] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[02:21:37] Traffic, Commons, Infrastructure-Foundations, WMF-General-or-Unknown: Upload to Commons fails with basic 2M/64K ADSL connections - https://phabricator.wikimedia.org/T205619#11127185 (Jidanni) @ssingh If you try the experiment in T205619#8909717 you can reproduce the issue.
[02:23:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:44:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:49:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2012 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:59:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[04:04:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[05:10:21] Traffic, DNS, SRE: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11127302 (waldyrious) @BCornwall you had emailed geral@wikimedia.pt (which is an alias to the Wikimedia-PT-internal mailing list -- wikimedia-pt-internal@lists.wikime...
[05:35:15] netops, Infrastructure-Foundations, Patch-For-Review: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#11127323 (ayounsi) After a long wait Juniper says it's not a bug, but they never implemented the feature: > Following discussions with our Engineering team, I’d like t...
[05:44:46] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11127339 (ayounsi) Good job!! Next step is to remove OSPF, feel free to send a patch in that regard. It can restore the previously removed `replace: ospf {}` as well as re...
[06:44:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[06:49:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[07:59:10] Traffic, Continuous-Integration-Infrastructure, Jenkins: Cannot update CI Jenkins jobs - https://phabricator.wikimedia.org/T403089#11127547 (hashar) I have sent a patch #upstream to have a user-agent added: https://review.opendev.org/c/jjb/python-jenkins/+/958730 which would add `python-jenkins...
[08:18:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[08:23:52] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:49:12] Hey :) I'd like to merge these two patches https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182768 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182769 during the infra window, any objections?
[09:55:17] * fabfur checking
[09:56:13] claime: no objections, are those two endpoints currently working?
[09:56:52] fabfur: On the backend? Yes
[09:57:02] +1
[10:00:00] ok cool, merging the first one then
[10:15:46] hmm something's not working
[10:16:32] ?
[10:16:53] Ah
[10:16:59] I think I know
[10:17:13] actually no I don't
[10:17:32] I'm getting a 502 for https://test2.wikipedia.org/w/rest.php/v1/page/Earth when hitting my drmrs cp node
[10:17:42] Let me check something
[10:18:14] on the drmrs node what if you curl the endpoint with the proper `-H "Host: test2.wikipedia.org"` ?
[10:18:48] checking
[10:18:57] works
[10:20:40] I'm wondering if I shouldn't be matching /w/rest.php(.*)
[10:22:30] Huh it didn't rewrite the host correctly wtf
[10:22:37] 20250828.10h16m23s CONNECT: attempt fail [CONNECTION_ERROR] to 10.2.2.76:4113 for host='test2.wikipedia.org' connection_result=Broken pipe [32] error=Broken pipe [32] attempts=3 url='https://mw-api-ext-ro.discovery.wmnet:4113/w/rest.php/v1/page/Earth'
[10:23:11] It should try to connect to rest-gateway.discovery.wmnet:4113
[10:23:21] The port got rewritten correctly though, weird
[10:33:42] fabfur: I'm stumped
[10:36:22] OH
[10:36:32] The multi-dc rewrite
[10:36:53] ?
[10:37:12] My plugin chain rewrites the original host to rest-gateway.discovery.wmnet
[10:37:19] But then it goes through /etc/trafficserver/lua/multi-dc.lua
[10:37:28] And that re-rewrites it to mw-api-ext-ro.discovery.wmnet
[10:37:40] mmm ok
[10:40:03] Hmm I don't know right away how I can get around that
[10:41:03] I'm gonna revert the 2nd patch, and I'll think about it
[10:41:46] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182808
[10:42:52] yep, safer, sorry this morning I'm a little busy with some other stuff but I can review a little more in the afternoon
[10:42:59] fabfur: No problem
[11:00:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:05:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:58:27] claime: let us know if you need some help with that
[11:59:40] vgutierrez: I'm currently writing a CR that changes how we choose the dest_host for multi-dc, that will be up for your review in a few minutes
[12:00:30] cool
[12:02:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182815 feel free to tell me that's absolutely not how it works :D
[12:02:39] I'm going to lunch, back in an hour or so
[13:24:00] Traffic, HaproxyKafka: Missing error field in DLQ messages - https://phabricator.wikimedia.org/T403174 (Fabfur) NEW
[13:43:30] Traffic, HaproxyKafka: Missing error field in DLQ messages - https://phabricator.wikimedia.org/T403174#11128536 (Fabfur) Open→Resolved
[13:45:57] Traffic, HaproxyKafka: Missing Message and Hostname fields in messages sent to DLQ - https://phabricator.wikimedia.org/T403176 (Fabfur) NEW
[13:49:47] netops, Traffic, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11128565 (ssingh) Hi folks: Picking this ticket up as part of Traffic cleaning up our stuff. It seems the current static routes -- even if they matter...
[13:58:55] netops, Traffic, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11128591 (cmooney) >>! In T300877#11128565, @ssingh wrote: > It seems the current static routes -- even if they matter at this stage -- are incorrect s...
[14:07:05] netops, Traffic, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11128621 (ayounsi) +1 to remove them :)
[14:09:52] ty vgutierrez <3 I'll probably deploy that early next week as I have meetings coming and want to babysit the rollout
[14:10:00] it's a pretty big blast radius on that change
[14:10:17] claime: I'm not so sure about dropping __init__ and just reloading ATS
[14:10:39] so I'd depool a host, let puppet do its thing there and check that everything works as expected
[14:11:08] an alternative could be leaving __init__ there but empty
[14:11:13] As in you're not sure it's going to work at all, or that the transition will work?
[14:11:33] it will work, but I'm not sure that reloading ATS would be enough in that case
[14:11:39] it could require a restart
[14:11:42] Ah, it could need a restart, ok
[14:12:13] that's the case for global hooks
[14:12:30] I can put an empty init back in, I don't think that would cost anything, and we avoid a dodgy restart?
[14:12:42] SGTM
[14:12:47] Cool, doing
[14:14:52] Traffic, MW-Interfaces-Team, serviceops, Epic, and 3 others: API Rate Limiting Architecture - https://phabricator.wikimedia.org/T399291#11128703 (daniel)
[14:58:57] Traffic, SRE, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11128998 (bd808)
[15:54:14] netops, Traffic, Infrastructure-Foundations, SRE: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11129257 (ssingh) Hi Netops folks: Thanks for your feedback. Following up again after discussing this with Traffic. We decided that we will do away wi...
[16:04:23] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11129305 (elukey) The `late_command.sh` issue is fixed after https://gerrit.wikimedia.org/r/1182766, I tested a bookworm reimage for cp2043 and it worked nice...
[16:06:36] Traffic, Movement-Insights, Data-Engineering (Q1 FY25/26 July 1st - September 30th): NEW BUG REPORT: Investigate rise in May 2025 Reader metrics - https://phabricator.wikimedia.org/T395934#11129327 (mforns) In the last couple hours, we've deployed the improvements to the automated traffic detection....
[17:04:25] Traffic, SRE: Move ncredir7003 into service and decom ncredir7002 - https://phabricator.wikimedia.org/T395796#11129503 (ssingh) Open→Resolved a:ssingh Seems like the parent task T394263 already tracks this work and `ncredir700[34]` have been moved into service while `ncredir700[12]` have...
[17:05:14] Traffic, SRE: Move ncredir7003 into service and decom ncredir7002 - https://phabricator.wikimedia.org/T395796#11129509 (ssingh)
[17:07:09] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#11129538 (ssingh) Open→Resolved a:ssingh `(doh|durum)700[12]` decommissioned and `(doh|durum)700[34]` are in service now. Tracking was in parent task T394263 so resolving this.
[17:38:50] Traffic, DNS, SRE, WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#11129696 (ssingh) Hi @Asaf: Is there an update to this? We are cleaning up and triaging the tasks and so the context is if there is anything required from our end, or if there is an update from yours.
[17:42:20] vgutierrez: when Puppet changes a file like /etc/haproxy/ipblocks.d/all.map (the Beta Cluster block list) is something else needed to make HAProxy see the new data? Context is T403075 where apparently the change did not take effect when I thought it would.
[17:44:11] it should not: it's a confd::file which calls reload haproxy
[17:44:28] confd::file { '/etc/haproxy/ipblocks.d/all.map':
[17:44:31] reload => '/usr/bin/systemctl reload haproxy.service',
[17:44:39] going to a meeting but I can help check later
[17:44:45] (it's a bit late for vg)
[17:59:52] oh actually
[17:59:53] } else {
[17:59:53] # deployment-prep still uses static configuration of abusers
[17:59:53] $abuse_networks = network::parse_abuse_nets('varnish')
[17:59:53] file { '/etc/haproxy/ipblocks.d/all.map':
[17:59:55] ensure => file,
[17:59:58] content => template('profile/cache/haproxy/ipblocks-all.map.erb'),
[18:00:01] validate_cmd => '/usr/local/bin/check-haproxy-map %',
[18:00:03] }
[18:00:06] this is more suitable
[18:00:14] and hence, a manual reload is required
[18:02:34] so you will need to do a manual reload, similar to the varnish reload
[18:16:58] we could patch that
[18:17:20] and notify Service[haproxy]
[18:28:45] ok. but is there no interaction here with the fact that varnish is still managed manually?
[18:28:51] that we can't fix yet
[18:35:34] Traffic, DNS, FR-donorrelations, SRE: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278#11129811 (ssingh) Hi @EBrill-WMF: I am sorry this slipped through the cracks. Unfortunately, we will be unable to make this work: `donorsatisfaction.wikimedia.org`. The reason is that...
[18:42:35] Traffic, DNS, Sustainability (Incident Followup): Automate DNS depools such that manual commits are not required - https://phabricator.wikimedia.org/T303219#11129836 (ssingh) Open→Resolved a:ssingh This was done in T369366. Site depool for geo DNS is now handled via the cookbook `sre....
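Editor's note on the 18:16-18:17 suggestion ("we could patch that and notify Service[haproxy]"): a minimal sketch of what such a patch might look like, not the actual change. It reuses the resource pasted at 17:59 and adds only a `notify`. It assumes a `Service['haproxy']` resource is declared elsewhere in the catalog; note that Puppet restarts a notified service by default unless that Service resource sets a `restart` command (for example a reload), so whether this yields a reload or a full restart depends on configuration not shown in the log.

```puppet
# Sketch only: same file resource as pasted at 17:59, plus a notify.
file { '/etc/haproxy/ipblocks.d/all.map':
    ensure       => file,
    content      => template('profile/cache/haproxy/ipblocks-all.map.erb'),
    validate_cmd => '/usr/local/bin/check-haproxy-map %',
    # Assumption: Service['haproxy'] exists in the catalog. Puppet refreshes it
    # whenever this file's content changes, so no manual reload is needed.
    notify       => Service['haproxy'],
}
```

The open question raised at 18:28 (interaction with the manually managed Varnish) is not addressed by this sketch.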
[18:47:00] Traffic, DNS, SRE: Monitor DNS delegations - https://phabricator.wikimedia.org/T171470#11129846 (ssingh)
[18:47:55] Traffic, DNS, SRE: Monitor DNS delegations - https://phabricator.wikimedia.org/T171470#11129848 (ssingh) [T402960 as a related task]
[18:49:36] Traffic: varnish-frontend-fetcherr.service crashloop on cp3066 - https://phabricator.wikimedia.org/T387864#11129855 (ssingh) Open→Declined This hasn't happened in a while. I am marking this as resolved since this was basically the only occurrence of this event and I am not even sure what we can i...
[18:57:52] Traffic, Maps, SRE: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11129879 (ssingh) Hi: please see https://lists.wikimedia.org/pipermail/maps-l/2020-August/001729.html and T261424 on general information on why we are restricting thi...
[18:59:17] Traffic, Maps, SRE: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11129884 (ssingh) In general, our approval thus far can only apply to websites and not extensions. So there is a technical restriction that will prevent this from hap...
[19:00:06] Traffic, Data-Engineering-Radar, HaproxyKafka, Observability-Logging, Patch-For-Review: Shutdown varnishkafka webrequest instances - https://phabricator.wikimedia.org/T393772#11129887 (ssingh) @Fabfur: I think this is all done and the alerts have been removed as well. Confirming: can we close th...
[19:03:17] Traffic: Improving the time it takes to run authdns-update - https://phabricator.wikimedia.org/T393602#11129894 (ssingh) Open→Resolved a:ssingh The `gc-authdns-git-repo.timer` has been running monthly and has significantly reduced the time it takes to run `authdns-update`. That, combined with...
[19:08:37] Traffic, Infrastructure-Foundations, Spicerack, SRE-tools: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#11129910 (ssingh) Open→Resolved a:ssingh Thanks a lot to @elukey and @Vo...
[19:10:46] Traffic, Infrastructure-Foundations, Puppet-Core, SRE: allow non-roots to pool/depool certain DNS Discovery services - https://phabricator.wikimedia.org/T250557#11129914 (ssingh) Is there still interest in pursuing this? This overlaps with both I/F and Traffic so we will need to assign it accordi...
[19:12:43] Traffic: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457#11129918 (ssingh) Thanks for reporting this @fgiunchedi! I am a bit split about this too because like Arzhel said, it's actually a feature in that respect. We don't want `anycast-h...
[19:17:09] Traffic, Commons, SRE: HTTP 500 Error while trying to make large tabular JSON data file - https://phabricator.wikimedia.org/T329339#11129936 (ssingh) Open→Resolved a:ssingh This has been open for two years at this point and there has been no follow-up from the reporter, or Traffic, an...
[19:26:18] Traffic, Observability-Metrics, SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266#11129959 (ssingh) Open→Resolved a:ssingh As far as I can tell and especially the ones listed as missing, all dashboards have been ported to Thanos. Please let me know if...
[20:07:26] Traffic, DNS: Monitor DNS delegations - https://phabricator.wikimedia.org/T171470#11130002 (BCornwall) p:Medium→Low ncmonitor doesn't handle constant re-validation of domains - the related task only intends on validating before sending a patch. I think that it could be interesting to place such a...
[22:15:43] FIRING: [13x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[22:20:43] RESOLVED: [34x] HaproxyKafkaSocketDroppedMessages: Unexpected rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[23:15:34] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review: [Rollout Phase 1] Implement redirect-less mobile routing and enable for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595#11130425 (Krinkle)
[23:19:07] Traffic, Hiddenparma: Introduce known-client-identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220 (Scott_French) NEW
[23:19:41] Traffic, Hiddenparma: Introduce known-client-identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11130472 (Scott_French) p:Triage→Medium