[00:13:51] FIRING: FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1081 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[00:28:46] ^^ hey traffic, we had an incident very similar to https://w.wiki/EoqT that involved the above host in the last ~30m or so. I doubt the MSS stuff is related, but would like to discuss w/y'all further tomorrow if that's OK
[04:14:06] FIRING: FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1081 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[07:11:44] Traffic, Hiddenparma, SRE: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119 (Joe) NEW
[07:11:55] Traffic, Hiddenparma, SRE: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11022677 (Joe) p:Triage→High a:Joe→None
[07:26:21] Traffic, Hiddenparma, SRE: Better mapping of requests coming from datacenters/clouds - https://phabricator.wikimedia.org/T400120 (Joe) NEW
[08:14:06] FIRING: FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1081 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:19:22] Traffic, SRE, affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022917 (Joe) Open→Invalid The task is invalid as the bot was indeed using a user-agent that doesn't respect our UA policy., which has been in place since 2010...
[09:56:57] inflatador: the alert itself is related to cirrussearch1081 not accepting connections on port 9200 anymore
[12:49:47] Traffic: Provide a golang-github-confluentinc-confluent-kafka-go-dev version that matches librdkafka capabilities for bullseye - https://phabricator.wikimedia.org/T374232#11023947 (Vgutierrez) Open→Declined Fixed by vendoring the go dependencies
[12:50:27] Traffic, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11023950 (Vgutierrez) p:Triage→Medium
[12:56:48] Traffic, Liberica: Reduce the chances of false positives on MSS clamping alerts - https://phabricator.wikimedia.org/T400155 (Vgutierrez) NEW
[12:56:54] Traffic, Liberica: Reduce the chances of false positives on MSS clamping alerts - https://phabricator.wikimedia.org/T400155#11023976 (Vgutierrez) p:Triage→Medium
[13:03:49] Traffic: depool script should be verbose when it fails to perform a depool - https://phabricator.wikimedia.org/T400156 (Vgutierrez) NEW
[13:03:58] Traffic: depool script should be verbose when it fails to perform a depool - https://phabricator.wikimedia.org/T400156#11024000 (Vgutierrez) p:Triage→Medium
[13:27:18] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159 (Jclark-ctr) NEW
[14:12:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161 (cmooney) NEW p:Triage→Low
[14:25:21] RESOLVED: FermMSS: Unexpected MSS value on 10.2.2.30:9200 @ cirrussearch1081 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=elasticsearch - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[14:37:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5027 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[14:42:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[14:42:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5026 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:47:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5026 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:52:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:57:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:02:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:07:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:12:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:21:59] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11024611 (Jhancock.wm) I'll trash the optic. Good to close if there are no other points to cover.
[15:27:17] Traffic: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796#11024647 (Vgutierrez) Stalled→Resolved a:Vgutierrez We should proceed with the next LTS branch: 3.2
[15:27:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:30:26] Traffic: Refactor haproxy module to support tls/http settings as different types - https://phabricator.wikimedia.org/T341040#11024662 (Fabfur) Open→Resolved This is now obsolete, given the amount of changes that recently impacted haproxy configuration and puppetization
[15:37:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:39:25] Traffic, Patch-For-Review: Upgrade HAProxy to version 3 on cp hosts - https://phabricator.wikimedia.org/T366885#11024738 (Vgutierrez) this task and subtasks are now outdated and we should probably target 3.2, please close them, thanks
[15:41:38] Traffic, Patch-For-Review: Upgrade HAProxy to version 3 on cp hosts - https://phabricator.wikimedia.org/T366885#11024742 (Fabfur) Open→Declined Re-opening with the correct version
[15:41:40] Traffic: HAProxy 3.0 production rollout - https://phabricator.wikimedia.org/T366891#11024744 (Fabfur) Open→Declined
[15:41:47] Traffic: Check production configuration compatibility with HAProxy 3.0 - https://phabricator.wikimedia.org/T366888#11024746 (Fabfur) Open→Declined
[15:41:51] Traffic: Create new haproxy30 component and import bullseye package - https://phabricator.wikimedia.org/T366890#11024748 (Fabfur) Open→Declined
[15:41:55] Traffic: Backport HAProxy 3.0 to Bullseye - https://phabricator.wikimedia.org/T366887#11024752 (Fabfur) Open→Declined
[15:42:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:47:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:57:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[16:02:40] RESOLVED: [3x] VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:30:05] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11025252 (cmooney) Open→Resolved
[18:03:17] Traffic, HaproxyKafka: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 (Fabfur) NEW
[18:25:59] Traffic, Commons, MediaWiki-Uploading, SRE: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11025546 (BCornwall) Open→Stalled I see. Thank you for the response. I'll set this as "stalled". Please do report back if this is continuing!
[18:53:54] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for some gnmic sourced metrics in codfw - https://phabricator.wikimedia.org/T400205 (cmooney) NEW p:Triage→Medium
[18:57:55] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for some gnmic sourced metrics in codfw - https://phabricator.wikimedia.org/T400205#11025679 (cmooney) Hmm so looking a bit closer the issue seems to be counters on cr2-codfw itself ` cmooney@re0.cr2-codfw> show interfaces xe-0/1/1:1 |...
[18:58:12] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025680 (cmooney)
[18:58:42] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025681 (cmooney)
[19:01:45] netops, Infrastructure-Foundations, SRE: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025685 (cmooney) >>! In T400205#11025679, @cmooney wrote: > Perhaps some odd bug to do with the new MPC10E card? This possibly? https://supportportal.juniper....
[19:08:39] netops, Infrastructure-Foundations, SRE: Inaccurate stats repoted by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025713 (cmooney)
[19:09:34] netops, Infrastructure-Foundations, SRE: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025714 (cmooney)
[19:21:56] netops, Infrastructure-Foundations, SRE: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025744 (cmooney) The linked PR on the Juniper site says it was fixed in 23.4R1, we are on 23.4R2, so in theory shouldn't be it. I guess we could try the same fix, probably th...
[21:59:10] Traffic, Experimentation Lab, Patch-For-Review: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#11026106 (KOfori) Open→Resolved Thanks, everyone!