[08:01:34] <wikibugs>	 Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339063 (MoritzMuehlenhoff) For Ganeti I propose the following plan. It allows us to keep all misc services running on magru, so no need to fiddle with...
[08:35:14] <wikibugs>	 Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339088 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:36:00] <wikibugs>	 Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339089 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:58:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:03:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:20:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:25:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:41:55] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339219 (akosiaris)
[09:42:00] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339220 (akosiaris)
[10:02:35] <effie>	 Dear traffic 
[10:02:36] <effie>	 I will be replacing kafka-main1001 with kafka-main1006 - T363214, please let me know if you see anything I may have done
[10:02:37] <stashbot>	 T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214
[10:12:30] <vgutierrez>	 effie: hmm that's gonna impact purged on several DCs
[10:13:25] <effie>	 according to the process, impact should be minimal 
[10:13:40] <effie>	 https://wikitech.wikimedia.org/wiki/Kafka/Administration#Hardware_replace_a_broker
[10:14:05] <effie>	 oh wait, ok 
[10:14:10] <effie>	 it is what it is I guess
[10:14:42] <vgutierrez>	 effie: when are you doing the switch?
[10:14:49] <effie>	 I will be starting now 
[10:14:57] <effie>	 switch will be done after data is copied over 
[10:15:43] <fabfur>	 hope librdkafka supports auto-reconnect if connected to that specific broker...
[10:15:57] <effie>	 we did this process a while ago on codfw 
[10:16:18] <effie>	 task writes: 
[10:16:18] <effie>	 Before replacing the first broker here, give traffic a headsup so they can monitor for high latency of cache purges (purged). We saw that happening during T363210: kafka-main200[6789] and kafka-main2010 implementation tracking but I think this is not an issue anymore with the transfer.py approach.
[10:16:19] <stashbot>	 T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[10:16:42] <effie>	 I should be getting started, since transfer will take a while 
[10:16:47] <effie>	 vgutierrez: fabfur go?
[10:17:20] <vgutierrez>	 effie: yeah.. I'm aware of that, I'll keep track of purged lag with https://grafana.wikimedia.org/goto/i_thDI7Ng?orgId=1
[10:17:25] <vgutierrez>	 effie: please go ahead
[10:17:27] <fabfur>	 I think if it's been tested, it's ok for me
[10:17:29] <vgutierrez>	 and thanks for the heads up
[10:17:41] <fabfur>	 I'll keep an eye on the dashboard too
[10:49:08] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339426 (akosiaris) I've gone ahead and blocked   ` GET <domain>/api/rest_v1/list/pair/{from}/{to}/ GET <domain>/api/rest_v1/...
[11:18:03] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339491 (akosiaris) I just noticed that the `wikimedia.org` argument applies to   ` GET <domain>/api/rest_v1/list/pair/{from}/...
[11:21:22] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339495 (akosiaris)
[11:22:41] <vgutierrez>	 effie: already running? no noticeable impact 
[11:22:50] <effie>	 yes it is 
[11:23:04] <vgutierrez>	 cool
[11:55:35] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339608 (MSantos) >>! In T375616#10339426, @akosiaris wrote: >  [...] > @MSantos does the above look correct to you?   Yes, it...
[11:57:05] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339611 (akosiaris)
[12:03:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:25] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:29] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339689 (akosiaris)
[12:13:25] <jinxer-wm>	 FIRING: [15x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:41] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339700 (akosiaris) Open→Resolved All done. Resolving, hopefully we won't have to reopen.
[12:15:00] <wikibugs>	 Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339708 (akosiaris) Courtesy of requestctl, we have the following [superset view](https://superset.wikimedia.org/superset/...
[12:18:25] <jinxer-wm>	 FIRING: [20x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:37] <vgutierrez>	 fabfur: ^^ did you trigger a puppet run after merging the revert?
[12:20:10] <fabfur>	 yep, already run on eqsin
[12:20:17] <fabfur>	 now running on ulsfo
[12:20:24] <fabfur>	 should resolve shortly
[12:20:34] <wikibugs>	 Traffic, RESTBase, RESTBase Sunsetting, serviceops: Block traffic to RESTBase /page/related endpoint before it's deprecated - https://phabricator.wikimedia.org/T376297#10339727 (MSantos) p:Triage→Medium
[12:21:13] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339739 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp7007.magru.wmnet with...
[12:23:25] <jinxer-wm>	 RESOLVED: [20x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:59] <fabfur>	 ^^ should be all resolved now
[12:38:25] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339847 (RobH) Work rescheduled after conversation with both @ssingh and @MoritzMuehlenhoff regarding ganeti host cadence and swa...
[12:44:02] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339856 (ssingh) Thanks @RobH , sounds good!  >>! In T376737#10339847, @RobH wrote: > Work rescheduled after conversation with bo...
[12:54:40] <jinxer-wm>	 FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:04:40] <jinxer-wm>	 FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:09:40] <jinxer-wm>	 FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:17:04] <effie>	 anyone up to +1 https://gerrit.wikimedia.org/r/1093330  (kafka)?
[13:17:42] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340002 (RobH)
[13:17:46] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340017 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp7007.magru.wmnet with OS...
[13:24:40] <jinxer-wm>	 FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:29:40] <jinxer-wm>	 RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:59:43] <wikibugs>	 Traffic, Data-Platform-SRE: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373 (Fabfur) NEW
[14:20:29] <effie>	 vgutierrez: how are things on your end?
[14:22:09] <vgutierrez>	 effie: no purged alerts, we got two lag spikes but not too bad
[14:23:22] <effie>	 great
[14:23:24] <vgutierrez>	 as expected we got errors on the consumers
[14:23:27] <vgutierrez>	 Nov 20 14:17:39 cp6009 purged[372633]: %4|1732112259.413|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10406 ms without a successful response from the group coordinator (broker 1001, last error was Broker: Not coordinator): revoking assignment and rejoining group
[14:23:37] <vgutierrez>	 but it recovered nicely
[14:23:55] <effie>	 lovely to hear that 
[14:28:19] <vgutierrez>	 effie: hmmm interesting, was it expected that we'd get connection issues against the whole cluster?
[14:29:58] <vgutierrez>	 https://www.irccloud.com/pastebin/POuh0HYF/
[14:30:13] <vgutierrez>	 see cp1100 purged log as an example
[14:31:20] <effie>	 the whole cluster?
[14:32:10] <vgutierrez>	 not the whole cluster, sorry
[14:32:11] <effie>	 to my knowledge no
[14:32:13] <vgutierrez>	 just several nodes
[14:32:19] <vgutierrez>	 kafka-main1002 and 1003 for cp1100
[14:32:34] <effie>	 those never left the cluster
[14:32:47] <effie>	 1001 is the old one and 1006 is the new one
[14:32:49] <vgutierrez>	 but apparently they weren't accepting new connections
[14:33:35] <vgutierrez>	 other hosts were reporting issues against other nodes... 1004, 1005...
[14:35:56] <vgutierrez>	 apparently some step of the maintenance impacted connectivity against kafka-main@eqiad
[14:36:51] <effie>	 the firewall rules have been put in place since yesterday
[14:37:45] <vgutierrez>	 effie: could a puppet run flush the whole set of rules, leaving the firewall in a state of DROP by default?
[14:38:22] <effie>	 I don't see how, the new hosts were added yesterday in ferm 
[14:40:10] <effie>	 I will be back in about 30', then we can look into it
[14:40:33] <effie>	 I will tell serviceops in the meantime
[14:41:17] <vgutierrez>	 effie: oh
[14:41:19] <vgutierrez>	 Nov 20 13:35:42 kafka-main1005 kafka-server-start[3086112]: [2024-11-20 13:35:42,925] INFO [KafkaServer id=1005] shutting down (kafka.server.KafkaServer)
[14:41:34] <vgutierrez>	 Nov 20 13:35:42 kafka-main1005 kafka-server-start[3086112]: [2024-11-20 13:35:42,927] INFO [KafkaServer id=1005] Starting controlled shutdown (kafka.server.KafkaServer)
[14:42:08] <vgutierrez>	 that explains the connectivity issues I was seeing :)
[14:43:50] <vgutierrez>	 and we got the same events on the other servers
[14:44:01] <vgutierrez>	 (at different times)
[14:48:54] <effie>	 ok thanks
[15:11:17] <wikibugs>	 Traffic: Upgrade lshw on all cp hosts - https://phabricator.wikimedia.org/T380295#10340520 (Fabfur) Open→Resolved
[15:11:39] <wikibugs>	 Traffic: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891#10340523 (Fabfur) In progress→Resolved Deployed on all sites
[15:16:40] <jinxer-wm>	 FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:21:40] <jinxer-wm>	 FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:26:40] <jinxer-wm>	 FIRING: [13x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:31:40] <jinxer-wm>	 FIRING: [15x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:32:03] <sukhe>	 cp5017 is having a rough day
[15:32:23] <effie>	 vgutierrez: can you share the dashboard where you saw the reported connectivity problems?
[15:32:41] <vgutierrez>	 effie: purged logs
[15:33:04] <vgutierrez>	 effie: you got an example here: https://www.irccloud.com/pastebin/POuh0HYF/
[15:33:47] <vgutierrez>	 effie: you got similar logs on every cp instance on eqiad, esams, drmrs & magru 
[15:34:05] <effie>	 alright, is it still ongoing ?
[15:34:10] <vgutierrez>	 effie: nope
[15:34:13] <effie>	 ah! 
[15:34:24] <vgutierrez>	 it matches the timestamps where kafka daemons were restarted
[15:34:40] <effie>	 alright, then it is all well, I will add it to the docs
[15:34:42] <vgutierrez>	 but I wasn't expecting to see kafka being restarted on each instance
[15:35:08] <effie>	 excellent
[15:35:20] <vgutierrez>	 is it actually needed or is it an oversight of our automation?
[15:42:21] <effie>	 according to the docs it is, since we are updating the configuration 
[15:46:40] <jinxer-wm>	 FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:51:40] <jinxer-wm>	 RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:51:54] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340716 (RobH) Confirmed new window with Willy and sent update to ticket:    > Support, Can we shift this to work on Monday, Nove...
[16:57:59] <wikibugs>	 Traffic: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797#10340936 (Fabfur) a:Fabfur
[16:58:28] <wikibugs>	 Traffic, envoy, serviceops, SRE: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10340940 (jijiki) p:Triage→Medium
[18:59:17] <wikibugs>	 Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10341561 (RobH) New work window confirmed by ascenty:  Comment generated in Smart Hands: Hello,    > We received the Ticket and sc...