[08:01:34] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339063 (MoritzMuehlenhoff) For Ganeti I propose the following plan. It allows us to keep all misc services running on magru, so no need to fiddle with...
[08:35:14] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339088 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:36:00] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339089 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:58:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:03:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:20:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:25:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[09:41:55] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339219 (akosiaris)
[09:42:00] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339220 (akosiaris)
[10:02:35] Dear traffic
[10:02:36] I will be replacing kafka-main1001 with kafka-main1006 - T363214, please let me know if you see anything I may have done
[10:02:37] T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214
[10:12:30] effie: hmm that's gonna impact purged on several DCs
[10:13:25] according to the process, impact should be minimal
[10:13:40] https://wikitech.wikimedia.org/wiki/Kafka/Administration#Hardware_replace_a_broker
[10:14:05] oh wait, ok
[10:14:10] it is what it is I guess
[10:14:42] effie: when are you doing the switch?
[10:14:49] I will be starting now
[10:14:57] switch will be done after data is copied over
[10:15:43] hope librdkafka supports auto-reconnect if connected to that specific broker...
[10:15:57] we did this process a while ago on codfw
[10:16:18] task writes:
[10:16:18] Before replacing the first broker here, give traffic a headsup so they can monitor for high latency of cache purges (purged). We saw that happening during T363210: kafka-main200[6789] and kafka-main2010 implementation tracking but I think this is not an issue anymore with the transfer.py approach.
[10:16:19] T363210: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210
[10:16:42] I should be getting started, since transfer will take a while
[10:16:47] vgutierrez: fabfur go?
[10:17:20] effie: yeah.. I'm aware of that, I'll keep track of purged lag with https://grafana.wikimedia.org/goto/i_thDI7Ng?orgId=1
[10:17:25] effie: please go ahead
[10:17:27] I think if it's been tested, ok for me
[10:17:29] and thanks for the heads up
[10:17:41] I'll keep an eye on the dashboard too
[10:49:08] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339426 (akosiaris) I 've gone ahead and blocked ` GET /api/rest_v1/list/pair/{from}/{to}/ GET /api/rest_v1/...
[11:18:03] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339491 (akosiaris) I just noticed that the `wikimedia.org` argument applies to ` GET /api/rest_v1/list/pair/{from}/...
[11:21:22] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339495 (akosiaris)
[11:22:41] effie: already running? no noticeable impact
[11:22:50] yes it is
[11:23:04] cool
[11:55:35] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339608 (MSantos) >>! In T375616#10339426, @akosiaris wrote: > [...] > @MSantos does the above look correct to you? Yes, it...
[11:57:05] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339611 (akosiaris)
[12:03:25] FIRING: [2x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:25] FIRING: [10x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:29] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339689 (akosiaris)
[12:13:25] FIRING: [15x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:41] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339700 (akosiaris) Open→Resolved All done. Resolving, hopefully we won't have to reopen.
[12:15:00] Traffic, CX-cxserver, RESTBase Sunsetting, Essential-Work: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616#10339708 (akosiaris) Courtesy of requestctl, we have the following [superset view](https://superset.wikimedia.org/superset/...
[12:18:25] FIRING: [20x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:37] fabfur: ^^ did you trigger a puppet run after merging the revert?
[12:20:10] yep, already run on eqsin
[12:20:17] now running on ulsfo
[12:20:24] should resolve shortly
[12:20:34] Traffic, RESTBase, RESTBase Sunsetting, serviceops: Block traffic to RESTBase /page/related endpoint before it's deprecated - https://phabricator.wikimedia.org/T376297#10339727 (MSantos) p:Triage→Medium
[12:21:13] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339739 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp7007.magru.wmnet with...
[12:23:25] RESOLVED: [20x] SystemdUnitFailed: haproxykafka.service on cp4037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:59] ^^ should be all resolved now
[12:38:25] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339847 (RobH) Work rescheduled after conversation with both @ssingh and @MoritzMuehlenhoff regarding ganeti host cadence and swa...
[12:44:02] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339856 (ssingh) Thanks @RobH , sounds good! >>! In T376737#10339847, @RobH wrote: > Work rescheduled after conversation with bo...
[12:54:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:04:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:09:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:17:04] anyone up to +1 https://gerrit.wikimedia.org/r/1093330 (kafka)?
[13:17:42] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340002 (RobH)
[13:17:46] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340017 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp7007.magru.wmnet with OS...
[13:24:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:29:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:59:43] Traffic, Data-Platform-SRE: Allow TLS authenticated client to write on new topics - https://phabricator.wikimedia.org/T380373 (Fabfur) NEW
[14:20:29] vgutierrez: how are things on your end?
[14:22:09] effie: no purged alerts, we got two lag spikes but not too bad
[14:23:22] great
[14:23:24] as expected we got errors on the consumers
[14:23:27] Nov 20 14:17:39 cp6009 purged[372633]: %4|1732112259.413|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10406 ms without a successful response from the group coordinator (broker 1001, last error was Broker: Not coordinator): revoking assignment and rejoining group
[14:23:37] but it recovered nicely
[14:23:55] lovely to hear that
[14:28:19] effie: hmmm interesting, was it expected that we got connection issues against the whole cluster?
[14:29:58] https://www.irccloud.com/pastebin/POuh0HYF/
[14:30:13] see cp1100 purged log as an example
[14:31:20] the whole cluster?
[14:32:10] not the whole cluster sorry
[14:32:11] to my knowledge no
[14:32:13] just several nodes
[14:32:19] kafka-main1002 and 1003 for cp1100
[14:32:34] those never left the cluster
[14:32:47] 1001 is the old one and 1006 is the new one
[14:32:49] but apparently they weren't accepting new connections
[14:33:35] other hosts were reporting issues against other nodes... 1004, 1005...
[14:35:56] apparently some step of the maintenance impacted connectivity against kafka-main@eqiad
[14:36:51] the firewall rules have been put in place since yesterday
[14:37:45] effie: a puppet would could flush the whole set of rules leaving the firewall on a state of DROP by defualt?
[14:37:46] *default?
[14:37:50] *puppet run
[14:38:22] I don't see how, the new hosts were added yesterday in ferm
[14:40:10] I will be back in about 30' we can look into it
[14:40:33] I will tell serviceops in the meantime
[14:41:17] effie: oh
[14:41:19] Nov 20 13:35:42 kafka-main1005 kafka-server-start[3086112]: [2024-11-20 13:35:42,925] INFO [KafkaServer id=1005] shutting down (kafka.server.KafkaServer)
[14:41:34] Nov 20 13:35:42 kafka-main1005 kafka-server-start[3086112]: [2024-11-20 13:35:42,927] INFO [KafkaServer id=1005] Starting controlled shutdown (kafka.server.KafkaServer)
[14:42:08] that explains the connectivity issues I was seeing :)
[14:43:50] and we got the same events on the other servers
[14:44:01] (at different times)
[14:48:54] ok thanks
[15:11:17] Traffic: Upgrade lshw on all cp hosts - https://phabricator.wikimedia.org/T380295#10340520 (Fabfur) Open→Resolved
[15:11:39] Traffic: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891#10340523 (Fabfur) In progress→Resolved Deployed on all sites
[15:16:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:21:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:26:40] FIRING: [13x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:31:40] FIRING: [15x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:32:03] cp5017 is having a rough day
[15:32:23] vgutierrez: can you share the dashboard where you saw the reported connectivity problems?
[15:32:41] effie: purged logs
[15:33:04] effie: you got an example here: https://www.irccloud.com/pastebin/POuh0HYF/
[15:33:47] effie: you got similar logs on every cp instance on eqiad, esams, drmrs & magru
[15:34:05] alright, is it still ongoing ?
[15:34:10] effie: nope
[15:34:13] ah!
[15:34:24] it matches the timestamps where kafka daemons were restarted
[15:34:40] alright, then it is all well, I will add it to the docs
[15:34:42] but I wasn't expecting to see kafka being restarted on each instance
[15:35:08] excellent
[15:35:20] is it actually needed or is it an oversight of our automation?
[15:42:21] according to the docs it does, since we are updating the configuration
[15:46:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:51:40] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[15:51:54] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340716 (RobH) Confirmed new window with Willy and sent update to ticket: > Support, Can we shift this to work on Monday, Nove...
[16:57:59] Traffic: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797#10340936 (Fabfur) a:Fabfur
[16:58:28] Traffic, envoy, serviceops, SRE: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10340940 (jijiki) p:Triage→Medium
[18:59:17] Traffic, ops-magru, SRE, Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10341561 (RobH) New work window confirmed by ascenty: Comment generated in Smart Hands: Hello, > We received the Ticket and sc...
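For following up on the purged consumer errors discussed between 14:23 and 15:34, here is a minimal Python sketch that pulls the host, event time, timeout and coordinator broker id out of a SESSTMOUT line, so the events can be lined up against broker restart times. The regular expression is an assumption built from the single sample line pasted at 14:23:27, and the name parse_session_timeout is hypothetical; neither is part of purged or librdkafka.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the 14:23:27 message in this log.
SAMPLE = (
    "Nov 20 14:17:39 cp6009 purged[372633]: %4|1732112259.413|SESSTMOUT|"
    "purged#consumer-1| [thrd:main]: Consumer group session timed out "
    "(in join-state steady) after 10406 ms without a successful response "
    "from the group coordinator (broker 1001, last error was Broker: "
    "Not coordinator): revoking assignment and rejoining group"
)

# Assumption: this pattern is derived from the one sample above and may
# not match other librdkafka log variants.
SESSTMOUT_RE = re.compile(
    r"^\w{3} +\d+ [\d:]+ (?P<host>\S+) purged\[\d+\]: "
    r"%\d\|(?P<epoch>\d+\.\d+)\|SESSTMOUT\|.*?"
    r"after (?P<timeout_ms>\d+) ms .*?"
    r"\(broker (?P<broker>-?\d+), last error was (?P<error>[^)]+)\)"
)


def parse_session_timeout(line):
    """Return a dict describing a SESSTMOUT event, or None if no match."""
    m = SESSTMOUT_RE.search(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        # The epoch field inside the message is converted to UTC.
        "when": datetime.fromtimestamp(float(m["epoch"]), tz=timezone.utc),
        "timeout_ms": int(m["timeout_ms"]),
        "broker": int(m["broker"]),
        "error": m["error"],
    }


event = parse_session_timeout(SAMPLE)
print(event)
```

The "when" value can then be compared with the timestamps of the "Starting controlled shutdown" lines from kafka-server-start on each broker to check whether a given timeout falls inside a restart window.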