[01:42:07] serviceops, MW-on-K8s, Shellbox, SRE-swift-storage: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10086764 (tstarling) >>! In T292322#9614342, @Joe wrote: > @tstarling I think we determined that the expensive part of handling large files in shellbox was mostly the do...
[05:50:22] serviceops, MW-on-K8s, Shellbox, SRE-swift-storage: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10086879 (tstarling) Open→In progress a:tstarling
[09:39:30] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:01:00] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087262 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:31:08] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:36:30] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087370 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:42:03] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:45:51] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087380 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:51:30] serviceops, DBA, SRE, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10087394 (Tgr) Related: {T198755} >>!...
[11:00:48] serviceops, DBA, SRE, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10087406 (Ladsgroup) >>! In T372943#10...
[11:21:59] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[11:28:56] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087482 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[11:29:19] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087484 (Clement_Goubert)
[11:31:20] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087486 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[11:34:23] serviceops, All-and-every-Wikisource, Thumbor, Patch-For-Review: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC - https://phabricator.wikimedia.org/T372470#10087492 (hnowlan) Open→Stalled This appears to have dropped. Leaving open to get patches resolved at a l...
[12:13:46] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087551 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[12:46:11] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189 (JMeybohm) NEW
[12:46:15] serviceops: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214#10087619 (JMeybohm)
[12:46:16] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10087618 (JMeybohm)
[12:46:18] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10087620 (JMeybohm)
[13:09:16] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10087686 (brouberol) I'll add some thoughts as well.
In our testing, we haven't been able to throttle the follower replication traffic for a given broker //in the absence of a reassignment//. It would...
[13:16:29] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10087703 (JMeybohm)
[13:25:26] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195 (bking) NEW
[13:30:09] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195#10087777 (bking) a:brouberol→None
[13:30:10] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10087779 (JMeybohm)
[13:30:51] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087782 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[13:31:58] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10087795 (JMeybohm)
[13:33:08] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10087797 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1f9d2e89-ff2c-47c0-ae0e-3a1dd3fcd648) set by jayme@cumin1002 for 4 days, 0:00:00 on 1 host(s)...
[13:33:09] serviceops, Patch-For-Review: kafka-main200[6789] and kafka-main2010 implementation tracking - https://phabricator.wikimedia.org/T363210#10087791 (JMeybohm)
[13:41:57] jayme: re: kafka-main2* how sure are we that we actually saturated links?
[13:47:33] cdanis: we were at ~85MB/s
[13:47:37] I see about 600 Mbps so not saturated bw-wise but packet drops
[13:48:57] jayme: https://librenms.wikimedia.org/graphs/to=1724420700/id=28923/type=port_bits/from=1724247900/
[13:49:33] 700mbit ingress, no drops on the switch side, nic-saturation-exporter metrics show that we weren't micro-bursting above like 800mbit/s
[13:49:38] we can probably say "too much to keep up with the latency expectation of eventgate"
[13:49:44] yeah, that I believe
[13:49:54] heh yeah
[13:50:06] did the eventgate liveness probe also depend upon kafka?
[13:50:16] no
[13:50:21] that's something
[13:50:23] that does a tcp probe
[13:51:28] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087860 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[13:51:42] cdanis: is nic-saturation plotted somewhere?
[13:52:31] jayme: yeah, on the standard host dashboard, and there's also a heatmap on the standard cluster dashboard
[13:52:44] but the metrics that are plotted there, were all 0 for the entire duration
[13:53:21] ah, it's the right hand axis on network errors here? https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin&from=now-2d&orgId=1&refresh=5m&to=now&var-cluster=kafka_main&var-datasource=thanos&var-server=kafka-main2006&viewPanel=11
[13:54:22] yeah
[13:54:33] so, the metrics are a little tricky
[13:55:10] every ~1hz the exporter wakes up and looks at the change in rx/tx bytes across all physical interfaces
[13:55:31] if it was >=80% of line speed it increments a 'warm' metric, if it was >=90% of line speed it increments 'hot'
[13:55:52] and then ofc in prom we sample that counter (which is like, seconds-per-second, on average)
[13:56:07] ah, I see.
Thanks
[13:56:17] https://grafana.wikimedia.org/goto/NAvB0m3IR?orgId=1
[13:56:27] sorry for not subscribing you to T373189 - I forgot
[13:56:30] np!
[13:56:41] it was only for about 1 minute, at 11:58, that kafka-main2006 was 'warm' at all
[13:56:49] err 11:57
[13:57:06] I see...
[13:57:27] yeah. My feeling is that there wasn't actually an issue until eventgate made one up
[13:57:33] yeah I think so too
[13:57:55] I did some other digging around on the kafka nodes and I didn't find anything else that looked like it saturated
[13:57:59] all other kafka clients did not even flinch
[13:58:19] disk i/o queue was fine, all the other NICs were fine, nothing visible in terms of errors or drops on the switch side either (checked all the kafka-main2* ports in librenms)
[13:58:49] <3 thanks for digging
[13:58:54] it wasn't some goofy thing where we were saturating core 0 or anything on kafka-main2006 either https://grafana.wikimedia.org/d/OERePosZk/cdanis-host-per-cpu-skew?orgId=1&var-datasource=codfw+prometheus%2Fops&var-cluster=kafka_main&var-instance=kafka-main2006%3A9100&from=1724318938394&to=1724340978145
[14:02:05] I think there is still some room to improve the process (mainly actively moving leadership away from the to-be-removed broker). But the main part is probably just in changing the eventgate readiness to something that checks if the core schema is loaded and returns 200
[14:03:03] +1
[14:03:21] maybe setting a fixed timeout for calls to kafka as well to not let clients hang waiting, but I don't really know enough
[14:16:40] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10087989 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
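(Annotation) The 'warm'/'hot' counting described at 13:55 can be sketched roughly as below. This is only an illustration of the logic as stated in the chat, not the actual nic-saturation-exporter code: the names `SaturationCounters` and `observe` are hypothetical, each direction (rx or tx) is assumed to be classified on its own, a 'hot' interval is assumed to also count as 'warm', and the 1 Gbit/s line speed in the usage lines is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SaturationCounters:
    """Number of sampling intervals spent at >=80% ('warm') and >=90% ('hot') of line speed."""
    warm: int = 0
    hot: int = 0


def observe(counters: SaturationCounters, delta_bytes: int,
            interval_s: float, line_speed_bps: float) -> float:
    """Classify one sampling interval from the change in an interface byte counter.

    Returns the utilisation as a fraction of line speed.
    """
    utilisation = (delta_bytes * 8) / (interval_s * line_speed_bps)
    if utilisation >= 0.8:
        counters.warm += 1  # assumption: a 'hot' interval is also counted as 'warm'
    if utilisation >= 0.9:
        counters.hot += 1
    return utilisation


# Usage: one-second intervals on an assumed 1 Gbit/s NIC.
counters = SaturationCounters()
observe(counters, 85_000_000, 1.0, 1_000_000_000)   # ~85 MB/s, below both thresholds
observe(counters, 105_000_000, 1.0, 1_000_000_000)  # 84% of line speed: warm only
observe(counters, 120_000_000, 1.0, 1_000_000_000)  # 96% of line speed: warm and hot
```

Because the exporter wakes up about once per second, each increment stands for roughly one second spent above the threshold, which is why the per-second rate of the counter in Prometheus reads as "seconds-per-second".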
[14:37:34] serviceops, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10088060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[18:48:28] serviceops, Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10088734 (Scott_French) Following up on the status of the php-geoip extension (h/t to @Krinkle for all the discussion out of band): In Debian, there is no php-geoip package after bullseye...
[19:14:28] serviceops, DBA, SRE, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10088802 (CDanis) >>! In T372943#10083...
[19:27:58] serviceops, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10088852 (Krinkle)
[19:40:12] serviceops, MediaWiki-Platform-Team (Radar), Patch-For-Review: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507#10088904 (CDanis) >>! In T372507#10088733, @Scott_French wrote: > In puppet, this comment [2] would suggest it's only needed for fundraising use cases....
[20:05:45] serviceops: Establish a proper process for repacing kafka nodes - https://phabricator.wikimedia.org/T373189#10088967 (CDanis) Just FTR -- we didn't actually saturate anything at all on the Kafka hosts. The hottest that any NIC was was about 70% of line rate, which is warm but not hot. No drops or errors on...
[20:17:36] serviceops, Scap, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Sustainability (Incident Followup): scap should check if it is running within a tmux/screen - https://phabricator.wikimedia.org/T361724#10088983 (thcipriani) p:High→Medium a:dancy
[20:54:17] serviceops, DBA, SRE, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10089050 (Tgr) >>! In T372943#10087406...
[23:54:48] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#10089240 (jwang) Hi, I failed to ssh deployment.eqiad.wmnet. The message I got is `deployment.eqiad.wmnet: Permission denied (publickey)`. I was able to ssh a couple of months ago. Is this related to the deploymen...