[02:14:22] Traffic, SRE: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781#11769898 (Pppery)
[08:07:11] Traffic, collaboration-services, Gerrit, Release-Engineering-Team: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11770806 (ABran-WMF)
[08:33:23] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827#11770925 (ABran-WMF) after merging [[ https://gerrit.wikimedia.org/r/1265322 | that config ]] change, [[ https...
[08:45:42] Traffic, Liberica, Prod-Kubernetes, Kubernetes, ServiceOps new (Next quarter): Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436#11771019 (JMeybohm) `wikikube-worker2347.codfw.wmnet` is fine after a ferm restart. For whatever reason it was unable...
[09:44:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2041 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[09:45:20] ^^ me, silencing
[09:49:43] RESOLVED: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2041 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2041 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[10:08:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp1111:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching -
https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1111&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:09:13] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2042 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2042 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[10:13:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp1111:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1111&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:19:04] RESOLVED: [2x] HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp2041 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[12:04:42] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 2 others: Increase visibility of kubernetes network status - https://phabricator.wikimedia.org/T356877#11772206 (JMeybohm) Stalled→Open p:Triage→Medium We had some issues that could have been surfaced by this...
[12:17:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp1111:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1111&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:22:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp1111:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1111&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:45:48] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11772430 (Ladsgroup) And poster of videos have been broken now. I fixed them in https://gerrit.wikimedia.org/r/c/mediawiki/extensio...
[13:06:01] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11772577 (Ladsgroup) ugh, I mistook normaliseParams with getSteppedThumbWidth
[15:37:58] Traffic, Liberica, Prod-Kubernetes, Kubernetes, ServiceOps new (Next quarter): Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436#11773729 (JMeybohm) >>! In T420436#11771019, @JMeybohm wrote: > On `wikikube-worker1347.eqiad.wmnet` the istio-ingress...
[15:39:04] Traffic, Liberica, Prod-Kubernetes, Kubernetes, ServiceOps new (Next quarter): Migrate Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436#11773748 (JMeybohm)
[15:41:05] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Increase visibility of kubernetes network status - https://phabricator.wikimedia.org/T356877#11773772 (JMeybohm)
[16:23:32] hello Traffic, I need to move back Gerrit Gitiles traffic from the replica host back to the primary.
[16:23:32] The reason is the replica does not have the user sessions (which I knew) and it turns out we have some repository that requires to be logged in for viewing.
[16:23:32] The change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1265465
[16:23:32] (I poked SRE collab but they are out at this time)
[16:27:22] brett: thanks!
[16:39:50] hashar: not a big deal for us to merge. can you get a review from someone on your team as well?
[16:40:30] sukhe: that is for SRE collab and they are not there
[16:41:07] I am not sure anyone in my team has knowledge about the Gerrit/Traffic. But I can try Tyler
[16:41:40] arnaudb: ^ if you are still around
[16:42:34] he told me he went out for sport, and at 7pm I guess is out there
[16:42:39] yeah fair :)
[16:42:55] +1
[16:43:10] but I am around to monitor for a bit :]
[16:43:10] I have a little gerrit / gerrit-replica state :)
[16:43:40] ok thanks cdanis
[16:43:50] switching Gitiles traffic to a replica is something I have wanted to do for a while
[16:43:54] hashar: merging now
[16:43:58] cool!
[16:43:59] OK to continue?
[16:44:08] sukhe: 112
[16:44:13] ha!
[16:44:17] :)
[16:49:50] I have opened some dashboard to monitor the switch on the Gerrit side
[16:50:18] hashar: rolling out with -b31
[16:55:34] looks like it is working
[16:55:44] (Gitiles shows me I am not logged in)
[16:56:38] oh you must be hitting drmrs
[16:56:41] yep, rolled out there
[16:57:59] oh
[16:58:01] yeah sorry
[16:58:12] I do show as logged in
[16:58:23] (I really have an issue with boolean logic sometimes)
[16:59:50] cdanis: sukhe: thanks for stepping in and rolling that change!
[17:00:08] that fixes the issue of browsing private repos
[17:00:19] I'll keep monitoring Gerrit for a bit
[17:00:20] hashar: no worries, hth
[17:00:31] and circle back with Arnaud tomorrow morning
[17:06:23] Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#11774409 (ssingh) Before we can start rolling this out to all DNS hosts, there is some additional work that needs to be done: - pdns-rec 5.x is only available on t...
[17:06:26] sukhe: cdanis: at quick glance Gerrit/Gitiles look all good thank you! I'll recheck later tonight
[17:06:36] Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#11774412 (ssingh)
[17:06:48] thanks :)
[17:07:35] thanks both
[17:53:03] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11774687 (BCornwall)
[19:26:54] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11775067 (BCornwall)
[19:33:56] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:03] hm
[19:38:44] what happened?
[19:38:51] Traffic, SRE: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11775112 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `hcaptcha[1001-1002,2001-2002].wikimedia.org` - hca...
[19:38:55] cjd91: I see you're on there
[19:40:11] cjd91: please respond
[19:40:24] What method were you doing for rebooting it?
[19:40:25] it seems to have resolved
[19:40:32] probably the reboot
[19:40:38] via the autorestart service?
[19:40:41] it hasn't been rebooted
[19:43:06] I see https://sal.toolforge.org/log/2bdmRZ0BffdvpiTrlFzN and then a failure later.
[19:43:26] hm
[19:43:31] uh, it just rebooted
[19:44:48] I was using the cookbook, but I'd forgotten that the cookbook (sre.loadbalancer.admin) takes care of depooling, so I'd depooled first and then tried to reboot
[19:48:56] RESOLVED: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs6002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:15:48] Traffic, SRE: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11775191 (BCornwall) The LVS service has been removed, the hosts decommissioned, and the hcaptcha_proxy module removed from puppet. I'm not sure that any...
[20:28:01] Traffic, ServiceOps-Services-Oids, Product Safety and Integrity (Sprint Forsythia (Mar 23 - Apr 10))), ServiceOps new (Next quarter), WE4.2 Bot detection (WE4.2 hCaptcha editing trial): hCaptcha: Stop using urldownloader for health checks of th...
- https://phabricator.wikimedia.org/T421464#11775265
[20:38:07] Traffic, Data-Engineering, MW-Interfaces-Team, OKR-Work: Log Api-User-Agent header in Turnilo - https://phabricator.wikimedia.org/T373871#11775321 (LucasWerkmeister) >>! In T373871#11208344, @HCoplin-WMF wrote: > Updating as low priority since we don't think anyone is actually using it right now...