[02:01:51] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882396 (BCornwall)
[02:02:46] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882398 (BCornwall) a:VRiley-WMF→BCornwall
[02:03:18] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882400 (BCornwall)
[02:12:01] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882406 (BCornwall)
[02:13:16] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10882408 (BCornwall)
[07:09:20] FYI, there'll be some brief anycast alert spam for magru since I'm switching some VMs away from DRBD
[07:21:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:26:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:35:30] FIRING: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:40:30] RESOLVED: [2x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[08:17:42] Traffic, Regression, xLab: Cookie “WMF-Uniq” has been rejected because it is in a cross-site context - https://phabricator.wikimedia.org/T395958#10882815 (Volans) Removing SRE I/F as we're not involved in the `WMF-Uniq` cookie management.
[10:05:39] Traffic, Liberica, Patch-For-Review: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228#10883092 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d0bc6853-e370-4194-a010-37ddc420e227) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 host(s) and the...
[10:06:14] netops, Infrastructure-Foundations, SRE: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998 (cmooney) NEW p:Triage→Low
[10:06:59] netops, Infrastructure-Foundations, SRE: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998#10883105 (cmooney)
[10:07:03] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10883106 (cmooney)
[10:11:09] netops, Infrastructure-Foundations, SRE: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10883127 (cmooney) Ok thanks guys. Let me see if I can prep a patch to remove it where we currently are. It would clear up my proposed IBGP patch quite a bit...
[11:41:55] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10883377 (ayounsi) Open→Stalled a:ayounsi Going to mark that one as stalled until we can either onboard new device...
[11:43:22] netops, Infrastructure-Foundations, SRE: Export additional network device stats in gnmi - https://phabricator.wikimedia.org/T395998#10883383 (ayounsi) Good idea! in theory not particularly difficult, but we should look at reducing the load (go routines) on the current gNMIc instances first.
[11:57:24] netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10883453 (cmooney) p:Medium→Low >>! In T393996#10861802, @Dwisehaupt wrote: > Just pinging on this. Maintenance week is this week and we are ok for the work to happen when you a...
[12:29:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10883638 (cmooney) Open→Resolved a:cmooney
[13:11:37] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015 (MoritzMuehlenhoff) NEW
[13:13:08] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10883806 (ssingh) Sounds good, thanks. Let me know if I can help with anything.
[14:06:27] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884146 (ssingh)
[14:19:41] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884179 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1002 for hosts: `durum7001.magru.wmnet` - durum7001.magru.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanag...
[14:24:19] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884207 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1002 for hosts: `doh7001.wikimedia.org` - doh7001.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanag...
[14:24:38] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884209 (ssingh)
[14:25:09] Traffic, Ganeti, SRE: Decommission doh7001 and durum7001 - https://phabricator.wikimedia.org/T396015#10884211 (ssingh) @Muehlenhoff: Both of these are decommissioned. Let me know if any other action is required from my end, thanks!
[15:07:11] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10884415 (cmooney) >>! In T385217#10879725, @Jhancock.wm wrote: > @cmooney I'm gonna reply to Jorge's email about boxes a...
[15:37:00] FIRING: [9x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh1001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[15:37:10] ^ that's fine, restart in progress
[15:37:14] reboot rather
[15:42:00] RESOLVED: [9x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh1002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[15:57:00] FIRING: [8x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3004:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[16:02:00] FIRING: [8x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh4001:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[16:02:30] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10884669 (Jhancock.wm) okay cool. I'm gonna unrack them tomorrow and get them boxed. i replied to Nokia's email asking for pac...
[16:07:00] RESOLVED: [7x] AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh4002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[16:08:46] hello again - my rollout of the trafficserver API changes was delayed yesterday, but I'd like to try it today. Would that suit?
[16:10:50] hnowlan: we are in the process of rolling out a change currently and it will be a while before it finishes.
[16:11:07] 70/112, 119s ~ host, so like 1.5 hours more?
[16:11:49] ack, no worries
[16:12:23] thanks, I guess it will be late for you by that time so please check tomorrow. in theory we can combine the two changes but puppet is disabled and slowly being enabled, so probably best to not conflate them
[16:13:11] yeah for sure, no problem
[16:15:51] Traffic, Experimentation Lab, Regression, xLab: Cookie “WMF-Uniq” has been rejected because it is in a cross-site context - https://phabricator.wikimedia.org/T395958#10884747 (dr0ptp4kt)
[19:00:26] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10885491 (ssingh) Commenting on this with my own understanding and for review of others. After that, letting @BCornwall handle updating the task description. IMO the way we...