[00:30:18] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10159245 (Dwisehaupt) T375142 - Expanded the pfw and iptables ranges for prometheus collection so that we can hi...
[01:49:49] Traffic, DC-Ops, ops-esams, SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10159341 (ssingh) @RobH: ` sukhe@cumin1002:~$ sudo cumin 'A:cp' 'dmesg -T | grep -q -i "core temperature is above" && echo "CPU throttled due to high temperature" || echo "CPU is OK"' 112 hosts...
[04:18:14] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151 (Papaul) NEW
[04:18:37] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159412 (Papaul)
[04:23:29] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159413 (Papaul) @Jhancock.wm if you have some time this week or next week can you please check in rack C8 all the servers that have only...
[04:23:36] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159414 (Papaul) p:Triage→Medium
[04:24:55] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159415 (Papaul)
[04:49:28] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159423 (Papaul)
[06:41:15] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159452 (ABran-WMF) all actionnable machines are ready to be depooled. I'll start depooling 20/15min before 16:00 UTC
[07:49:02] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159580 (ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs
[08:28:10] netops, Infrastructure-Foundations: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401#10159658 (ayounsi) There was indeed a connectivity "blips", so they're not monitoring issues. That's for Sept 9th, where we can see it was only from eqiad : https://grafana.wikimedia.org/d/m1...
[09:12:33] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10159801 (ovasileva)
[09:19:30] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10159837 (cmooney) Open→Resolved a:cmooney
[09:33:15] Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10159866 (Vgutierrez) @elukey I think I've found the root cause of this mess, this is the patch: `lang=diff diff --git a/kafka.go b/kafka.go index 4e495ce..b878db8 100644 --- a/kafka.go +++ b/kafka.go @@ -18...
[10:03:08] Traffic, SRE, Patch-For-Review: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506#10159937 (MoritzMuehlenhoff) Open→Resolved a:CDobbins @CDobbins FYI, I'm assigning this to you and resolve it, given you've completed all the work
[10:22:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10160016 (MoritzMuehlenhoff) ganeti2018 is drained
[10:25:46] netops, Infrastructure-Foundations, SRE: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10160028 (cmooney) Open→Resolved In the end we got away without needing this, thanks to data-persistence. I'll close for now and we can re-open if...
[11:09:33] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160179 (ovasileva)
[11:09:36] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160180 (ovasileva)
[11:09:48] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160181 (ovasileva)
[11:46:23] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179 (ayounsi) NEW p:Triage→High
[11:53:17] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10160355 (ayounsi)
[11:53:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160354 (ayounsi)
[11:55:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160375 (cmooney) Resolved→Open
[11:56:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160348 (ayounsi) a:cmooney→ayounsi Blocked on {T365012} to be able to renew the certs. Other than that, manually tested and works as exp...
[11:56:52] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160350 (cmooney) Open→Resolved This has been enabled following the cloudsw upgrades. (see https://gerrit.wikimedia.org/r/c/operations/p...
[12:53:57] Traffic, SRE: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837#10160509 (Daimona)
[14:12:21] Traffic: Supported RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200 (ssingh) NEW
[14:12:36] Traffic: Support RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200#10160969 (ssingh)
[14:12:40] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10160970 (ayounsi) It would indeed be great to have redundancy for the `fmsw`, but as that device is not managed, there i...
[14:53:03] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161341 (jcrespo) ms backups con codfw are stopped. As usual, not asking for priority over my workmates, but if you...
[15:08:08] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10161429 (elukey) This time we have an issue with `sign`, since a certificate is already there. I verified with manual commands and `gencert` works fine. I e...
[15:13:02] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161445 (ssingh) Traffic hosts (cp2041/cp2042) are depooled.
[15:39:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161570 (ABran-WMF) all data-persistence hosts have been depooled and downtimed
[16:03:54] netops, Infrastructure-Foundations, SRE: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216 (cmooney) NEW p:Triage→Low
[16:09:24] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161695 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a040f2d9-1940-4aba-bd29-efa9aeec87fb) set...