[00:30:18] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10159245 (Dwisehaupt) T375142 - Expanded the pfw and iptables ranges for prometheus collection so that we can hi...
[01:49:49] Traffic, DC-Ops, ops-esams, SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10159341 (ssingh) @RobH: ` sukhe@cumin1002:~$ sudo cumin 'A:cp' 'dmesg -T | grep -q -i "core temperature is above" && echo "CPU throttled due to high temperature" || echo "CPU is OK"' 112 hosts...
[04:18:14] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151 (Papaul) NEW
[04:18:37] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159412 (Papaul)
[04:23:29] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159413 (Papaul) @Jhancock.wm if you have some time this week or next week can you please check in rack C8 all the servers that have only...
[04:23:36] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159414 (Papaul) p:Triage→Medium
[04:24:55] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159415 (Papaul)
[04:49:28] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10159423 (Papaul)
[06:41:15] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159452 (ABran-WMF) all actionnable machines are ready to be depooled. I'll start depooling 20/15min before 16:00 UTC
[07:49:02] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10159580 (ops-monitoring-bot) Draining ganeti2018.codfw.wmnet of running VMs
[08:28:10] netops, Infrastructure-Foundations: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401#10159658 (ayounsi) There was indeed a connectivity "blips", so they're not monitoring issues. That's for Sept 9th, where we can see it was only from eqiad : https://grafana.wikimedia.org/d/m1...
[09:12:33] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10159801 (ovasileva)
[09:19:30] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10159837 (cmooney) Open→Resolved a:cmooney
[09:33:15] Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10159866 (Vgutierrez) @elukey I think I've found the root cause of this mess, this is the patch: `lang=diff diff --git a/kafka.go b/kafka.go index 4e495ce..b878db8 100644 --- a/kafka.go +++ b/kafka.go @@ -18...
[10:03:08] Traffic, SRE, Patch-For-Review: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506#10159937 (MoritzMuehlenhoff) Open→Resolved a:CDobbins @CDobbins FYI, I'm assigning this to you and resolve it, given you've completed all the work
[10:22:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10160016 (MoritzMuehlenhoff) ganeti2018 is drained
[10:25:46] netops, Infrastructure-Foundations, SRE: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10160028 (cmooney) Open→Resolved In the end we got away without needing this, thanks to data-persistence. I'll close for now and we can re-open if...
[11:09:33] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160179 (ovasileva)
[11:09:36] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160180 (ovasileva)
[11:09:48] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10160181 (ovasileva)
[11:46:23] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179 (ayounsi) NEW p:Triage→High
[11:53:17] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10160355 (ayounsi)
[11:53:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160354 (ayounsi)
[11:55:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160375 (cmooney) Resolved→Open
[11:56:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160348 (ayounsi) a:cmooney→ayounsi Blocked on {T365012} to be able to renew the certs. Other than that, manually tested and works as exp...
[11:56:52] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add cloudsw to gnmic interface stats collection - https://phabricator.wikimedia.org/T365012#10160350 (cmooney) Open→Resolved This has been enabled following the cloudsw upgrades. (see https://gerrit.wikimedia.org/r/c/operations/p...
[12:53:57] Traffic, SRE: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837#10160509 (Daimona)
[14:12:21] Traffic: Supported RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200 (ssingh) NEW
[14:12:36] Traffic: Support RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200#10160969 (ssingh)
[14:12:40] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10160970 (ayounsi) It would indeed be great to have redundancy for the `fmsw`, but as that device is not managed, there i...
[14:53:03] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161341 (jcrespo) ms backups con codfw are stopped. As usual, not asking for priority over my workmates, but if you...
[15:08:08] netops, CFSSL-PKI, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10161429 (elukey) This time we have an issue with `sign`, since a certificate is already there. I verified with manual commands and `gencert` works fine. I e...
[15:13:02] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161445 (ssingh) Traffic hosts (cp2041/cp2042) are depooled.
[15:39:01] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161570 (ABran-WMF) all data-persistence hosts have been depooled and downtimed
[16:03:54] netops, Infrastructure-Foundations, SRE: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port - https://phabricator.wikimedia.org/T375216 (cmooney) NEW p:Triage→Low
[16:09:24] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161695 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a040f2d9-1940-4aba-bd29-efa9aeec87fb) set...
[16:16:51] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161716 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9d0dd9cc-ca9d-4736-b81c-6f32f4a0772d) set...
[16:22:44] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161819 (cmooney) All hosts have been moved and all now responding to ping again.
[16:28:00] FIRING: [2x] PurgedHighEventLag: High event process lag with purged on cp2037:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:28:18] oh uh
[16:28:41] vgutierrez: it's back
[16:28:52] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161862 (ABran-WMF) d/p instances are repooling
[16:31:15] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161877 (MatthewVernon) ms-nodes all good; thanos-be2004 seems OK (but checking that picked up an unrelated replica...
[16:32:44] sukhe: yes.. that's teh current bug
[16:32:49] the one addressed on purged 0.24
[16:33:02] I'll upgrade and restart purged on cp2037
[16:33:12] ok (I saw you did 2038 already)
[16:33:17] 4038
[16:33:41] right, ok
[16:33:48] so only 4038 so far?
[16:34:04] yeah, I built purged 0.24 this morning
[16:34:17] ok let me know if you need me to handle something, I am here
[16:34:41] done, thx for pinging
[16:37:38] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10161888 (jcrespo) Resumed ms backups on codfw.
[16:38:00] FIRING: [3x] PurgedHighEventLag: High event process lag with purged on cp2037:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[16:38:17] ^ known/addressed
[16:41:02] cp4043 now
[16:41:16] they are messing with the kafka cluster somehow
[16:41:53] yep
[16:42:20] ok, I'll update purged on ulsfo/codfw/eqsin
[16:42:25] wish me luck :)
[16:44:45] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10161925 (cmooney) >>! In T373942#10154484, @Dwisehaupt wrote: > Am I correct in assuming the checks from the ne...
[16:44:55] vgutierrez: if you break anything, we will pick up the pieces
[16:53:00] RESOLVED: [4x] PurgedHighEventLag: High event process lag with purged on cp2037:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:28:11] Traffic, Abstract Wikipedia team: Wikifunctions is down - https://phabricator.wikimedia.org/T374318#10162134 (DVrandecic) I wonder if we can raise that ban again? I think the crawler in combination with T374241 was causing the site instability issue. I would suggest that we de-ban the bot, and see if...
[17:29:20] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10162152 (Dwisehaupt) @cmooney Thanks for the follow up, it's all cleared up now. Most of this came from my conf...
[17:55:52] netops, Infrastructure-Foundations, SRE: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10162292 (cmooney) Just a note on this task to say that I was able to perform some throughput tests on the old asw-d-codfw devices (QFX5100) which have t...
[17:56:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10162293 (cmooney) Open→Resolved a:cmooney All done with this. Big thanks for @Jhancock.wm for the amazing work m...
[17:57:10] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#10162302 (cmooney) >>! In T360789#9941103, @Papaul wrote: > All the cabling is done. I am leaving this task open so when we move the console cables from a...
[17:57:26] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10162299 (cmooney) Open→Resolved a:cmooney
[17:58:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162308 (cmooney) @Jhancock.wm thanks for doing this. I have completed my testing now on the old switch (thankfully all went well). So thi...
[18:19:30] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10162355 (cmooney) Actually just checking it's still at status "planned" in Netbox. And looking at puppetboard it seems it never got added p...
[18:58:23] Traffic, DC-Ops, ops-esams, SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162431 (RobH) I think they'll want a dump of the dmesg directly for the CPU temperature incidents so we can point at where it had to throttle down at exact dates/time, since now they are saying...
[19:20:52] Traffic, DC-Ops, ops-esams, SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162474 (ssingh) Hi @RobH: Sharing the `cumin` command the output so that you have some timestamps (UTC) ready to go (`esams` one is at the end but I am just dumping all for later use): ` sukhe...
[19:58:39] Traffic, DC-Ops, ops-esams, SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10162557 (RobH) I've sent over the log output for the two esam hosts to their respective support email threads, lets see what they say! Thank you!
[20:57:35] Traffic: Write a cookbook that performs a rolling restart of HAProxy - https://phabricator.wikimedia.org/T375232 (CDobbins) NEW
[21:52:22] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10162880 (Dwisehaupt)