[00:17:25] 10SRE, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh) [00:18:59] !log rebooting Cassandra on sessionstore1001 — T327954 [00:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:03] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [00:24:02] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2023-03-28 00:00:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:28:00] (03PS3) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243 [00:28:58] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [00:29:38] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [00:39:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560 [00:39:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560 (owner: 10TrainBranchBot) [00:50:07] !log rebooting sessionstore1001 — T327954 [00:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:12] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [00:52:06] (03PS1) 10Kevin Bazira: ml-services: Add trwiki editquality isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/905561 
(https://phabricator.wikimedia.org/T334158) [00:56:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560 (owner: 10TrainBranchBot) [01:41:53] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:47] !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided) [02:07:09] !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 21s) [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:34:30] (03PS1) 10KartikMistry: Enable Section Translation on Kashmiri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541) [05:05:42] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:05:44] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:22] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:42] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:42] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:04] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:04] RECOVERY - Router interfaces on cr2-codfw is 
OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:10] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:53] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0600) [06:00:06] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0600). [06:17:44] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) I don't know what has happened over the night but the zuul-merger service started alarming over night: ` Notificatio... 
[06:19:05] (03PS1) 10Hashar: zuul: disable monitoring for disabled merger service [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) [06:19:47] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [06:27:10] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10hashar) [06:27:54] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10hashar) 05Open→03Resolved That one has been solved after I have found... [06:58:04] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/906307/1701/" [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [06:58:53] if someone could please puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/906307/ , that will remove an erroneous alarm for contint2002 zuul-merger service which is intentionally disabled but still has a monitoring enabled :) [07:00:06] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0700) [07:00:06] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:21] I should ask in sre ;) [07:00:22] morning! there are no trainees signed up for the window and just the one patch scheduled. kart is not apparently here yet, let's see if they want to self-deploy as usual, when they do arrive. [07:03:55] ah. Sorry for late joining. 
[07:03:57] welcome kart_! are you self-deploying today?
[07:04:04] no trainees signed up so ....
[07:04:26] apergos: Yes :)
[07:04:41] I am taking a break, will show up for the mediawiki train in an hour
[07:04:42] great! go ahead when ready
[07:05:51] (I realized that I've put the wrong Gerrit link. Fixed it)
[07:06:20] oh! looking now
[07:06:50] yes ok, that seems like a much smaller change for a backport window :-D
[07:07:32] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541) (owner: KartikMistry)
[07:07:41] apergos: :D
[07:08:21] (Merged) jenkins-bot: Enable Section Translation on Kashmiri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541) (owner: KartikMistry)
[07:09:43] !log kartik@deploy2002 Started scap: Backport for [[gerrit:906137|Enable Section Translation on Kashmiri Wikipedia (T326541)]]
[07:09:47] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541
[07:10:41] SRE, Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (elukey) >>! In T334078#8759196, @Ottomata wrote: > From a brief glance, those look like normal consumer reassignment messages. Probably shouldn't be alerts. @Ottomata I thought so yes, but I got a...
[07:11:13] !log kartik@deploy2002 kartik: Backport for [[gerrit:906137|Enable Section Translation on Kashmiri Wikipedia (T326541)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:16:53] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10ayounsi) p:05Triage→03Low [07:16:56] !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Abuse filter maintainer" "Abuse filter maintainers" "Zabe" --reason "per request [[:phab:T334147|T334147]]" [07:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:01] T334147: Request to move translatable page: Abuse filter maintainer - https://phabricator.wikimedia.org/T334147 [07:19:15] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:906137|Enable Section Translation on Kashmiri Wikipedia (T326541)]] (duration: 09m 31s) [07:19:18] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541 [07:25:20] (03CR) 10Elukey: [C: 03+2] zuul: disable monitoring for disabled merger service [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [07:28:22] !log installing ghostscript security updates [07:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:20] apergos: I'm done. Sorry for bit late reply. [07:31:15] (03CR) 10Elukey: [C: 03+2] "Deploy anytime! 
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905561 (https://phabricator.wikimedia.org/T334158) (owner: 10Kevin Bazira) [07:31:32] no worries, as long as you don't forget completely :-D [07:31:48] !log UTC morning backport and config training window done [07:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:06] apergos: :) [07:39:54] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (10ayounsi) Thanks for the feedback! > Weighing this against the costs of maintaining them properly, that's the big question here. Indeed :) I opened... [07:47:56] (03PS5) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) [07:53:57] (03PS1) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) [07:53:58] 10SRE, 10Infrastructure-Foundations, 10netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (10cmooney) 05Resolved→03Open Re-opening as there are some EVPN elements outside the 'protocols bgp' context that also need to be added. Will submit patch. [07:56:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:56:58] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [07:58:32] (03PS6) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) [08:00:06] hashar and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0800) [08:00:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40551/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [08:00:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:01:52] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209) [08:01:54] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:02:35] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot) [08:03:05] * Lucas_WMDE waves farewell to IE11 [08:05:13] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to the extent a Partman recipe can look good" [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [08:08:25] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (10cmooney) That codfw error is interesting actually, it makes me wonder why we have the "no-resolve" command on those routes? Without that the error wo... 
[08:08:51] !log restarting update-ubuntu-mirror.service on mirror1001 to check if it was a transient error
[08:08:53] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:01] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.3 refs T330209
[08:09:05] T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209
[08:09:39] (CR) Jelto: [C: +2] "thanks for the review! I'll test and re-image gitlab2003 with the new partman config" [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[08:10:01] (CR) David Caro: kubernetes: set NO_HOME for bulidservice (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/901129 (owner: David Caro)
[08:10:10] (CR) Filippo Giunchedi: [C: +2] sre: mute etcd-mirror pint promql checks [alerts] - https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:10:32] (CR) Filippo Giunchedi: [C: +2] hieradata: move varnishkafka-exporter stats to counters [puppet] - https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: Filippo Giunchedi)
[08:10:39] (CR) Jelto: [C: +2] install_server: simplify gitlab disk layout, drop lvm, use four SSDs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: Jelto)
[08:10:55] jelto: I merged you change too
[08:11:00] your change even
[08:11:08] godog: thanks a lot!
:) [08:11:20] sure np [08:16:41] (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: ignore 'status' label pint check [alerts] - 10https://gerrit.wikimedia.org/r/906020 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:21:31] (03PS2) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [08:23:10] (03PS4) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 [08:23:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40552/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [08:23:19] (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [08:23:27] (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro) [08:24:30] (03CR) 10MVernon: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) (owner: 10Eevans) [08:27:07] hashar: Is there anything in that release that could explain a very low opcache hit ratio in your opinion? [08:27:30] It may just be that it needs to rebuild, but we're starting to warn heavy [08:28:30] claime: opcache? the php bytecodes one? 
[08:28:37] hashar: yeah
[08:28:39] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[08:29:11] https://grafana.wikimedia.org/goto/DmpVatYVz?orgId=1
[08:29:32] 08:09:01 Finished php-fpm-restarts (duration: 02m 37s)
[08:29:35] that is all I know :]
[08:30:15] We'll wait and see if it goes back up
[08:30:53] I don't know anything about the caches anymore, I have long forgotten or lost track of all the changes that happened on that front
[08:31:49] maybe it is typical for a Thursday deploy as we get so much high traffic / lots of different code paths being newly loaded
[08:34:11] over 9 days I see a similar fall for the scap wikiversions last Thursday https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-application-php?orgId=1&from=now-9d&to=now&viewPanel=33
[08:35:13] maybe because we never invalidate opcache keys until php-fpm is restarted by scap
[08:35:24] Probably yeah
[08:35:57] and ideally one day someone will figure out why the opcache gets corrupted or what kind of race condition we suffer from :D
[08:36:03] I'll keep an eye on it
[08:36:38] if the response times from the app servers backend stay similar, I think it is all fine
[08:37:25] Number of affected appservers is going down
[08:37:56] So I guess it just takes some time to replenish opcache after the restart
[08:39:05] latency increased a bit but that's kind of expected
[08:39:28] (and is going down anyways)
[08:39:40] (PS1) Filippo Giunchedi: sre: mute puppet-ca pint checks for missing series [alerts] - https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182)
[08:39:57] actually it went up a bit for parsoid, but not for appservers
[08:40:16] !log powercycle ml-serve2004 - host frozen, racadm getsel shows multi-bit errors in various DIMM slots
[08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:11] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[08:43:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:46:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:46:38] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:47:32] (03PS3) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [08:51:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:52:03] (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [08:52:15] (03PS4) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [08:53:17] (03CR) 10Elukey: "Folks the PCC output is consistent for all nodes, but it varies for kafka logging since we already removed the pki migration config after " [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [08:56:49] (03CR) 10JMeybohm: rest-gateway: add helmfile, enable 
mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [08:58:56] jouncebot: nowandnext [08:58:56] For the next 1 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0800) [08:58:56] In 1 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000) [08:58:57] In 1 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000) [09:00:10] hashar: You're done with the train right? I can deploy a config change? [09:08:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [09:09:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [09:09:41] (03PS4) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) [09:09:51] Mpf, rebase >_> [09:11:08] (03CR) 10TrainBranchBot: "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [09:12:01] (03Merged) 10jenkins-bot: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [09:12:16] !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]] [09:12:20] T333528: Increase memory_limit for jobrunners to $wmgMemoryLimitParsoid 
- https://phabricator.wikimedia.org/T333528 [09:13:30] (Emergency syslog message) firing: (2) Alert for device ssw1-e1-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:13:37] !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:15:27] topranks: FYI ^^^ ssw1-e1-eqiad.mgmt.eqiad.wmnet [09:18:30] (Emergency syslog message) resolved: (2) Device ssw1-e1-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:19:27] !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]] (duration: 07m 11s) [09:19:32] T333528: Increase memory_limit for jobrunners to $wmgMemoryLimitParsoid - https://phabricator.wikimedia.org/T333528 [09:21:21] (03PS2) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) [09:21:38] volans: ack, thanks [09:21:57] just added to monitoring, perhaps should have left alarms off but good test :P [09:22:16] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [09:23:58] (03CR) 10David Caro: [C: 03+1] "Neat, I got a bit carried out with the refactor, so let me know if you prefer merging this, and then doing the refactor (that I can do if " [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [09:24:27] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) I don't manage that spreadsheet, so I have no idea :) If that doesn't work we can easily 
switch to do the match on the Serial number column, that seems hardcoded for... [09:25:21] no prob :D [09:26:57] (03PS3) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) [09:30:33] !log kafka main codfw cluster migrated to PKI TLS certs for brokers - T319372 [09:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:38] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [09:31:51] 10SRE, 10serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (10elukey) Last steps: * clean up certs in puppet private * verify if any change is needed in deployment-prep [09:31:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Volans) FYI there is already a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/restart-pybal.... [09:37:44] (03PS9) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [09:38:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [09:38:47] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:39:32] (03PS1) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) [09:39:45] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:39:50] (03CR) 10MVernon: "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [09:41:14] (03PS1) 10Elukey: kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 [09:42:41] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:42:45] (03PS1) 10Filippo Giunchedi: hieradata: rename aux-k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/906539 (https://phabricator.wikimedia.org/T334192) [09:43:27] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:44:13] (03PS1) 10Cathal Mooney: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) [09:45:27] (03CR) 10JMeybohm: [C: 03+1] "With my limited understanding, this lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [09:46:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:46:56] (03CR) 10JMeybohm: [C: 03+1] Add upstream release 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [09:48:13] (03CR) 10David Caro: smart_data_dump: adapt for newer ssacli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [09:48:36] (03CR) 10JMeybohm: [C: 03+1] istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [09:48:46] (03CR) 
10Muehlenhoff: aptrepo: go with Grafana 9 only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi) [09:51:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:55:59] (03CR) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [09:56:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:56:30] (03CR) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi) [09:56:32] (03PS2) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) [09:56:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:56:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46083 and previous config saved to /var/cache/conftool/dbconfig/20230406-095640-ladsgroup.json [09:56:44] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46084 and previous config saved to /var/cache/conftool/dbconfig/20230406-095800-ladsgroup.json [10:00:05] mvolz: It is that lovely time of the day again! 
You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000) [10:10:37] (03CR) 10JMeybohm: [C: 03+1] rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [10:10:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi) [10:13:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P46085 and previous config saved to /var/cache/conftool/dbconfig/20230406-101306-ladsgroup.json [10:13:26] (03PS1) 10Elukey: admin_ng: bump max quota for ml-serve namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/906541 [10:13:36] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1003.mgmt.eqiad.wmnet with reboot policy FORCED [10:14:27] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi) [10:15:43] (03PS5) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [10:16:29] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/906541 (owner: 10Elukey) [10:22:35] (03CR) 10Elukey: [C: 03+2] admin_ng: bump max quota for ml-serve namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/906541 (owner: 10Elukey) [10:23:50] PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [10:25:08] RECOVERY - Host blog.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 23.80 ms [10:25:59] (03CR) 10Muehlenhoff: Add an in place Debian upgrade script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [10:26:07] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:26:13] (03CR) 10Volans: "Nice! I've left few minor nits/possible improvement, none of them is a blocker. The rest LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [10:26:53] (03CR) 10Volans: [C: 03+1] "Lovely!!!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey) [10:27:06] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:27:13] (03CR) 10Elukey: [C: 03+2] kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey) [10:27:47] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:28:13] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P46086 and previous config saved to /var/cache/conftool/dbconfig/20230406-102813-ladsgroup.json [10:29:38] (03CR) 10Volans: "Thanks for the replies, I don't want to be a blocker." 
[puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [10:30:53] (03Merged) 10jenkins-bot: kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey) [10:36:22] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:36:34] PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:36:44] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:37:28] * volans checking calendar [10:38:09] (03PS9) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [10:38:11] (03PS6) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [10:39:02] (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [10:39:40] (03CR) 10Muehlenhoff: "After some more investigation I think I know the issue: cloudvirt1019/cloudvirt1020 are unicorns since they are the only two remaining two" [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [10:39:50] RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:40:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1003.mgmt.eqiad.wmnet with reboot policy FORCED [10:41:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host 
cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [10:41:14] mmmh there is a maintenance but seems a different cable [10:41:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [10:42:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr) [10:42:56] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:43:18] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46087 and previous config saved to /var/cache/conftool/dbconfig/20230406-104319-ladsgroup.json [10:43:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:24] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:43:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:44:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:44:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:44:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 
clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:44:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46088 and previous config saved to /var/cache/conftool/dbconfig/20230406-104435-ladsgroup.json [10:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46089 and previous config saved to /var/cache/conftool/dbconfig/20230406-104644-ladsgroup.json [10:50:47] (03PS10) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [10:53:00] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:53:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:53:38] (03CR) 10MVernon: "Thanks again!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [10:54:40] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:54:46] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:07] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [11:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P46090 and previous config saved to /var/cache/conftool/dbconfig/20230406-110151-ladsgroup.json [11:03:06] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:03:28] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:09:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 780 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:16] ^^ problem on transit cct from codfw to eqdfw, not sure it should cause the atlas alert though [11:15:08] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:15:44] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 25 probes of 780 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:16:26] RECOVERY - 
BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P46091 and previous config saved to /var/cache/conftool/dbconfig/20230406-111657-ladsgroup.json [11:21:28] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:21:48] PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:23:57] (03PS1) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) [11:26:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff) [11:27:45] (03CR) 10JMeybohm: [C: 04-1] "I think the istio ingress module might get you in trouble here, at least the staging part of it. It was developed with wikikube only in mi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [11:28:10] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:28:28] RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:29:04] ^^ this was very odd, different provider, Dallas area being common though. [11:29:27] what was strange is that IPv6 was working, and OSPF was up, but BFD was down and IPv4 wasn't working [11:29:33] has come back now [11:29:59] I'd be slightly worried there is a bad secondary path we got flipped to due to some wan re-routing. 
[11:31:22] (03CR) 10Volans: [C: 03+1] "LGTM cookbook/python wise. As for the swift logic I'll leave it to the swift experts." [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [11:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46092 and previous config saved to /var/cache/conftool/dbconfig/20230406-113203-ladsgroup.json [11:32:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:32:08] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:32:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:32:23] topranks: there was a planned work in the calendar, not sure if it can be related [11:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46093 and previous config saved to /var/cache/conftool/dbconfig/20230406-113226-ladsgroup.json [11:34:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46094 and previous config saved to /var/cache/conftool/dbconfig/20230406-113436-ladsgroup.json [11:39:34] volans: thanks yeah I seen that, shouldn't be related to any of these based on the info [11:41:04] things have been stable for past ~10mins anyway, ripe atlas probes are back at same success level as previous [11:41:14] (03PS1) 10Muehlenhoff: Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695) [11:41:14] I'll continue to keep an eye on things [11:41:43] ack, thx [11:41:48] lmk if we (oncall) can help [11:49:42] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P46095 and previous config saved to /var/cache/conftool/dbconfig/20230406-114942-ladsgroup.json [11:49:45] (03CR) 10David Caro: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff) [11:58:00] (03PS1) 10Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) [12:03:42] (03CR) 10Ayounsi: [C: 03+1] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P46096 and previous config saved to /var/cache/conftool/dbconfig/20230406-120448-ladsgroup.json [12:08:37] (03CR) 10DCausse: "\o/" [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey) [12:09:18] (03CR) 10Ayounsi: Add EVPN protocol config for enabled L3 switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [12:11:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [12:13:41] (03PS1) 10Muehlenhoff: zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564 [12:14:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff) [12:14:47] (03PS1) 10Jelto: install_server: hard code raid sizes for gitlab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906565 
(https://phabricator.wikimedia.org/T330172) [12:16:30] (03CR) 10Jelto: "as discussed in IRC, moving from relative to absolute raid sizes for GitLab" [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:19:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46097 and previous config saved to /var/cache/conftool/dbconfig/20230406-121955-ladsgroup.json [12:19:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [12:20:00] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:20:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [12:20:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46098 and previous config saved to /var/cache/conftool/dbconfig/20230406-122018-ladsgroup.json [12:21:52] (03PS2) 10Muehlenhoff: zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564 [12:22:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46099 and previous config saved to /var/cache/conftool/dbconfig/20230406-122229-ladsgroup.json [12:23:07] (03CR) 10Muehlenhoff: "No idea why PCC is marked as failing, the result seems all fine to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [12:23:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff) [12:25:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:25:08] (03CR) 10Muehlenhoff: [C: 03+1] "Let's give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:25:10] (03CR) 10Hashar: [C: 03+1] "Ah nice. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff) [12:26:18] !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly) [12:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:50] (03CR) 10Muehlenhoff: [C: 03+2] zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff) [12:29:51] (03CR) 10Jelto: [C: 03+2] install_server: hard code raid sizes for gitlab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:32:29] (03CR) 10Ayounsi: "Some initial comments." 
[puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney) [12:33:36] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder) [12:35:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:35:07] (03PS4) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) [12:37:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P46100 and previous config saved to /var/cache/conftool/dbconfig/20230406-123735-ladsgroup.json [12:41:16] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [12:48:01] (03Abandoned) 10David Caro: smart_data_dump: adapt for newer ssacli [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [12:49:23] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff) [12:50:43] !log import grafana 9.4 T317887 [12:50:46] (03PS1) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 [12:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887 [12:51:10] (03CR) 10CI reject: [V: 04-1] zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 
(owner: 10Muehlenhoff) [12:52:14] (03CR) 10Elukey: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [12:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P46101 and previous config saved to /var/cache/conftool/dbconfig/20230406-125242-ladsgroup.json [12:52:49] (03PS2) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 [12:53:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Clement_Goubert) FWIW, the cookbook can be used, but it needs to be given the actual lvs servers to run on. Assuming `lvs1020` and `lvs2010` are secondaries, `lvs1... [12:55:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff) [12:59:31] (03PS1) 10Elukey: Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) [13:00:02] (03CR) 10Elukey: "Already imported the pristine/upstream release and pushed to gerrit." [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1300). [13:00:05] mazevedo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] hi! [13:00:27] I can deploy in 15-30 minutes if no one else is around [13:00:38] ok, thanks! [13:03:03] (03PS3) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 [13:04:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Volans) Thanks for the clarification @Clement_Goubert [13:04:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff) [13:05:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 (10Clement_Goubert) [13:07:44] (03PS3) 10Clément Goubert: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334060) [13:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46102 and previous config saved to /var/cache/conftool/dbconfig/20230406-130749-ladsgroup.json [13:07:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:07:54] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:08:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46103 and previous config saved to /var/cache/conftool/dbconfig/20230406-130812-ladsgroup.json [13:08:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 
(10Clement_Goubert) [13:08:57] (03PS4) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 [13:09:01] (03PS4) 10Clément Goubert: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204) [13:09:34] (03PS1) 10Filippo Giunchedi: data-engineering: fix varnishkafka metric names, deploy to all sites [alerts] - 10https://gerrit.wikimedia.org/r/906574 (https://phabricator.wikimedia.org/T309182) [13:09:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff) [13:09:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [13:10:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46104 and previous config saved to /var/cache/conftool/dbconfig/20230406-131022-ladsgroup.json [13:10:27] (03CR) 10Elukey: "now that I think about it, this may affect deployment-prep's settings (and maybe pontoon ones). 
Should we set profile::kafka::broker:use_p" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [13:10:53] (03PS1) 10Atieno: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) [13:11:37] (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: fix varnishkafka metric names, deploy to all sites [alerts] - 10https://gerrit.wikimedia.org/r/906574 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:15:45] (03CR) 10Hashar: [C: 03+1] "Looks better indeed :)" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff) [13:19:27] (03CR) 10Muehlenhoff: [C: 03+2] zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff) [13:22:02] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add upstream release 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [13:22:10] (03PS5) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) [13:22:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:22:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:22:42] * volans here [13:22:57] here too, checking [13:23:01] acked [13:23:06] thank you! 
[13:23:30] btw the link to alerts doesn't work [13:23:47] https://librenms.wikimedia.org/bill/bill_id=28/ [13:23:51] hello analytics [13:24:02] (03CR) 10JHathaway: [C: 03+1] hieradata: rename aux-k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/906539 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi) [13:24:04] (03CR) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [13:24:21] XioNoX: I was about to ask :) [13:24:22] XioNoX: do you have already a hostname? [13:24:51] volans: many I'd guess [13:25:06] (03CR) 10Kamila Součková: "Just kamila doing her best "confused kamila" impression" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [13:25:11] a bunch of an-presto looks like, https://librenms.wikimedia.org/device/160/ports [13:25:24] analytics1* [13:25:27] https://librenms.wikimedia.org/device/device=160/tab=port/port=14063/ for example [13:25:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P46106 and previous config saved to /var/cache/conftool/dbconfig/20230406-132528-ladsgroup.json [13:25:34] they all contribute [13:25:48] probably worth checking who ran a large job [13:25:48] https://yarn.wikimedia.org/cluster/app/application_1678266962370_166166 started 41 mins ago, does it match? [13:25:51] indeed [13:25:52] (more or less) [13:26:09] checking [13:26:23] seems close enough [13:26:29] seems allocating a ton of resources [13:26:39] elukey: didn´t it start 41hours ago ? 
[13:26:48] Elapsed: 41hrs, 17mins, 11sec [13:26:51] claime: yes you are right, sorry [13:26:53] I see traffic rising starting at 12 utc [13:27:06] but it may have started to use resources only recently [13:27:23] elukey: fair enough, I was just sanity checking myself x) [13:27:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:27:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:27:39] between 12 and 12:05 the start yes [13:27:40] claime: thanks a lot for double checking :) [13:27:52] (03CR) 10Hokwelum: [C: 03+1] "looks good! Thank you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25) [13:28:43] it is probably that job, it is using a ton of executors https://yarn.wikimedia.org/proxy/application_1678266962370_166166/executors/ [13:29:13] memory used, disk used, cores... and not network used :D [13:29:29] *and no [13:29:50] is it possible to know who ran it? 
[13:30:00] yes the username is aitolkyn [13:30:18] (03CR) 10JHathaway: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [13:30:24] * Lucas_WMDE here [13:30:27] probably a collaborator, I see a gmail email list [13:30:31] *listed [13:30:38] (03PS10) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [13:31:23] the traffic did not go back to normal values yet fwiw [13:31:34] although the page recovered [13:31:51] ok, then I’ll hold off on the deployment (fyi mazevedo) [13:32:11] (03PS3) 10Daniel Kinzler: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) [13:32:20] (03CR) 10CI reject: [V: 04-1] Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:32:26] we can kill it easily if anything happens [13:32:29] Cc: steve_munene: --^ [13:32:37] volans: FWIW the link I think works, though the page already recovered a minute later from https://librenms.wikimedia.org/device/device=160/tab=logs/section=eventlog/ [13:32:45] Lucas_WMDE ack [13:33:15] godog: but we got the recovery 5m after [13:33:43] (03CR) 10JMeybohm: [C: 03+1] Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [13:34:01] is it possible to do something else about it than kill it? 
[13:34:05] (03CR) 10Hnowlan: [C: 03+1] blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [13:34:12] (03PS1) 10Andrew Bogott: cloud-vps: only backup toolforge things every other day [puppet] - 10https://gerrit.wikimedia.org/r/906579 [13:34:23] XioNoX: not that I know [13:34:43] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [13:34:46] do we have any kind of rate-limiter that could be applied? [13:34:57] volans: yeah I'm not sure why yet re: recovery not being immediate [13:35:34] (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [13:36:35] (03CR) 10Ayounsi: "A few comments inline. Overall lgtm." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [13:39:16] RECOVERY - Check systemd state on contint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:37] the traffic has gone a bit down [13:40:33] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [13:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P46108 and previous config saved to /var/cache/conftool/dbconfig/20230406-134035-ladsgroup.json [13:41:47] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) [13:42:22] (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 
(https://phabricator.wikimedia.org/T321309) [13:44:13] (03CR) 10Cathal Mooney: "Thanks for the review, yeah makes sense for it to be a file. As for the extention change that'll require more work as it'll mean changing" [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney) [13:44:44] PROBLEM - HTTPS on clouddumps1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [13:45:23] (03CR) 10Ayounsi: [C: 03+1] "Nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [13:46:24] RECOVERY - HTTPS on clouddumps1001 is OK: SSL OK - Certificate dumps.wikimedia.org valid until 2023-06-05 08:49:55 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [13:50:29] (03PS1) 10Filippo Giunchedi: Make deploy-tag compulsory [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) [13:51:23] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs6003.drmrs.wmnet [13:55:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46109 and previous config saved to /var/cache/conftool/dbconfig/20230406-135541-ladsgroup.json [13:55:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:55:45] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:55:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:56:04] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46110 and previous config saved to /var/cache/conftool/dbconfig/20230406-135604-ladsgroup.json [13:56:28] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) [13:56:48] !log rebooting sessionstore1001 — T327954 [13:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:52] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [13:57:40] I lost track of the channel for a bit – would it be okay to deploy a config change now? [13:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46111 and previous config saved to /var/cache/conftool/dbconfig/20230406-135813-ladsgroup.json [13:58:26] Lucas_WMDE: traffic's back to normal, I think you can go ahead [13:58:26] (03PS2) 10Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) [13:58:30] ok thanks [13:58:52] mazevedo: if you’re still around I can deploy the config change now [13:59:01] (the backports window is almost over but there’s nothing immediately after it) [14:00:47] hey [14:00:52] still here! [14:00:54] ok! 
[14:01:23] (03PS4) 10Lucas Werkmeister (WMDE): Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo) [14:01:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo) [14:01:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs6003.drmrs.wmnet [14:02:16] (03CR) 10Elukey: [V: 03+2 C: 03+2] Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [14:02:22] (03Merged) 10jenkins-bot: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo) [14:02:28] (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [14:02:34] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]] [14:02:38] T331481: Generalize Android MEP session schema for iOS to use - https://phabricator.wikimedia.org/T331481 [14:03:57] !log lucaswerkmeister-wmde@deploy2002 mazevedo and lucaswerkmeister-wmde: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:04:08] mazevedo: can you test the change on mwdebug? [14:04:20] Lucas_WMDE on it [14:05:04] Lucas_WMDE looking good, thanks! [14:05:10] ok thanks! 
[14:08:32] !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided) [14:08:43] !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 11s) [14:10:28] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]] (duration: 07m 54s) [14:10:33] T331481: Generalize Android MEP session schema for iOS to use - https://phabricator.wikimedia.org/T331481 [14:11:13] (03CR) 10JHathaway: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [14:12:05] BGP alerts in drmrs expected [14:12:17] mazevedo: should be done now, thanks for your patience :) [14:12:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye [14:12:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye [14:13:13] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - 10https://gerrit.wikimedia.org/r/906040 (owner: 10DCausse) [14:13:19] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) ` root@apt2001:/srv/wikimedia# reprepro lsbycomponent istio-cni istio-cni | 1.9.5-1 | bullseye-wikimedia | component/istio195 | amd64 istio-cni | 1.15.7-1 | bullseye-wikimedia... 
[14:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P46112 and previous config saved to /var/cache/conftool/dbconfig/20230406-141319-ladsgroup.json [14:14:02] !log upload new istio-cni and istioctl 1.15.7 debian package versions to bullseye-wikimedia - T334068 [14:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] T334068: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 [14:15:10] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:17] ^ expected [14:16:36] (03PS1) 10Elukey: custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068) [14:16:59] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) [14:17:43] (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) [14:17:59] (03Merged) 10jenkins-bot: rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - 10https://gerrit.wikimedia.org/r/906040 (owner: 10DCausse) [14:19:02] (03CR) 10JMeybohm: [C: 03+1] custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [14:20:44] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] service: move device-analytics to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [14:21:16] !log upgrade istioctl on deploy[12]002 and istio-cni on ml-serve[12]00[1-8] manually - T334068 [14:21:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:20] T334068: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 [14:21:28] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey) [14:21:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:21:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:22:19] (03PS1) 10Filippo Giunchedi: prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192) [14:23:43] (03PS1) 10Ssingh: hiera: lvs6003: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/906592 (https://phabricator.wikimedia.org/T321309) [14:24:45] we are having somewhat of an outage atm [14:24:49] with s1 in eqiad [14:25:31] Slave_SQL_Running_State: Waiting for semi-sync ACK from slave [14:25:34] COME ON [14:25:52] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 73 connections established with conf2005.codfw.wmnet:4001 (min=74) https://wikitech.wikimedia.org/wiki/PyBal [14:26:09] marostegui: can I disable semi-sync on eqiad master of s1? All of the s1 is lagged for a minute [14:26:25] Amir1: need any help from oncallers? [14:26:48] checking error rate would be amazing [14:26:59] restarted replication [14:27:13] now at Slave_SQL_Running_State: Reading event from the relay log [14:27:16] ack [14:27:23] let's see if it gets better [14:27:34] back to Slave_SQL_Running_State: Waiting for semi-sync ACK from slave [14:27:43] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) https://wikitech.wikimedia.org/wiki/PyBal [14:27:48] did we failover s1 master recently? 
[14:27:49] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) https://wikitech.wikimedia.org/wiki/PyBal [14:27:59] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 91 connections established with conf2004.codfw.wmnet:4001 (min=92) https://wikitech.wikimedia.org/wiki/PyBal [14:28:04] (03CR) 10Ssingh: [C: 03+2] hiera: lvs6003: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/906592 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:28:11] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 77 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [14:28:21] (03CR) 10Kamila Součková: [C: 03+1] "LGTM except for the commented." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [14:28:26] hnowlan: are those because of you? ^^^ lvs alerts [14:28:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P46113 and previous config saved to /var/cache/conftool/dbconfig/20230406-142826-ladsgroup.json [14:28:38] probably [14:28:45] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 77 connections established with conf1007.eqiad.wmnet:4001 (min=78) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 123 connections established with conf1007.eqiad.wmnet:4001 (min=124) Hnowlan changing state of device-analytics. 
https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 73 connections established with conf2005.codfw.wmnet:4001 (min=74) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:45] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 91 connections established with conf2004.codfw.wmnet:4001 (min=92) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal [14:28:47] volans: just acked [14:28:48] seems to match the hosts :) [14:28:51] ack thx [14:28:59] maybe the cookbook could silence them... 
[14:29:47] PROBLEM - statsv Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:30:00] that's new [14:30:03] PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:30:55] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967) [14:30:59] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [14:31:19] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 92 connections established with conf2004.codfw.wmnet:4001 (min=92) https://wikitech.wikimedia.org/wiki/PyBal [14:31:20] (03CR) 10Cathal Mooney: [C: 03+2] Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [14:31:21] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:31:24] (03CR) 10Herron: [C: 03+1] Make deploy-tag compulsory [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:31:29] volans: I will look at the cp3064 one [14:31:41] thx [14:31:51] (03Merged) 10jenkins-bot: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [14:31:53] RECOVERY - statsv Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:32:00] cool didn't even look :) [14:32:06] but still, something must be wrong so [14:32:11] RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:32:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967) [14:32:57] Lucas_WMDE ty :) [14:33:57] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [14:34:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [14:34:52] (03CR) 10JHathaway: [C: 03+1] prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi) [14:35:30] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi) [14:37:18] (03PS1) 10Ladsgroup: Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 [14:37:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [14:37:56] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967) [14:38:00] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [14:38:19] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 74 connections established with conf2005.codfw.wmnet:4001 (min=74) https://wikitech.wikimedia.org/wiki/PyBal [14:38:21] RECOVERY - 
PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:38:31] (03CR) 10Ladsgroup: [C: 03+2] Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup) [14:38:33] (03PS5) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) [14:38:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup) [14:39:03] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 78 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [14:39:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967) [14:39:16] (03Merged) 10jenkins-bot: Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup) [14:39:21] (03PS2) 10Cathal Mooney: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) [14:39:27] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:906593|Disable DT backend on enwiki]] [14:40:43] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. 
- cmooney@cumin1001 - T322937" [14:40:47] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [14:40:59] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:906593|Disable DT backend on enwiki]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:41:01] (03CR) 10Cathal Mooney: [C: 03+2] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [14:41:36] (03Merged) 10jenkins-bot: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [14:42:39] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. - cmooney@cumin1001 - T322937" [14:43:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46114 and previous config saved to /var/cache/conftool/dbconfig/20230406-144332-ladsgroup.json [14:43:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:43:37] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:43:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:44:01] (03CR) 10Hnowlan: [C: 03+2] admin: update platform engineering approvers [puppet] - 10https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244) (owner: 10Hnowlan) [14:44:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: 
Maintenance [14:44:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [14:44:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46115 and previous config saved to /var/cache/conftool/dbconfig/20230406-144437-ladsgroup.json [14:46:39] (03PS1) 10Jelto: install_server: fix line break in gitlab parman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) [14:46:41] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:906593|Disable DT backend on enwiki]] (duration: 07m 14s) [14:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46116 and previous config saved to /var/cache/conftool/dbconfig/20230406-144753-ladsgroup.json [14:48:43] (03CR) 10Jelto: "I compared this to the other recipes and it seems the end of recipe should not continue with a . \ (only a .) but the different partitions" [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:52:06] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10Elitre) Of course, approved. 
[14:52:58] (03CR) 10Muehlenhoff: install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:56:17] (03PS1) 10Ssingh: hiera: lvs6003: update interface name [puppet] - 10https://gerrit.wikimedia.org/r/906598 (https://phabricator.wikimedia.org/T321309) [14:57:01] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@8fb20e9]: (no justification provided) [14:57:02] (03CR) 10Ssingh: [C: 03+2] hiera: lvs6003: update interface name [puppet] - 10https://gerrit.wikimedia.org/r/906598 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:57:21] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs6003.drmrs.wmnet with OS bullseye [14:57:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye executed with errors: - lvs6003 (**FAIL**) - Downtimed on... [14:57:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye [14:57:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye [14:58:22] (03CR) 10Jelto: install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [15:02:55] 10SRE, 10Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (10ssingh) 05Resolved→03Open ` Apr 06 14:27:14 cp3064 varnishkafka[1513247]: Condition(c->offset <= c->vtx->len) not true. 
Apr 06 14:27:14 cp3064 systemd[1]: varnishkafka... [15:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P46117 and previous config saved to /var/cache/conftool/dbconfig/20230406-150300-ladsgroup.json [15:04:33] (03PS1) 10BCornwall: sre/systemd: Remove query params from dashboard [alerts] - 10https://gerrit.wikimedia.org/r/906599 [15:07:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you !" [alerts] - 10https://gerrit.wikimedia.org/r/906599 (owner: 10BCornwall) [15:09:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) >>! In T292095#8715082, @Jclark-ctr wrote: > @cmooney Racks e5-7 f5-7 have been cabled and racked do you want to use same ticket f... [15:11:04] (03CR) 10BCornwall: [C: 03+2] sre/systemd: Remove query params from dashboard [alerts] - 10https://gerrit.wikimedia.org/r/906599 (owner: 10BCornwall) [15:16:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [15:17:25] (03PS1) 10Ladsgroup: Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600 [15:18:02] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@8fb20e9]: (no justification provided) (duration: 21m 01s) [15:18:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P46118 and previous config saved to /var/cache/conftool/dbconfig/20230406-151806-ladsgroup.json [15:18:46] (03CR) 10Ladsgroup: [C: 03+2] Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600 (owner: 10Ladsgroup) [15:19:30] (03Merged) 10jenkins-bot: Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600 (owner: 10Ladsgroup) 
[15:19:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [15:20:00] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]] [15:20:31] !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided) [15:20:42] !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 11s) [15:21:19] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [15:21:49] (03CR) 10Muehlenhoff: [C: 03+1] install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [15:28:12] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]] (duration: 08m 11s) [15:29:18] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) [15:29:25] (03PS1) 10Ssingh: admin: add trizek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863) [15:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46119 and previous config saved to /var/cache/conftool/dbconfig/20230406-153312-ladsgroup.json [15:33:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [15:33:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:33:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [15:33:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46120 and previous config saved to /var/cache/conftool/dbconfig/20230406-153335-ladsgroup.json [15:35:49] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46121 and previous config saved to /var/cache/conftool/dbconfig/20230406-153602-ladsgroup.json [15:41:10] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10cmooney) [15:42:10] (03PS1) 10Ssingh: hiera: update lvs6003 interfaces in common/interfaces.yaml [puppet] - 10https://gerrit.wikimedia.org/r/906604 (https://phabricator.wikimedia.org/T321309) [15:42:47] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bullseye [15:42:57] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**) - Downtimed on Icinga/Aler... 
[15:43:23] (03CR) 10Jelto: [C: 03+2] install_server: fix line break in gitlab parman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [15:44:10] (03CR) 10Ssingh: [C: 03+2] hiera: update lvs6003 interfaces in common/interfaces.yaml [puppet] - 10https://gerrit.wikimedia.org/r/906604 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P46122 and previous config saved to /var/cache/conftool/dbconfig/20230406-155108-ladsgroup.json [15:51:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863) (owner: 10Ssingh) [15:53:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye [15:53:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye [15:54:02] (03CR) 10EoghanGaffney: "I've taken care of, I think, all of the comments below (except one TODO about a DNS check, that will come later). Follow-up review would b" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [15:57:05] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:45] (03CR) 10Ssingh: [C: 03+2] admin: add trizek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863) (owner: 10Ssingh) [16:00:05] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) [16:02:06] ^^^ BGP alert on asw1-b12-drmrs relates to reimage of lvs6003 su.khe has kicked off [16:02:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) 05Open→03Resolved a:03ssingh @Trizek-WMF: Your access request has been merged. Please try logging in in about 30 minutes and feel free re-open... [16:05:12] !log Enable BGP EVPN sessions between eqiad row e/f Leaf and Spine devices [16:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [16:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P46123 and previous config saved to /var/cache/conftool/dbconfig/20230406-160614-ladsgroup.json [16:09:27] (03PS1) 10Cathal Mooney: Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/906627 (https://phabricator.wikimedia.org/T322937) [16:12:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [16:15:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage [16:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46124 and previous config saved to /var/cache/conftool/dbconfig/20230406-162120-ladsgroup.json [16:21:23] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:21:25] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:21:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:21:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46125 and previous config saved to /var/cache/conftool/dbconfig/20230406-162144-ladsgroup.json [16:24:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46126 and previous config saved to /var/cache/conftool/dbconfig/20230406-162409-ladsgroup.json [16:31:33] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bullseye [16:34:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**) - Downtimed on Icinga/Aler... 
[16:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P46127 and previous config saved to /var/cache/conftool/dbconfig/20230406-163916-ladsgroup.json [16:41:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs6003.drmrs.wmnet [16:41:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs6003.drmrs.wmnet [16:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P46128 and previous config saved to /var/cache/conftool/dbconfig/20230406-165422-ladsgroup.json [16:58:53] (03CR) 10Volans: "Much nicer! I found some smaller things that still need fixing, but we should be closed." [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [16:58:59] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [17:00:04] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1700). 
[17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1700) [17:02:40] (03CR) 10FNegri: [C: 03+1] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: 10Andrew Bogott) [17:02:57] (03CR) 10Andrew Bogott: [C: 03+2] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: 10Andrew Bogott) [17:05:23] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@318480e]: Fix for dump_month_of_daily_pageviews dag - Analytics [airflow-dags@318480e] [17:05:38] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@318480e]: Fix for dump_month_of_daily_pageviews dag - Analytics [airflow-dags@318480e] (duration: 00m 14s) [17:09:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46129 and previous config saved to /var/cache/conftool/dbconfig/20230406-170928-ladsgroup.json [17:09:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [17:09:33] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:09:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [17:10:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:10:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:10:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46130 and previous config saved to /var/cache/conftool/dbconfig/20230406-171028-ladsgroup.json [17:12:45] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3007.esams.wmnet [17:12:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T333332)', diff saved to https://phabricator.wikimedia.org/P46131 and previous config saved to /var/cache/conftool/dbconfig/20230406-171254-ladsgroup.json [17:14:55] (03CR) 10Volans: "thanks for all the fixes, just found couple of minor issues, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [17:15:00] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Would it be possible to use both serial number and/or asset tag for the match? I'll follow up with Julianne (she's currently out) regarding the formula being us... [17:16:30] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) But if the asset tag comes from netbox it will not match anything for future hosts... as the host will not be anymore in Netbox :) [17:19:11] (03PS1) 10Volans: reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955) [17:19:48] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy I've sent the above patch to match on serial instead of asset tag. LMK what do you want to do. 
[17:19:52] 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh) [17:22:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs3007.esams.wmnet [17:24:19] 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh) (Person on clinic duty here): I initially removed the `SRE-Access-Requests` tag, my apologies, because I thought that this... [17:28:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P46132 and previous config saved to /var/cache/conftool/dbconfig/20230406-172800-ladsgroup.json [17:31:55] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down. [17:32:22] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down. [17:32:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7d20917-1f70-4c85-bea4-4fae89694441) set by cmooney@cumin1001 f... [17:32:37] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-f1-eqiad down. [17:32:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-f1-eqiad down. 
[17:33:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=09fdc8d3-92d3-4c3b-8e46-8c1befa6a846) set by cmooney@cumin1001 f... [17:33:07] (03CR) 10Volans: [C: 03+2] reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [17:33:35] (03PS1) 10Ssingh: hiera: lvs3007: update iface names for bullseye (esams) [puppet] - 10https://gerrit.wikimedia.org/r/906636 (https://phabricator.wikimedia.org/T321309) [17:33:59] (03Merged) 10jenkins-bot: reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [17:34:01] (03CR) 10Dzahn: [C: 03+1] "@Antoine What do you think, can I just merge it and we see when we get to it? pretty sure we won't want "wmflabs" in there and just need t" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [17:34:08] 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dwisehaupt) @ssingh Thanks for looking at this. This task was created to help capture the output of some ongoing discussions to fi... 
[17:34:11] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [17:34:18] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [17:35:06] (03PS6) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) [17:35:18] (03CR) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [17:35:21] BGP alerts in esams expected [17:36:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3007.esams.wmnet with OS bullseye [17:36:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye [17:37:05] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:12] ^ expected [17:38:01] (03PS11) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [17:38:03] (03PS1) 10David Caro: maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637 [17:39:10] (03CR) 10David Caro: "The one adding prometheus will follow this one, wanted to split the click change and the prometheus one" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [17:39:19] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:47] (03CR) 
10BPirkle: "Is this really the phab task you intended to link?" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff) [17:39:48] 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) my 2 cents: Yes, it's possible to resolve this with a new admin group that gets sudo privs to run a particular set of cook... [17:40:50] (03CR) 10CI reject: [V: 04-1] maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637 (owner: 10David Caro) [17:41:14] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [17:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P46133 and previous config saved to /var/cache/conftool/dbconfig/20230406-174306-ladsgroup.json [17:45:46] 10SRE, 10SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (10Dzahn) Currently the process to sign a new NDA is under way. Once that is confirmed on T333884 it would be a good time to also resolve this ticket. 
[17:46:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Dzahn) a:05Ladsgroup→03ItamarWMDE [17:46:59] (03PS1) 10Volans: reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955) [17:47:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Dzahn) 05Open→03In progress [17:47:37] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Dzahn) 05Open→03In progress [17:47:53] (03CR) 10Volans: [C: 03+2] reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [17:48:41] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:42] (03Merged) 10jenkins-bot: reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [17:49:52] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [17:49:59] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [17:51:13] (03CR) 10Ssingh: "Merging before Puppet kicks in :)" [puppet] - 10https://gerrit.wikimedia.org/r/906636 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:51:15] (03CR) 10Ssingh: [C: 03+2] hiera: lvs3007: update iface names for bullseye (esams) [puppet] - 10https://gerrit.wikimedia.org/r/906636 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:55:54] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) 05Open→03Resolved Thanks @Volans. It looks like we're all set now. https://netbox.wikimedia.org/extras/reports/results/4443574/ [17:56:09] (03CR) 10Volans: [C: 03+1] "LGTM for the cookbook/python stuff, I'll leave it to the gitlab experts for the logic" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [17:58:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T333332)', diff saved to https://phabricator.wikimedia.org/P46134 and previous config saved to /var/cache/conftool/dbconfig/20230406-175813-ladsgroup.json [17:58:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [17:58:18] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:58:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [17:58:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:58:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:58:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3007.esams.wmnet with reason: host reimage [17:58:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46135 and previous config saved to /var/cache/conftool/dbconfig/20230406-175854-ladsgroup.json [18:00:05] hashar and dduvall: It is that lovely time of the day again! 
You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1800). [18:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46136 and previous config saved to /var/cache/conftool/dbconfig/20230406-180119-ladsgroup.json [18:02:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3007.esams.wmnet with reason: host reimage [18:04:55] (03PS3) 10Majavah: Remove osmdb records [dns] - 10https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159) [18:06:49] (03PS2) 10Majavah: openstack: remove osmdb dns records [puppet] - 10https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159) [18:06:51] (03PS2) 10Majavah: P:wmcs: remove osmdb classes [puppet] - 10https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159) [18:06:53] (03PS2) 10Majavah: osm: remove unuseud shapefile_import class [puppet] - 10https://gerrit.wikimedia.org/r/892905 [18:07:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:16:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P46137 and previous config saved to /var/cache/conftool/dbconfig/20230406-181625-ladsgroup.json [18:17:53] (03PS1) 10BCornwall: hiera: lvs/interfaces: update 6001 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) [18:17:55] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:01] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:18:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3007.esams.wmnet with OS bullseye [18:19:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye completed: - lvs3007 (**PASS**) - Downtimed on Icinga/Aler... [18:20:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:23:49] please hold off making any netbox changes for now [18:26:33] ok [18:31:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P46138 and previous config saved to /var/cache/conftool/dbconfig/20230406-183132-ladsgroup.json [18:33:39] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Thanks @Dzahn. I understood it was /fnavas-wmf\ not just /fnavas\ Neither of those two or /fnavas-foundation\ allow me to log-in on the wikitech ma... [18:37:30] (03CR) 10Stevemunene: [C: 03+2] modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485) (owner: 10Mforns) [18:38:38] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder) [18:41:06] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Hi! So,, each LDAP user has multiple fields, uid, sn and cn and depending on whether it's an SSH login, a wiki login or other, confusingly a different one may b... 
[18:43:10] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) P.S. I did test if wikitech wiki sends out email to myself, and it did. that's why I am saying to check on the ITS side. [18:46:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46139 and previous config saved to /var/cache/conftool/dbconfig/20230406-184638-ladsgroup.json [18:46:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:46:43] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:46:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:47:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46140 and previous config saved to /var/cache/conftool/dbconfig/20230406-184701-ladsgroup.json [18:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46141 and previous config saved to /var/cache/conftool/dbconfig/20230406-184929-ladsgroup.json [18:54:07] (03PS2) 10Ssingh: hiera: lvs/interfaces: update 6001 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:54:24] (03CR) 10Ssingh: [C: 03+1] "Looks good but let's hold till we figure out the Netbox restore, just in case!" 
[puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P46142 and previous config saved to /var/cache/conftool/dbconfig/20230406-190435-ladsgroup.json [19:14:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P46143 and previous config saved to /var/cache/conftool/dbconfig/20230406-191941-ladsgroup.json [19:26:34] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@b454afd]: (no justification provided) [19:26:45] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@b454afd]: (no justification provided) (duration: 00m 11s) [19:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46144 and previous config saved to /var/cache/conftool/dbconfig/20230406-193447-ladsgroup.json [19:34:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:34:53] T333332: Add af_actor/afh_actor fields to wmf wikis - 
https://phabricator.wikimedia.org/T333332 [19:35:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46145 and previous config saved to /var/cache/conftool/dbconfig/20230406-193510-ladsgroup.json [19:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46146 and previous config saved to /var/cache/conftool/dbconfig/20230406-193737-ladsgroup.json [19:45:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:45:38] (03PS2) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 [19:45:41] (03PS3) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 [19:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P46147 and previous config saved to /var/cache/conftool/dbconfig/20230406-195243-ladsgroup.json [19:59:39] (03PS2) 10Eevans: swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) [20:00:05] brennen and TheresNoTime: May I have your attention please! UTC late backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T2000) [20:00:19] nary a patch to be found [20:00:42] :D [20:01:46] (03CR) 10Eevans: [C: 03+2] swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) (owner: 10Eevans) [20:03:00] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:04:32] this is what i like to see when i belatedly remember it's the backport window. [20:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P46148 and previous config saved to /var/cache/conftool/dbconfig/20230406-200750-ladsgroup.json [20:09:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1013.eqiad.wmnet [20:09:45] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1014.eqiad.wmnet [20:10:37] PROBLEM - Check systemd state on ms-be1061 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:56] (03CR) 10Ssingh: "To be merged on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:15:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1013.eqiad.wmnet [20:15:27] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1061 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:15:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10wiki_willy) a:03Jclark-ctr [20:16:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1014.eqiad.wmnet [20:17:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [20:19:07] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:19:37] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) - https://phabricator.wikimedia.org/T333997 (10wiki_willy) a:03Jclark-ctr [20:20:00] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1003.wikimedia.org (A3) - https://phabricator.wikimedia.org/T333996 (10wiki_willy) a:03Jclark-ctr [20:20:55] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [20:22:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46149 and previous config saved to 
/var/cache/conftool/dbconfig/20230406-202256-ladsgroup.json [20:22:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:23:01] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:23:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46150 and previous config saved to /var/cache/conftool/dbconfig/20230406-202319-ladsgroup.json [20:24:14] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (wiki_willy) [20:24:46] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (wiki_willy) [20:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46151 and previous config saved to /var/cache/conftool/dbconfig/20230406-202535-ladsgroup.json [20:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P46152 and previous config saved to /var/cache/conftool/dbconfig/20230406-204041-ladsgroup.json [20:41:31] !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [20:41:57] (PS4) Subramanya Sastry: Make VE on officewiki use Parsoid directly [mediawiki-config] - https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler) [20:42:23] (CR) Subramanya Sastry: "Rebased and resolved merge conflict."
[mediawiki-config] - https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler) [20:43:44] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937" [20:43:48] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [20:44:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937" [20:45:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [20:48:38] SRE, DBA, DiscussionTools, Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (matmarex) [20:49:22] !log eevans@cumin1001 conftool action : set/weight=40; selector: name=ms-fe1013.eqiad.wmnet [20:49:36] !log eevans@cumin1001 conftool action : set/weight=40; selector: name=ms-fe1014.eqiad.wmnet [20:49:56] !log eevans@cumin1001 conftool action : set/pooled=yes; selector: name=ms-fe1013.eqiad.wmnet [20:50:01] !log eevans@cumin1001 conftool action : set/pooled=yes; selector: name=ms-fe1014.eqiad.wmnet [20:51:28] (CR) Subramanya Sastry: [C: +1] Make VE on officewiki use Parsoid directly [mediawiki-config] - https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler) [20:52:02] SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (Eevans) Open→Resolved Done!
[20:53:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P46153 and previous config saved to /var/cache/conftool/dbconfig/20230406-205548-ladsgroup.json [20:57:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:59:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:00:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:00:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:02:31] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:02:47] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:04:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:05:20] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:07:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46154 and previous config saved to /var/cache/conftool/dbconfig/20230406-211054-ladsgroup.json [21:10:59] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:15:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.708 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:15:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:18:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:19:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED [21:22:40] (PS1) Mforns: analytics::refinery::job::druid_load: absent all jobs [puppet] - https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095) [21:23:03] (CR) CI reject: [V: -1] analytics::refinery::job::druid_load: absent all jobs [puppet] - https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095) (owner: Mforns) [21:30:14] (PS1) Mforns: analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906662 (https://phabricator.wikimedia.org/T334095) [21:33:51] (Abandoned) Mforns: analytics::refinery::job::druid_load: absent all jobs [puppet] - https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095) (owner: Mforns) [21:33:56] (Abandoned) Mforns: analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906662 (https://phabricator.wikimedia.org/T334095) (owner: Mforns) [21:37:02] (PS1) Mforns: ::analytics::refinery::job::druid_load:
absent remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906665 (https://phabricator.wikimedia.org/T334095) [21:43:15] (PS1) Mforns: ::analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906667 [21:43:16] Hey all - I’d like to fix a small issue in /private on production and deploy - let me know if I should hold off. [21:43:45] (PS2) Mforns: ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906667 [21:46:10] (CR) CI reject: [V: -1] ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906667 (owner: Mforns) [21:47:16] (PS3) Mforns: ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - https://gerrit.wikimedia.org/r/906667 (https://phabricator.wikimedia.org/T334095) [21:52:53] !log Deployed updated mitigation for T333140 [21:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:34] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (Jhancock.wm) Open→Resolved Received the new drive this afternoon. Worked with Matthew to replace the drive. It seems to be working and no longer throwing errors. Going to... [22:28:36] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (phaultfinder) [22:32:45] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (Jhancock.wm) @Marostegui would you be able to help me with this swap? If so when would work best for you? [23:27:14] (PS1) Kevin Bazira: httpbb: Add test cases for trwiki editquality inference services [puppet] - https://gerrit.wikimedia.org/r/906687 (https://phabricator.wikimedia.org/T334158)