[00:01:00] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:14] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:13:34] 10SRE: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (10ZnashBR) [00:14:48] PROBLEM - MariaDB Replica Lag: s7 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43203.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:21:20] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:08] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:32] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool 
- https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [00:43:20] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:02] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:55:48] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:07:20] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:24:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:26:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:23:11] (03CR) 
10DannyS712: CommonSettings: clean up and simplify some code (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 (owner: 10DannyS712) [02:41:06] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:45:50] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:58:22] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:24] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:13:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:34:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:18] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc 
excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:21:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [04:43:38] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:57] 10SRE: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (10Aklapper) @ZnashBR: Hi and welcome! Can you please elaborate on the Wikimedia Affiliate supporting project and who you have been in contact with? 
[05:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:09:30] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:22:36] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 57223 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [05:29:32] (03PS3) 10KartikMistry: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) [05:32:51] (03PS1) 10Ayounsi: Rename cloudstore to clouddump [homer/public] - 10https://gerrit.wikimedia.org/r/806026 (https://phabricator.wikimedia.org/T302981) [05:35:41] (03CR) 10Ayounsi: [C: 03+2] Rename cloudstore to clouddump [homer/public] - 10https://gerrit.wikimedia.org/r/806026 (https://phabricator.wikimedia.org/T302981) (owner: 10Ayounsi) [05:39:59] (03PS1) 10Ayounsi: Add cloudstore with clouddumps [homer/public] - 10https://gerrit.wikimedia.org/r/806067 (https://phabricator.wikimedia.org/T302981) [05:40:57] (03CR) 10Ayounsi: [C: 03+2] Add cloudstore with clouddumps [homer/public] - 10https://gerrit.wikimedia.org/r/806067 (https://phabricator.wikimedia.org/T302981) (owner: 10Ayounsi) [05:53:24] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0600). [06:09:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:10:02] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:13:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:15:38] (03PS1) 10Thiemo Kreuz (WMDE): Fix unsupported $wgLogos default configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) [06:18:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:26:10] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] CommonSettings: clean up and simplify some code (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 (owner: 10DannyS712) [06:32:41] (03CR) 10Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [06:34:41] (03CR) 10DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [06:49:07] !log Rerun webrequest-load-wf-upload-2022-6-15-22 after weird oozie failure [06:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and apergos: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0700). [07:00:04] kart_ and TheresNoTime: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] good morning! [07:00:16] apergos: morning! :D [07:00:22] we have a trainee signed up although they have not arrived at the gmeet yet [07:00:39] kart_: I imagine you would self deploy. but. can we coordinate a bit? [07:01:04] I'd like to have you screen share the deployment steps while I talk through the process, if you can work with that [07:01:53] (in the meantime our trainee did just join the gmeet so that's all good) [07:02:21] * kart_ is here. 
[07:02:35] see my question to you [07:03:22] Message me GMeet link, I can join. [07:03:28] ok! [07:07:58] (03CR) 10KartikMistry: [C: 03+2] "UTC morning backport deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [07:08:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:08:49] (03Merged) 10jenkins-bot: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [07:11:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:12:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:31] (03CR) 10Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [07:18:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:18:25] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:08] (03PS1) 10Slyngshede: Fix LDAP / Puppet mismatch for cmyrick [puppet] - 10https://gerrit.wikimedia.org/r/806071 [07:22:24] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:805370|testwiki: Enable SectionTranslation for 11 Wikipedias (T309384 T310116)]] (duration: 03m 41s) [07:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:29] T310116: Enable Section Translation in Uzbek Wikipedia - https://phabricator.wikimedia.org/T310116 [07:22:29] T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384 [07:23:23] (03CR) 10Jcrespo: "Should probably reference the admin: module on commit topic and Bug:T310524 on the second to last line for better searchability/context?" [puppet] - 10https://gerrit.wikimedia.org/r/806071 (owner: 10Slyngshede) [07:24:08] (03CR) 10Slyngshede: [V: 03+1] "A little cleanup, using the logging built into the system::timer module." 
[puppet] - 10https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:24:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:24:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:37] (03PS2) 10Slyngshede: M:admin/data/data.yaml Fix LDAP / Puppet mismatch for cmyrick [puppet] - 10https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) [07:30:27] I'm done with deployment @apergos [07:30:32] awesome! [07:30:59] if anyone else has a patch and would like to self deploy, now's the time. otherwise I'll wander off in a few minutes [07:33:44] thank you both! :) [07:34:28] thank you for showing up and thanks kart_ for being 1/2 of the training as well as deploying! [07:36:18] Thank you for joining :) [07:37:19] apergos: should I resolve T305191 or leave it open? I'll be joining the next (few) training sessions regardless :) [07:37:20] T305191: Deployment training request for TheresNoTime - https://phabricator.wikimedia.org/T305191 [07:38:06] TheresNoTime: just mark that you did it and let Tyler close I think [07:38:35] you can (and we like it if you) come to many more trainings, regardless of the task being closed. and then eventually... [07:38:43] after you've been deploying for awhile... [07:38:53] you start helping to give these trainings :-) [07:39:01] (our secret plan is now revealed!) 
[07:40:10] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10SLyngshede-WMF) p:05Triage→03Medium [07:42:40] 10SRE, 10Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (10SLyngshede-WMF) p:05Triage→03High [07:44:52] (03CR) 10Jcrespo: "Thank you. Note just "admin:" before the subject should be enough (the "module" name) 0:-). See the example at: https://www.mediawiki.org/" [puppet] - 10https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: 10Slyngshede) [07:46:33] apergos: \o/ you were very good at the training fwiw, been doing it a while? [07:47:13] I have, I used to do other sorts of trainings in the organizer/activist realm and so I know some things about doing trainings from that experience :-) thanks! [07:47:26] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:47:50] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:05] (03CR) 10Volans: sre.hosts.pxe: Cookbook to configure dhcp option82 and reboot into pxe (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [07:50:56] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:53:10] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:58:50] (03CR) 10Filippo Giunchedi: [C: 03+2] Enforce alert names with no spaces [alerts] - 10https://gerrit.wikimedia.org/r/805393 (owner: 10Filippo Giunchedi) [07:58:56] RECOVERY - Check systemd state on 
thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:51] (03CR) 10JMeybohm: [C: 03+1] mediawiki chart 0.2.3: Add before-hook-creation hook-delete-policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 (owner: 10Ahmon Dancy) [08:11:58] (03CR) 10Tacsipacsi: CommonSettings: clean up and simplify some code (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 (owner: 10DannyS712) [08:12:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: 10Slyngshede) [08:13:05] (03PS3) 10Muehlenhoff: Retire profile::logster_alarm [puppet] - 10https://gerrit.wikimedia.org/r/805734 [08:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:22:11] (03CR) 10Muehlenhoff: [C: 03+2] Retire profile::logster_alarm [puppet] - 10https://gerrit.wikimedia.org/r/805734 (owner: 10Muehlenhoff) [08:23:26] (03CR) 10Slyngshede: [C: 03+2] M:admin/data/data.yaml Fix LDAP / Puppet mismatch for cmyrick [puppet] - 10https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: 10Slyngshede) [08:24:00] (03CR) 10David Caro: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [08:24:50] (03CR) 10Awight: "(I think this has the wrong bug number)" [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: 10Cwhite) [08:25:56] PROBLEM - Check systemd state on ml-cache1001 is CRITICAL: CRITICAL 
- degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:40] this is wip --^ [08:33:05] (03CR) 10Volans: [C: 03+1] "LGTM, very minor nits inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [08:35:52] (03CR) 10Volans: [C: 04-1] "LGTM, thanks for the patch. Just one safety check to add." [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond) [08:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [08:36:54] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [08:36:56] (03PS1) 10Filippo Giunchedi: swift: drop REPLICATE 'access log' from container-server [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) [08:39:22] (03CR) 10Filippo Giunchedi: "LGTM, modulo what Awight said" [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: 10Cwhite) [08:39:27] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add test2 partition to ecs-test policy [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: 10Cwhite) [08:40:13] (03CR) 10Volans: [C: 03+1] "LGTM, once the pre-requisite patches have been merged feel free to start testing it." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [08:45:19] !log failover ganeti master in drmrs/2 to ganeti6004 [08:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:27] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, won't work as-is" [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [08:48:05] (03CR) 10Filippo Giunchedi: [C: 03+2] am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 (owner: 10Filippo Giunchedi) [08:48:10] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805383 (owner: 10Filippo Giunchedi) [08:48:43] (03PS5) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 [08:49:08] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:49:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [08:49:52] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:25] taavi: I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/802104 ok ? [08:51:31] sure! 
[08:51:36] godog: ^ [08:52:09] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [08:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:00] ack [08:53:03] (03PS2) 10Filippo Giunchedi: prometheus: use hostname for blackbox::check::http [puppet] - 10https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847) [08:53:05] (03PS3) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [08:53:07] (03PS2) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [08:53:09] (03CR) 10Filippo Giunchedi: [C: 03+2] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [08:53:22] {{done}} [08:53:26] taavi: ^ [08:54:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use hostname for blackbox::check::http (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:54:21] thanks [08:55:45] sure np [08:56:42] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:59:24] (03PS1) 10Elukey: Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - 10https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232) [08:59:26] (03CR) 10Slyngshede: [C: 03+2] class:apt Add new private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [09:00:56] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti6002.drmrs.wmnet [09:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:00] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:09:12] (03CR) 10Jbond: "Seems fine but we should clean up the exports vhost at the same time" [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [09:11:02] (03PS2) 10Elukey: Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - 10https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232) [09:11:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:11:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29868 and previous config saved to 
/var/cache/conftool/dbconfig/20220616-091131-marostegui.json [09:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:37] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [09:12:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - 10https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [09:14:48] (03PS8) 10Slyngshede: WIP: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [09:15:02] (03PS1) 10Elukey: role::ml_cache::storage: add TLS settings for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232) [09:15:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35884/console" [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [09:15:59] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 61174 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [09:16:29] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_cache::storage: add TLS settings for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [09:16:32] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [09:18:22] RECOVERY - cassandra-a service on 
ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:19:32] (03PS2) 10Jbond: SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 [09:21:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS buster [09:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:08] PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:24:31] (03CR) 10Ayounsi: [V: 03+1] Prometheus: scrap Netbox django metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [09:26:40] (03PS10) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [09:26:42] (03PS25) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [09:27:47] (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [09:27:58] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:29:42] (03CR) 10CI reject: [V: 04-1] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:30:22] (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [09:30:54] * jayme ❤️ pylint [09:30:58] (03PS1) 10Jbond: SREBatchBase: Make action method a bit more dynamic 
[cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [09:30:58] PROBLEM - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:31:51] (03CR) 10Jbond: SREBaseClass: Allow overriding actions (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond) [09:32:04] (03PS11) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [09:32:06] (03PS26) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [09:32:08] (03CR) 10Filippo Giunchedi: [C: 04-1] Prometheus: scrap Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [09:32:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1003.eqiad.wmnet with OS buster [09:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [09:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:56] (03PS2) 10JMeybohm: Align cumin aliases for wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/790662 (https://phabricator.wikimedia.org/T260661) [09:36:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [09:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:04] (03CR) 10Jbond: "JKuyst noticed i forgot to" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [09:36:34] RECOVERY - k8s API server requests latencies on 
ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:37:42] (03PS53) 10David Caro: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [09:39:32] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:39:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:40:57] (03PS2) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) [09:40:59] (03PS2) 10Ayounsi: Prometheus: gently pull Netbox django metrics [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) [09:41:33] (03CR) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [09:41:44] (03CR) 10Ayounsi: Prometheus: gently pull Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [09:42:52] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:44:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [09:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [09:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:13] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:49:41] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:56:59] jbond: re: the phab p a g e yesterday, that's icinga not prometheus that pages [09:57:16] godog: ack thanks i noticed but too late :) [09:57:47] hehe! I'm looking at the phab probes now though, definitely better with the hostname [09:57:53] i.e. https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=http&orgId=1&from=now-3h&to=now&var-site=All [09:58:01] (03CR) 10MVernon: [C: 03+1] "Seems good to me, thanks; once we're done with the bullseye upgrade, might worth seeing if swift has some knobs to twiddle to make it a bi" [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [09:58:27] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: drop REPLICATE 'access log' from container-server [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [09:58:36] (03CR) 10Filippo Giunchedi: [C: 03+2] "Cheers Matthew, merging" [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [10:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1000). 
[10:02:36] !log ran `scap install-world --batch` on deploy1002 to allow scap/puppet to work on ml-cache100[2,3] [10:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:08] godog: definitely, delay was wondering and checking why we only had phab1001 but it's the only one with monitoring configured [10:06:29] (03CR) 10Jbond: "LGTM just minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:06:49] RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:07:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805448 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:08:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1003.eqiad.wmnet with OS buster [10:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [10:11:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS buster [10:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:31] PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:16:21] RECOVERY - Check systemd state on ml-cache1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:35] (03PS1) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) [10:16:41] 
RECOVERY - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is OK: SSL OK - Certificate ml-cache1001-a valid until 2024-06-15 08:50:14 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [10:16:45] RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:17:29] (03CR) 10CI reject: [V: 04-1] swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [10:17:33] jbond: yeah I went with the existing guard for the active host only, though that should be revisited IMHO (in a future iteration) [10:18:25] RECOVERY - MariaDB Replica Lag: s7 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:45] (03PS2) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) [10:18:45] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:56] godog: agree [10:20:05] (03PS1) 10Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) [10:21:33] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483? [10:21:35] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483? 
[10:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:34] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35885/console" [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [10:23:01] hah, phab on ipv6 is failing because envoy isn't listening on :443 on ipv6 [10:23:05] "fair enough" [10:24:35] ok gotta go to lunch! [10:24:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:25:03] (03PS1) 10Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) [10:25:57] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:20] -7 [10:28:22] uff [10:28:53] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483 [10:28:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483 [10:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:13] RECOVERY - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is OK: TCP OK - 0.000 second response time on 10.64.130.9 port 9042 https://phabricator.wikimedia.org/T93886 [10:31:17] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29869 and previous config saved to /var/cache/conftool/dbconfig/20220616-103117-marostegui.json [10:31:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1089.eqiad.wmnet [10:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:21] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [10:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:33:17] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:34:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [10:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1089.eqiad.wmnet [10:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: reboots [10:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: reboots [10:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [10:37:53] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:11] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond) [10:41:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [10:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:44] (03CR) 10Volans: [C: 03+1] "I like it, see small nit inline for the naming, and yes might require some additional changes elsewhere." [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [10:41:46] (03CR) 10Jbond: [C: 03+2] SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond) [10:44:05] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [10:45:11] RECOVERY - Check systemd state on netflow5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:17] (03Merged) 10jenkins-bot: SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond) [10:45:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [10:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [10:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29870 and previous config saved to /var/cache/conftool/dbconfig/20220616-104622-marostegui.json [10:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 1:00:00 on elastic[1100-1102].eqiad.wmnet with reason: reboots [10:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic[1100-1102].eqiad.wmnet with reason: reboots [10:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 [10:49:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [10:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:33] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [10:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:01] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [10:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:56:00] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [10:56:46] (03PS2) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [10:57:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [10:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] 
(03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [10:58:26] (03CR) 10Jbond: sre.hosts.pxe: Cookbook to configure dhcp option82 and reboot into pxe (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [11:00:25] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:00:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29871 and previous config saved to /var/cache/conftool/dbconfig/20220616-110127-marostegui.json [11:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:22] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [11:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:29] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [11:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:43] RECOVERY - BGP status on cr1-eqiad is 
OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:05] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1003.eqiad.wmnet [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:29] (03PS9) 10Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [11:12:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [11:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:15:04] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35886/console" [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [11:15:50] (03CR) 10Volans: [C: 03+1] "LGTM, see comment inline too." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [11:16:05] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:16:21] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet [11:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29873 and previous config saved to /var/cache/conftool/dbconfig/20220616-111632-marostegui.json [11:16:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:16:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:38] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [11:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [11:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:31] Sorry about the BGP error noise. I'm rebooting ml k8s nodes for new kernels and that triggers it. 
I could put in a silence but it seems there are other BGP alerts that I might step on [11:18:38] (03PS3) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) [11:18:42] (03CR) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [11:19:04] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet [11:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:55] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [11:22:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [11:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:47] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi) [11:25:26] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:27] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1005.eqiad.wmnet [11:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:01] (03PS1) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 [11:31:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [11:32:07] (03PS3) 10Muehlenhoff: Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670 [11:32:39] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond) [11:32:59] (03PS2) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 [11:33:10] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1005.eqiad.wmnet [11:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on testvm[2001-2005].codfw.wmnet with reason: reboots [11:34:45] (03PS3) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 [11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on testvm[2001-2005].codfw.wmnet with reason: reboots [11:34:54] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:57] (03CR) 10Jbond: [V: 04-1 C: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond) [11:35:05] (03CR) 10Jbond: [V: 04-1 C: 04-2] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond) [11:35:21] !log trim swift logs older than 25d from centrallog hosts - T309171 [11:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:25] T309171: syslog / centrallog log volume growth - https://phabricator.wikimedia.org/T309171 [11:36:34] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:36:43] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:37:02] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:32] (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond) [11:37:58] (03PS10) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [11:38:17] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1006.eqiad.wmnet [11:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:50] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:40:00] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:40:18] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:41:00] (03CR) 10CI reject: [V: 04-1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [11:43:46] (03PS18) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [11:43:55] (03CR) 10Jbond: [C: 04-1] "We also need to keep everything in src/main/resources (theses are our skinning customisations)" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [11:44:36] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1006.eqiad.wmnet [11:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] (03PS11) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [11:45:29] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1007.eqiad.wmnet [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:55] (03CR) 10Muehlenhoff: cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [11:48:19] (03CR) 10Jbond: [C: 04-1] cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [11:50:18] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [11:50:20] (03CR) 10Jbond: Bump changelog for 6.5.5 
and add some docs how to resync the overlay (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [11:50:22] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans) [11:52:07] (03CR) 10Hnowlan: cassandra: load grants files upon change (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [11:52:28] (03PS1) 10Slyngshede: P:apt do not include private apt repo on cloud hosts. [puppet] - 10https://gerrit.wikimedia.org/r/806197 [11:52:30] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:53:03] (03CR) 10Btullis: "I'd be grateful for a review of this please. 
The idea is to be able to have real-time information in Prometheus about which servers are su" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [11:53:11] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1007.eqiad.wmnet [11:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:39] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1008.eqiad.wmnet [11:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35887/console" [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede) [11:55:56] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:44] (03CR) 10Jbond: [C: 03+1] SREBatchBase: Make action method a bit more dynamic (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [11:56:59] (03PS3) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [11:58:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede) [11:58:44] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:apt do not include private apt repo on cloud hosts. [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede) [11:58:51] (03CR) 10Muehlenhoff: P:apt do not include private apt repo on cloud hosts. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede) [11:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 for schema change', diff saved to https://phabricator.wikimedia.org/P29874 and previous config saved to /var/cache/conftool/dbconfig/20220616-115924-root.json [11:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:40] (03PS1) 10Slyngshede: hiera:cloud fix comma [puppet] - 10https://gerrit.wikimedia.org/r/806198 [12:00:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806198 (owner: 10Slyngshede) [12:01:07] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1008.eqiad.wmnet [12:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:10] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:01:44] done with reboots for now. 
Any remaining BGP alerts are, like, for real [12:01:58] (03CR) 10Slyngshede: [C: 03+2] hiera:cloud fix comma [puppet] - 10https://gerrit.wikimedia.org/r/806198 (owner: 10Slyngshede) [12:02:50] (03CR) 10Jbond: [V: 04-1 C: 04-2] DO NOT MERGE: try to reproduce an issue (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond) [12:11:05] (03PS1) 10Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) [12:15:18] (03PS1) 10Btullis: Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) [12:15:42] (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670 (owner: 10Muehlenhoff) [12:16:00] (03CR) 10Lucas Werkmeister (WMDE): "Hm, what effect will this change have? As far as I can tell from WikibaseCirrusSearch code, this doesn’t look like a no-op…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [12:16:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:23] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35888/console" [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [12:19:01] (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [12:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:20:46] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:26] (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans) [12:24:41] (03PS6) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 [12:24:44] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [12:25:52] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans) [12:26:39] (03PS4) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [12:27:02] (03PS1) 10Slyngshede: C:apt actively absent privte repo if not requested. 
[puppet] - 10https://gerrit.wikimedia.org/r/806206 [12:27:36] (03CR) 10CI reject: [V: 04-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [12:28:45] (03PS4) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [12:28:47] (03PS3) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [12:28:49] (03PS1) 10Filippo Giunchedi: phabricator: get envoy to listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) [12:29:23] (03CR) 10Jbond: [C: 03+2] wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond) [12:29:44] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35889/console" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:29:52] (03CR) 10CI reject: [V: 04-1] C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede) [12:31:17] (03PS19) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:31:53] (03PS2) 10Slyngshede: C:apt actively absent privte repo if not requested. 
[puppet] - 10https://gerrit.wikimedia.org/r/806206 [12:33:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:33:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29875 and previous config saved to /var/cache/conftool/dbconfig/20220616-123357-marostegui.json [12:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:03] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [12:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [12:36:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35890/console" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:37:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35891/console" [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede) [12:40:22] (03CR) 10Jbond: [V: 03+1 C: 03+1] "couple of nits but lgtm, please get a +1 from observability to check the prom file. 
Full output of which can be seen in the full diff" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:42:08] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:45:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede) [12:46:20] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede) [12:46:43] (03CR) 10Filippo Giunchedi: [C: 03+1] Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [12:48:39] (03PS12) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [12:48:49] (03PS27) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [12:52:35] (03CR) 10JMeybohm: [C: 03+2] Align cumin aliases for wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/790662 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:55:36] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [12:56:29] 10SRE-swift-storage, 10Commons: HTTP 503 Backend fetch failed while editing Commons - https://phabricator.wikimedia.org/T307338 (10Aklapper) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1300). 
[13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:01:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4004.ulsfo.wmnet [13:01:02] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-be [13:01:02] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=varnish-fe [13:01:03] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-tls [13:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:31] (03CR) 10JMeybohm: [C: 03+2] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:02:34] (03CR) 10JMeybohm: [C: 03+2] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [13:04:25] (03PS1) 10Jbond: wmflib::service: Reject empty string values [puppet] - 10https://gerrit.wikimedia.org/r/806208 [13:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29876 and previous config saved to /var/cache/conftool/dbconfig/20220616-130438-root.json [13:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:24] (03Merged) 10jenkins-bot: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [13:05:26] (03Merged) 10jenkins-bot: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 
(https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:06:42] (03CR) 10MVernon: "Looks reasonable to me, but I don't feel I know enough to weigh in on whether the validator should be pickier or not." [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [13:07:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4004.ulsfo.wmnet [13:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:43] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:09:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [13:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet [13:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:02] (03PS1) 10JMeybohm: Update misc-clusters/example.txt... [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 [13:15:28] (03CR) 10Ottomata: "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [13:19:18] (03PS1) 10Btullis: Add a sudo_user parameter to the hadoop fsimage freshness check [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649) [13:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29877 and previous config saved to /var/cache/conftool/dbconfig/20220616-131942-root.json [13:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35893/console" [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [13:21:47] !log jayme@cumin1001 START - Cookbook sre.misc-clusters.sretest rolling restart_daemons on A:sretest [13:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.misc-clusters.sretest (exit_code=0) rolling restart_daemons on A:sretest [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:52] (03PS20) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:24:03] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp1089.eqiad.wmnet [13:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] (03PS1) 10Zabe: imagemagick: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806216 (https://phabricator.wikimedia.org/T308013) [13:25:07] (03PS1) 10Zabe: php: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806217 
(https://phabricator.wikimedia.org/T308013) [13:25:09] (03PS1) 10Zabe: spamassassin: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013) [13:25:11] (03PS1) 10Zabe: tomcat: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013) [13:25:13] (03PS1) 10Zabe: vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) [13:25:45] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a sudo_user parameter to the hadoop fsimage freshness check [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [13:27:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:27:38] (03PS2) 10Zabe: vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) [13:30:34] (03CR) 10Itamar Givon: [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:31:45] (03PS1) 10Volans: MW DB user: update username to wikiuser [puppet] - 10https://gerrit.wikimedia.org/r/806221 [13:33:03] (03CR) 10Marostegui: [C: 03+1] MW DB user: update username to wikiuser [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans) [13:35:17] (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:35:47] (03PS1) 10Marostegui: mariadb: Change wikiuser user [software] - 10https://gerrit.wikimedia.org/r/806222 [13:36:57] (03CR) 
10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans) [13:37:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29878 and previous config saved to /var/cache/conftool/dbconfig/20220616-133446-root.json [13:37:58] (03PS21) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:07] (03PS2) 10Volans: MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221 [13:38:23] (03PS2) 10Marostegui: mariadb: Change wikiuser user [software] - 10https://gerrit.wikimedia.org/r/806222 [13:39:32] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:39] (03PS1) 10Volans: MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 [13:41:05] (03CR) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:41:15] (03PS22) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:44:01] (03CR) 10Jbond: "Sorry missed this one earlier" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:45:50] !log upload bird2_2.0.7-4.1wm1 to apt.wm.o (buster) - T310574 [13:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:54] T310574: 
Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [13:46:52] (03CR) 10Marostegui: [C: 03+1] MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans) [13:47:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm) [13:47:45] (03CR) 10JMeybohm: [C: 03+2] Update misc-clusters/example.txt... [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm) [13:49:37] (03CR) 10Jbond: [C: 03+1] "LGTM but see comment, no action needed but something to consider as we deploy" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29879 and previous config saved to /var/cache/conftool/dbconfig/20220616-134950-root.json [13:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:36] (03PS4) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [13:50:47] (03PS5) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [13:50:53] (03CR) 10Jbond: [C: 03+2] "thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [13:50:55] (03Merged) 10jenkins-bot: Update misc-clusters/example.txt... 
[cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm) [13:51:46] (03PS23) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:51:55] (03CR) 10Volans: [C: 03+2] MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans) [13:54:00] (03PS1) 10Jbond: log: stop suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 [13:54:22] (03Merged) 10jenkins-bot: MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans) [13:56:26] (03PS6) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [13:56:33] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond) [13:56:47] (03PS7) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [13:57:07] (03PS8) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [13:57:34] (03CR) 10Jbond: [C: 03+2] log: stop suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond) [13:57:46] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:18] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35894/console" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) 
[13:58:32] (03CR) 10Jbond: [C: 03+1] Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:58:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:14] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:59:28] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:01:08] !log volans@cumin1001 dbctl commit (dc=all): 'Doesn't have new wikiuser', diff saved to https://phabricator.wikimedia.org/P29880 and previous config saved to /var/cache/conftool/dbconfig/20220616-140107-volans.json [14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:19] (03CR) 10Ssingh: [C: 03+2] aptrepo: add repository component for bird2 [puppet] - 10https://gerrit.wikimedia.org/r/805448 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [14:02:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:30] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:03:58] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:04:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:10] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:24] (03CR) 10Jbond: [C: 03+1] MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans) [14:04:38] (03CR) 10Volans: [C: 03+2] MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans) [14:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29881 and previous config saved to /var/cache/conftool/dbconfig/20220616-140453-root.json [14:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29882 and previous config saved to /var/cache/conftool/dbconfig/20220616-140613-marostegui.json [14:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:18] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [14:06:57] (03CR) 10Btullis: [V: 03+1] Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:07:03] (03PS24) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [14:07:30] (03Merged) 10jenkins-bot: log: stop 
suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond) [14:09:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:17] (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [14:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29883 and previous config saved to /var/cache/conftool/dbconfig/20220616-142118-marostegui.json [14:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:36] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604927 seconds, message: migration to webperf2004, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:26:00] (03PS2) 10Muehlenhoff: Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) [14:27:04] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605108 seconds, message: migration to webperf1004, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P29884 and previous config saved to /var/cache/conftool/dbconfig/20220616-142923-ladsgroup.json [14:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:26] !log sukhe@puppetmaster1001 conftool 
action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-be [14:29:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=varnish-fe [14:29:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-tls [14:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29885 and previous config saved to /var/cache/conftool/dbconfig/20220616-143623-marostegui.json [14:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:43:28] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10wmde-team-a-tech, and 4 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10ItamarWMDE) [14:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P29886 and previous config saved to /var/cache/conftool/dbconfig/20220616-144427-ladsgroup.json [14:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:52] 10SRE, 10Data-Engineering, 10Traffic, 10Patch-For-Review, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I've run out of time to work on this for now, so I'm removing the #data-engineering-kanba... [14:45:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:51:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29887 and previous config saved to /var/cache/conftool/dbconfig/20220616-145128-marostegui.json [14:51:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:51:32] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [14:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29888 and previous config saved to /var/cache/conftool/dbconfig/20220616-145136-marostegui.json [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] (03CR) 10Majavah: [V: 03+1 C: 03+1] Add profile::mediawiki::sharded_periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [14:58:57] (03PS4) 10Majavah: Separate metricsinfra nodes from prometheus_nodes on cloud [puppet] - 10https://gerrit.wikimedia.org/r/795143 [14:59:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P29889 and previous config saved to /var/cache/conftool/dbconfig/20220616-145931-ladsgroup.json [14:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] (03PS2) 10Ayounsi: wmf-netbox: simplify interface description for circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) [15:00:53] (03CR) 10Majavah: [C: 03+1] Reenable U2F for now [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff) [15:00:56] (03CR) 10Muehlenhoff: cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [15:01:29] (03CR) 10Muehlenhoff: [C: 03+2] Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [15:02:04] (03CR) 10Ayounsi: "Example diff ran locally:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) (owner: 10Ayounsi) [15:03:53] (03CR) 
10Itamar Givon: [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [15:05:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:54] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:22] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:26] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:08:43] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/795143 (owner: 10Majavah) [15:10:58] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:11:08] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P29890 and previous config saved to /var/cache/conftool/dbconfig/20220616-151434-ladsgroup.json [15:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:10] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete. [15:16:15] (03PS1) 10Btullis: Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079) [15:22:29] (03CR) 10Btullis: [C: 03+2] Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [15:23:08] 10SRE, 10ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (10Papaul) 05Open→03Resolved a:03Papaul Icinga is show on green on the raid check ` MD RAID This service is currently in a period of scheduled downtime View Extra Service Notes OK 2022-06-16 15:16:31 2d 2h... 
[15:25:52] (03Merged) 10jenkins-bot: Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis) [15:26:18] (03CR) 10Btullis: [C: 03+2] Update the trafficserver rule for datahub [puppet] - 10https://gerrit.wikimedia.org/r/805331 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:27:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:38] (03PS4) 10Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) [15:28:39] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:04] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:30] PROBLEM - AQS root url on aqs2012 is CRITICAL: connect to address 10.192.48.189 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:29:32] PROBLEM - Check systemd state on aqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:36] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [15:30:02] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [15:30:04] PROBLEM - AQS root url on aqs2009 is CRITICAL: connect to address 10.192.48.186 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:30:18] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:22] PROBLEM - Check systemd state on aqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:30] PROBLEM - AQS root url on aqs2006 is CRITICAL: connect to address 10.192.16.168 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:30:34] PROBLEM - Check systemd state on aqs2010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:34] PROBLEM - AQS root url on aqs2005 is CRITICAL: connect to address 10.192.16.42 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:30:35] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [15:30:36] PROBLEM - Check systemd state on aqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:30] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [15:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:55] ---^ These AQS alerts should have been in downtime I believe, as that cluster on codfw is still being set up. I will check. 
[15:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29891 and previous config saved to /var/cache/conftool/dbconfig/20220616-153320-marostegui.json [15:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:25] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [15:35:04] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 269.39 ms [15:35:46] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 668.78 ms [15:38:24] PROBLEM - AQS root url on aqs2011 is CRITICAL: connect to address 10.192.48.188 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:38:26] (03Abandoned) 10Samtar: Raise $wgAutoblockExpiry from 1 day to 3 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479) (owner: 10Samtar) [15:42:04] (03PS1) 10Krinkle: Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747) [15:47:53] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10Papaul) 05Open→03Resolved i checked with @MatthewVernon on irc he said: ` yeah, that was a consequence of changing the SSDs in that box from RAID-0 to non-RAID, it's OK to close that task ` so we are good to clos... 
[15:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29892 and previous config saved to /var/cache/conftool/dbconfig/20220616-154825-marostegui.json [15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:25] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) a:05ayounsi→03RobH Ideally I'd like DCops to take care of link/interface level problems. I'm happy to help if needed though. [15:51:18] PROBLEM - Host lvs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:56:10] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:45] (03PS2) 10Cwhite: logstash: add test2 partition to ecs-test policy [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T310760) [16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:06] lvs2009 was me sorry about that [16:00:54] (03CR) 10Cwhite: [C: 03+2] "Thanks for catching that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite) [16:02:50] 10SRE, 10Analytics, 10Data-Engineering: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10odimitrijevic) [16:03:04] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29893 and previous config saved to /var/cache/conftool/dbconfig/20220616-160330-marostegui.json [16:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] RECOVERY - Host lvs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.55 ms [16:07:11] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) [16:07:31] 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) a:05Jelto→03None [16:08:57] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10JArguello-WMF) [16:13:45] (03CR) 10Jbond: Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [16:14:00] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) [16:18:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 
10Ssingh) [16:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29894 and previous config saved to /var/cache/conftool/dbconfig/20220616-161835-marostegui.json [16:18:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:18:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:42] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [16:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29895 and previous config saved to /var/cache/conftool/dbconfig/20220616-161844-marostegui.json [16:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] (03CR) 10Jbond: [C: 03+1] "LGTM, might be worth a pcc just to make sure" [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff) [16:19:47] (03CR) 10Jbond: [C: 03+1] cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [16:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:24:00] (03CR) 10Ayounsi: [C: 03+2] Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [16:28:57] (03PS1) 10Majavah: perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308) [16:29:18] (03CR) 10Majavah: [C: 03+2] perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308) (owner: 10Majavah) [16:30:35] (03Merged) 10jenkins-bot: perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308) (owner: 10Majavah) [16:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [16:37:20] (03CR) 10Ayounsi: Prometheus: gently pull Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [16:38:37] (03CR) 10Dave Pifke: [C: 03+1] "PCC failure was:" [puppet] - 10https://gerrit.wikimedia.org/r/804546 (owner: 10Muehlenhoff) [16:43:23] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Worked on the email draft with Arzhel and just emailed it in CC'd both Arzhel and Cathal. Once I have more info I'll update this ticket. 
[16:47:34] 10SRE-tools, 10Infrastructure-Foundations: Q3 2018/19 Goal: TEC6: Build automated workflows for server provisioning (Tracking Task) - https://phabricator.wikimedia.org/T213114 (10ayounsi) [16:47:40] 10SRE, 10Infrastructure-Foundations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) 05Open→03Resolved a:03ayounsi Finally time to close this task. We've added more things to Netbox since, but no need for a tracking task anymore. Tracking core sites pow... [16:52:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29896 and previous config saved to /var/cache/conftool/dbconfig/20220616-165210-marostegui.json [16:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:16] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [16:53:06] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:53:48] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:32] (03Abandoned) 10Marostegui: mariadb: Change wikiuser user [software] - 10https://gerrit.wikimedia.org/r/806222 (owner: 10Marostegui) [16:58:50] mr1-eqsin should be the scheduled maintenance of eqsin's provider for one power feed [17:00:05] brennen and thcipriani: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1700). [17:00:57] o/ - but this may or may not be happening. continuing to hold the time in case we figure out what we're doing. 
[17:02:03] (03PS1) 10Majavah: Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) [17:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29897 and previous config saved to /var/cache/conftool/dbconfig/20220616-170715-marostegui.json [17:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29898 and previous config saved to /var/cache/conftool/dbconfig/20220616-172220-marostegui.json [17:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:20] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:42] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [17:26:44] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [17:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:12] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: bug fix [17:27:13] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: bug fix [17:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:31] (03PS1) 10Dzahn: langlist: add blk, Pa'O language [dns] - 10https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) [17:31:24] (03PS1) 10Dzahn: langlist: add pcm, Nigerian Pidgin language [dns] - 10https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) [17:31:25] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix [17:31:27] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix [17:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29899 and previous config saved to /var/cache/conftool/dbconfig/20220616-173725-marostegui.json [17:37:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:37:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:37:31] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [17:37:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29900 and previous config saved to /var/cache/conftool/dbconfig/20220616-173738-marostegui.json [17:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "lol @ "gently pull". LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [17:41:00] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) [17:41:35] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Arelion support case 01418061​ to investigate things. I'll followup with them as they progress the case. 
[17:42:18] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab.wmfusercontent.org with reason: bug fix [17:42:20] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab.wmfusercontent.org with reason: bug fix [17:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:10] (03CR) 10Jdlrobson: [C: 03+1] Fix unsupported $wgLogos default configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: 10Thiemo Kreuz (WMDE)) [17:43:19] (03CR) 10Ayounsi: [C: 03+2] Prometheus: gently pull Netbox django metrics [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [17:46:13] !log starting phabricator deploy, momentary downtime expected while services restart [17:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:21] \o/ [17:51:13] (03CR) 10Cwhite: [C: 03+2] opensearch: ensure elasticsearch-curator on opensearch compatible fork [puppet] - 10https://gerrit.wikimedia.org/r/803587 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [17:52:45] (JobUnavailable) firing: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:54:07] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: Allow subdirs in image paths [puppet] - 10https://gerrit.wikimedia.org/r/805247 (https://phabricator.wikimedia.org/T310535) (owner: 10Brennen Bearnes) [17:58:44] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:59:14] !log end of phabricator 
deploy [17:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1800). [18:01:06] wheeeee [18:01:39] i'm gonna take 5 here, deploy a backport, then go ahead to all wikis. [18:02:45] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:05:36] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:06:40] what a schedule and timing there, brennen. kudos [18:06:52] :59 [18:06:56] :D [18:10:00] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [18:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29901 and previous config saved to /var/cache/conftool/dbconfig/20220616-181005-marostegui.json [18:10:07] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [18:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:10] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [18:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:16] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [18:10:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:20] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [18:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:18] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 217.14 ms [18:11:27] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [18:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:06] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.93 ms [18:12:12] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [18:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:36] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [18:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:19] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [18:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:00] (03CR) 10Brennen Bearnes: [C: 03+2] Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747) (owner: 10Krinkle) [18:17:36] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:25] (03PS1) 10AOkoth: install_server: remove gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/806273 (https://phabricator.wikimedia.org/T307142) [18:21:39] (03CR) 10Dzahn: [C: 03+2] install_server: remove gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/806273 
(https://phabricator.wikimedia.org/T307142) (owner: 10AOkoth) [18:25:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29902 and previous config saved to /var/cache/conftool/dbconfig/20220616-182510-marostegui.json [18:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:22] (03CR) 10Zabe: [C: 03+1] langlist: add blk, Pa'O language [dns] - 10https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: 10Dzahn) [18:25:42] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10SLyngshede-WMF) p:05Triage→03Medium [18:25:50] (03CR) 10Zabe: [C: 03+1] langlist: add pcm, Nigerian Pidgin language [dns] - 10https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: 10Dzahn) [18:26:46] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:27:36] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:29:30] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.eqiad.wmnet [18:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:07] (03Merged) 10jenkins-bot: Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747) (owner: 10Krinkle) [18:34:00] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.88 ms [18:38:02] (03PS1) 10AOkoth: install_server: remove gitlab-runner 2001 [puppet] - 10https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142) [18:38:12] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - 
https://phabricator.wikimedia.org/T303049 (10JMeybohm) 05Open→03Resolved All merged. Thanks! 🎉 [18:38:37] (03CR) 10Dzahn: [C: 03+1] install_server: remove gitlab-runner 2001 [puppet] - 10https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142) (owner: 10AOkoth) [18:39:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:39:44] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 217.20 ms [18:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29903 and previous config saved to /var/cache/conftool/dbconfig/20220616-184015-marostegui.json [18:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:33] oh thcipriani while I remember, ref T305191, the next training session it might be good to be walked through a deploy - can I get access to do that? 
[18:40:34] T305191: Deployment training request for TheresNoTime - https://phabricator.wikimedia.org/T305191 [18:40:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:42:06] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/CheckUser/src/Hooks.php: Backport: [[gerrit:806246|Only try to create User object if username is not null (T310747)]] (duration: 03m 23s) [18:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] T310747: TypeError: Argument 1 passed to MediaWiki\User\UserFactory::newFromName() must be of the type string, null given - https://phabricator.wikimedia.org/T310747 [18:44:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:44:27] !log train 1.39.0-wmf.16 (T308069): no current blockers - rolling to all wikis [18:44:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:33] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [18:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:07] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069) [18:45:09] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069) (owner: 10Brennen Bearnes) [18:45:11] (03CR) 
10AOkoth: [C: 03+2] install_server: remove gitlab-runner 2001 [puppet] - 10https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142) (owner: 10AOkoth) [18:45:53] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069) (owner: 10Brennen Bearnes) [18:48:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:57] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.16 refs T308069 [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:02] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [18:50:16] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:18] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:53:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.eqiad.wmnet [18:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:10] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.eqiad.wmnet [18:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:54:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:54:18] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29904 and previous config saved to /var/cache/conftool/dbconfig/20220616-185520-marostegui.json [18:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:24] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [18:57:22] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:30] (03PS1) 10AOkoth: site: remove old gitlab runners [puppet] - 10https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142) [18:57:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:13] (03CR) 10Dzahn: [C: 03+1] "lgtm but only merge after cookbook is done" [puppet] - 10https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142) (owner: 10AOkoth) [19:00:54] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:00:55] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab-runner1001.eqiad.wmnet [19:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:53] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab-runner2001.codfw.wmnet [19:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:28] 10SRE: DNS cookbook failed syncing with netbox - 403 from netbox1002 - 
https://phabricator.wikimedia.org/T310831 (10Dzahn) [19:11:16] 10SRE: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (10Dzahn) [19:11:46] 10SRE, 10Infrastructure-Foundations, 10netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (10Dzahn) [19:19:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:11] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [19:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:24] 10SRE, 10Infrastructure-Foundations, 10netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (10Dzahn) After this I ran only the DNS cookbook directly and this time it finished without such an error. I am not sure if it tried though because it said... 
[19:24:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.544 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:03] (PS1) JMeybohm: Allow to dry-run SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/806285 [19:26:05] (PS1) JMeybohm: SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286 [19:26:07] (PS1) JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) [19:26:09] (PS1) JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) [19:30:54] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:29] (CR) RhinosF1: [C: +1] langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: Dzahn) [19:36:50] (CR) RhinosF1: [C: +1] langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: Dzahn) [19:38:39] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn) The run of the decom book was at: `2022-06-16 18:54:09,812 dzahn 2165070 [INFO] START - Cookbook sre.hosts.decommission for hosts gitlab-runner10... 
[19:39:38] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:39:39] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner2001.codfw.wmnet [19:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:56] (CR) DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [19:49:07] mutante: +1'd both I saw in my email [19:52:16] RhinosF1: thank you, ACK [19:53:48] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Arnoldokoth) fatal: unable to access 'https://netbox1002.eqiad.wmnet/dns.git/': The requested URL returned error: 403 0.0% (0/1) success ratio (< 100.0%... [19:56:26] SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (thcipriani) >>! In T309375#7963808, @Dzahn wrote: > checked off boxes (L3 signed, NDA, has existing shell access, etc). > > > Will need approval from group approver (Tyler). @hashar and... [19:58:39] (PS4) DannyS712: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 [19:58:43] (CR) DannyS712: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712) [20:00:05] brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T2000). [20:00:05] cjming: A patch you scheduled for UTC late backport and config training is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:00] (03PS5) 10DannyS712: CommonSettings: clean up and simplify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 [20:01:36] brennen also me, added my patch a few seconds too late [20:01:53] (i.e. I also have patches scheduled for the deployment window) [20:01:59] tsk /j [20:03:35] o/ [20:03:37] i'll deploy [20:04:41] (03PS3) 10DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) [20:05:03] (03CR) 10Clare Ming: [C: 03+2] Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: 10Clare Ming) [20:06:26] (03Merged) 10jenkins-bot: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: 10Clare Ming) [20:10:38] (03PS4) 10Clare Ming: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:11:16] hi DannyS712: doing your patches here next [20:12:14] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:805179|Turn off TOC A/B test for pilot wikis (T309683)]] (duration: 03m 37s) [20:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:18] T309683: Turn off table of contents A/B test - https://phabricator.wikimedia.org/T309683 [20:13:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] (03CR) 10Clare Ming: [C: 03+2] phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:14:49] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:14:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:59] ^ another patch I'm going to add for the current window [20:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:14] (03Merged) 10jenkins-bot: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:16:00] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:46] DannyS712: going to sync your 1st patch since it's comments [20:19:23] okay, then I have two more that aren't comments [20:19:39] sorry, 3 more [20:20:39] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:20:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:23:30] !log cjming@deploy1002 Synchronized wmf-config/: Config: [[gerrit:805432|phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (T171115)]] (duration: 03m 22s) [20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:35] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:23:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:26] (03PS6) 10Clare Ming: CommonSettings: clean up and simplify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 (owner: 10DannyS712) [20:24:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:24:41] DannyS712: we might run out of time in this window, but we're looking. Scap is a little bit slower since we're doing PHP restarts for every deploy. 
[20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:01] okay, no rush [20:25:05] <3 [20:25:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:23] !log cjming@deploy1002 Synchronized phpcs.xml: Config: [[gerrit:805432|phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (T171115)]] (duration: 03m 27s) [20:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:45] (03PS1) 10Dzahn: Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - 10https://gerrit.wikimedia.org/r/806250 [20:31:27] (03PS2) 10Dzahn: Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - 10https://gerrit.wikimedia.org/r/806250 [20:32:04] (03CR) 10Thcipriani: [C: 03+2] phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:33:40] (03Merged) 10jenkins-bot: phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:34:43] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - 10https://gerrit.wikimedia.org/r/806250 (owner: 10Dzahn) [20:34:49] DannyS712: I fetched your function rename one down to mwdebug1002, if you want to take a look [20:35:13] not really sure where I can test that [20:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [20:36:17] so what should I do? [20:37:30] How do you mean? You're not sure what to test there? Or not sure what part of the front end exercises that since it happens before hitting mwcore? [20:37:57] both - not sure what to test or where to test it [20:38:05] (03CR) 10JHathaway: [C: 03+2] exim: update comment on BDAT issue [puppet] - 10https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [20:38:12] (03PS3) 10JHathaway: exim: update comment on BDAT issue [puppet] - 10https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) [20:38:28] (03CR) 10JHathaway: [V: 03+2] exim: update comment on BDAT issue [puppet] - 10https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [20:39:58] DannyS712: ok, I'll make sure nothing explodes on the backend and sync [20:40:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:41:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:01] !log thcipriani@deploy1002 Started scap: Config: [[gerrit:806248|phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions (T171115)]] [20:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:04] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:42:08] (03Abandoned) 10BCornwall: Traffic: 
Port IPsec/Strongswan connection alert [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [20:42:33] (03CR) 10Thcipriani: [C: 03+2] MWRealm.php: remove unused getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:42:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:17] (03Merged) 10jenkins-bot: MWRealm.php: remove unused getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [20:43:33] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10jhathaway) [20:44:18] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) 05Open→03Resolved This has now been fixed upstream, https://git.exim.org/exim.git/commit/462e2cd30. We w... [20:45:28] 10SRE, 10Infrastructure-Foundations, 10Mail: Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (10jhathaway) [20:45:46] (03CR) 10Cwhite: "scap.announce will be the first stream to go to Loki if releng approves." [puppet] - 10https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:46:04] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10jhathaway) [20:46:14] 10SRE, 10Infrastructure-Foundations, 10Mail: Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (10jhathaway) 05Open→03Stalled This is stalled until 4.96 is available in Debian. 
[20:47:44] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:20] thcipriani will https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805433 be included in this window? 
[20:51:42] CommonSettings cleanup [20:52:20] (CR) Thcipriani: [C: +2] CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712) [20:52:24] DannyS712: sure :) [20:53:53] (PS1) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) [20:54:01] (Merged) jenkins-bot: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712) [20:54:58] (CR) CI reject: [V: -1] gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [20:55:15] (PS2) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) [20:56:56] (PS1) BCornwall: Traffic: Port over purged lag/queue monitors [alerts] - https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723) [20:58:35] (PS1) Hnowlan: Port Dockerfile to use buster [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/806333 [20:58:59] !log thcipriani@deploy1002 Finished scap: Config: [[gerrit:806248|phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions (T171115)]] (duration: 16m 57s) [20:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:03] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [20:59:10] ok, well that's syncd [20:59:28] still need to sync CommonSettings though, right? 
[20:59:40] SRE, Beta-Cluster-Infrastructure, Scap, serviceops, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle) [21:00:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:06] (CR) BCornwall: Traffic: Port IPsec/Strongswan connection alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [21:00:20] DannyS712: yeah, we still have your last two to go, does that sound right to you? [21:00:45] they're both on mwdebug1002 if there's anything to check for either of those [21:01:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:01:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:40] nothing to check for removing an unused function, and the common settings should be a no-op, so I think should be good to sync [21:02:13] OK: syncing mwrealm.php then commonsettings.php [21:02:40] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:00] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:12] 
ACKNOWLEDGEMENT - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:06:29] !log thcipriani@deploy1002 Synchronized multiversion/MWRealm.php: Config: [[gerrit:806249|MWRealm.php: remove unused getRealmSpecificFilename() (T171115)]] (duration: 03m 35s) [21:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:34] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [21:06:37] next one going live now [21:07:42] (CR) CI reject: [V: -1] Port Dockerfile to use buster [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/806333 (owner: Hnowlan) [21:10:55] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:805433|CommonSettings: clean up and simplify some code]] (duration: 03m 42s) [21:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:07] ^ DannyS712 all done! [21:11:55] thanks for the deployments! 
[21:12:13] sure thing, thanks for making code better :) [21:12:35] (PS1) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) [21:13:37] (CR) CI reject: [V: -1] docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [21:14:09] !log thcipriani@deploy1002 Started scap: noop test [21:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:42] (PS2) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) [21:18:16] !log thcipriani@deploy1002 Finished scap: noop test (duration: 04m 07s) [21:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:42] (CR) Dzahn: "arr.. Could not find resource 'Service[docker]'" [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [21:18:50] (CR) Dzahn: [C: -1] docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [21:19:31] brennen: no-op was 4m07s so there was some unsync'd localization change lurking somewhere [21:19:49] yeah, makes sense. [21:19:55] which is...kinda worrisome [21:20:16] possibly we oughta make scap say _why_ it's doing a cdb rebuild [21:22:53] huh [21:23:30] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:10] brennen: Scap doesn't know. It's the mediawiki maintenance script that does the deed. 
[21:33:40] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:26] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:18] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:07] ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:07] ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:07] ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:04] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:58] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:38] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[21:47:04] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:22] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:53:52] (PS3) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[22:03:00] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:07:32] SRE, Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (ZnashBR) >>! In T310761#8008054, @Aklapper wrote: > @ZnashBR: Hi and welcome! Can you please elaborate on the Wikimedia Affiliate supporting project and who you have been in contact with? Sorry, i'm n...
[22:09:41] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) p:Triage→High It seems the vhost has changed: ` root@netbox2002:~# runuser -u netbox -- git -C "/srv/netbox-exports/dns.git" fetch -v ne...
[22:19:50] (CR) Dzahn: "might have caused https://phabricator.wikimedia.org/T310831" [puppet] - https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[22:20:36] (PS1) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251
[22:20:55] (PS2) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831)
[22:22:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:50] (CR) CI reject: [V: -1] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:24:25] (PS3) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831)
[22:26:24] SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (thcipriani) Stalled→Open > - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml Approved! @TheresNoTime is attending [[...
[22:26:50] (CR) Ayounsi: [C: +1] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:27:36] (CR) Volans: [C: +2] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:31:29] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Run puppet on both netbox hosts (1002/2002)
[22:32:02] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:33:17] !log volans@cumin2002 START - Cookbook sre.dns.netbox
[22:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:24] (PS1) Cwhite: logstash: duplicate alert logs for loki target [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826)
[22:36:10] (PS3) Cwhite: logstash: duplicate scap.announce logs for loki target [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826)
[22:37:04] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:55] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Run dns cookbook to force sync the data everywhere (the last couple of commits where not deployed to the authdns hosts). The procedure is describ...
[22:39:02] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Open→Resolved a:Volans This should have fixed the issue. I'm resolving it, but feel free to re-open in case it's not fully solved.
[22:39:36] (PS1) Andrew Bogott: haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - https://gerrit.wikimedia.org/r/806350
[22:41:20] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:29] SRE, MediaWiki-General, Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (Krinkle) This reminds me of T140664, which is a proposal from a few years ago going in a similar direction (albeit for a different use case). In any event, establishing such a router wi...
[22:49:34] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:50:56] SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn) Thank you for the very quick response!
[22:52:45] (CR) Dzahn: [C: +2] langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: Dzahn)
[22:53:10] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:56:05] (CR) Dzahn: [C: +2] langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: Dzahn)
[22:59:10] !log new Wikipedia languages added to DNS: blk = https://en.wikipedia.org/wiki/Pa%27O_language | pcm = https://en.wikipedia.org/wiki/Nigerian_Pidgin
[22:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:50] SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) a:RobH→ayounsi @ayounsi, So as you can see they advised they want us to go and investigate the cross-connect, and if they result in charges we'll use that thread to get a credit on our Arelion...
[23:04:16] (PS2) Tim Starling: Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[23:08:34] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:16] (CR) Tim Starling: "Really needs a +1 from Tyler." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[23:18:51] SRE, Data-Engineering-Icebox, Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (odimitrijevic)
[23:34:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:45] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye
[23:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:52] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye
[23:38:03] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:16] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[23:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:22] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye
[23:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:27] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye completed: - aqs1016 (**WARN**)...