[00:01:01] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:05:27] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-05-10 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:17:51] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:20:53] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:08] (03PS4) 10Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) [00:26:10] (03PS1) 10Andrew Bogott: wmcs-image-create: add a few missing args to create_puppetized_vm() [puppet] - 10https://gerrit.wikimedia.org/r/792746 [00:27:02] (03CR) 10Andrew Bogott: [C: 04-1] "This doesn't help -- I must be missing something." 
[puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) (owner: 10Andrew Bogott) [00:27:59] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than 8 days ago: Most recent backup 2022-05-10 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:28:19] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: add a few missing args to create_puppetized_vm() [puppet] - 10https://gerrit.wikimedia.org/r/792746 (owner: 10Andrew Bogott) [00:31:10] (03PS1) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [00:31:21] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:31] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-05-17 00:00:01 (3049 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:40:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:54] (03PS1) 10Ssingh: certspotter: remove CT log returning 502 [puppet] - 10https://gerrit.wikimedia.org/r/792751 [00:46:44] (03CR) 10Ssingh: [C: 03+2] certspotter: remove CT log returning 502 [puppet] - 10https://gerrit.wikimedia.org/r/792751 (owner: 10Ssingh) [00:59:05] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-05-17 00:00:02 (3049 GiB, +0.8 %) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:17:08] (03PS1) 10Stang: zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) [01:38:12] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) @Dzahn re-imaging will be the easier option. Thanks [01:43:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:05:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:22:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:45:25] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10RKemper) [04:47:47] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - 
https://phabricator.wikimedia.org/T308331 (10RKemper) `elastic1065` is ready to be removed whenever. Search team doesn't require any advance notice, since I've already depooled & banned the host in preparation. Please let me/us know wh... [04:53:25] (03CR) 10Marostegui: [C: 03+1] db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [04:53:57] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [04:54:15] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [04:56:50] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RKemper) For `elastic1059` we just need some advance notice (~24h) so we can depool & ban [at the elasticsearch cluster level] the host. I'd do that now but it's not clear from the ticket if... [04:58:18] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Marostegui) Verified that Mazevedo has a WMF email assigned. [05:06:59] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Marostegui) @Mazevedo are you sure you already have shell access? I am not able to find your entry. 
[05:07:35] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Marostegui) p:05Triage→03Medium [05:07:38] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Marostegui) p:05Triage→03Medium [05:10:21] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Marostegui) Verified the user as WMF employee. [05:13:47] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Marostegui) 05Open→03Resolved a:03Marostegui @Tsevener I have added you to the WMF ldap group (you were not there), please test again and reopen if you still find issues. Added also to Phabric... [05:19:03] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:53] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:45] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:31] 10SRE: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10Marostegui) @MoritzMuehlenhoff can we close this? 
[06:46:19] 10SRE, 10Recommendation-API, 10serviceops: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (10Marostegui) 05Open→03Resolved I am going to close this, it's been 1.5y and it is of course impossible to troubleshoot this specific issue anymore [06:47:32] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10Marostegui) @MoritzMuehlenhoff can we close this? [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T0700). [07:00:06] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:42] o/ [07:00:51] I can deploy [07:00:57] Go ahead dcausse :) [07:01:25] 10SRE: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Done! 
[07:01:25] (03CR) 10DCausse: [C: 03+2] haslicense: Apply minimum_should_match for elastic 7.x [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792650 (https://phabricator.wikimedia.org/T288765) (owner: 10Ebernhardson) [07:01:39] (03CR) 10DCausse: [C: 03+2] Resolve minimum_should_match warnings during random scoring [extensions/CirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792649 (https://phabricator.wikimedia.org/T288765) (owner: 10Ebernhardson) [07:01:41] (03PS1) 10Giuseppe Lavagetto: envoyproxy: add spdx license headers [puppet] - 10https://gerrit.wikimedia.org/r/792969 [07:01:53] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [07:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:22] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) 05Open→03Resolved All done [07:06:30] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10elukey) [07:07:08] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10elukey) 05Stalled→03Open a:05elukey→03None [07:08:41] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 2 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10Majavah) [07:09:16] 
10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10elukey) Resetting the task to open, since I think that Kevin and Aiko should end up in the `deployment` group. They... [07:09:38] (03PS2) 10Elukey: Add Aiko and Kevin to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927) [07:11:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I'll merge this some time this week." [puppet] - 10https://gerrit.wikimedia.org/r/792692 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:12:25] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:14:15] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2022-08-09 06:51:41 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [07:14:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet [07:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:50] !log Cold reset wtp1045.mgmt ipmi [07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:06] (03Merged) 10jenkins-bot: haslicense: Apply minimum_should_match for elastic 7.x [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792650 (https://phabricator.wikimedia.org/T288765) (owner: 10Ebernhardson) [07:19:10] (03Merged) 10jenkins-bot: Resolve minimum_should_match warnings during random scoring [extensions/CirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792649 (https://phabricator.wikimedia.org/T288765) (owner: 10Ebernhardson) [07:22:43] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:23:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet [07:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:53] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [07:23:53] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:03] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:26:57] (03PS2) 10Giuseppe Lavagetto: varnish: annotate X-Analytics header with matching requestctl actions [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) [07:26:59] (03PS2) 10Giuseppe Lavagetto: varnish: set retry-after based on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) [07:27:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:31] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [07:28:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:28:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:45] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Urbanecm) [07:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:16] !log Restarting CI Jenkins [07:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:51] !log dcausse@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/CirrusSearch/includes/Query/FullTextSimpleMatchQueryBuilder.php: Backport: [[gerrit:792649|Resolve minimum_should_match warnings during random scoring (T288765)]] (duration: 00m 56s) [07:32:55] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:57] T288765: Always provide minimum_should_match in bool queries - https://phabricator.wikimedia.org/T288765 [07:34:50] !log dcausse@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/WikibaseCirrusSearch/src/Query/HasLicenseFeature.php: Backport: [[gerrit:792650|haslicense: Apply minimum_should_match for elastic 7.x (T288765)]] (duration: 00m 52s) [07:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:30] !log closing UTC morning backport window [07:36:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:04] (03PS1) 10Elukey: Add comments related to kubelet labels used in ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) [07:40:40] (03PS2) 10Elukey: Add comments related to kubelet labels used in ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) [07:41:48] !log imported jenkins 2.332.3 to thirdparty/ci for buster-wikimedia [07:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:21] (03CR) 10JMeybohm: "IIRC you don't need to have them commented out. kubelet will not complain on restarts if the labels are set at the API objects already" [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [07:50:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:50:42] (03PS1) 10Stang: zhwikisource: Adjust workmark size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792971 (https://phabricator.wikimedia.org/T308620) [07:52:09] jouncebot: now [07:52:09] For the next 0 hour(s) and 7 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T0700) [07:52:49] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [07:54:06] !log Restarting CI Jenkins [07:54:07] PROBLEM - HTTP releases-jenkins.wikimedia.org on releases1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 
Service Unavailable - 553 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [07:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:29] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: jenkins.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:55:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2002:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:55:27] (03CR) 10Elukey: Add comments related to kubelet labels used in ml-serve clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [07:55:37] (03PS1) 10Jcrespo: dbbackups: Delay bacula run for es dbs from 24 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/792972 (https://phabricator.wikimedia.org/T298120) [07:55:50] damn I broke Jenkins [07:55:51] (03PS1) 10Stang: zhwikiquote: Add logo variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792973 (https://phabricator.wikimedia.org/T308620) [07:55:59] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [07:56:06] (03CR) 10Stang: "Similar to Id4d7da86bfbfe0f115469d09fbbbac27e8aed663" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792973 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [07:56:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:56:15] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298560)', diff saved to https://phabricator.wikimedia.org/P27898 and previous config saved to /var/cache/conftool/dbconfig/20220518-075620-ladsgroup.json [07:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:25] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [07:58:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:58:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27900 and previous config saved to /var/cache/conftool/dbconfig/20220518-075826-ladsgroup.json [07:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:30] (03PS3) 10Elukey: Add kubelet labels used in ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) [07:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:32] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:59:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4003.ulsfo.wmnet with OS bullseye [07:59:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:52] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4003.ulsfo.wmnet with OS bullseye [08:00:05] jnuche and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T0800). [08:00:31] RECOVERY - jenkins_service_running on contint2001 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [08:01:12] (03PS1) 10Hashar: jenkins: update path to war file [puppet] - 10https://gerrit.wikimedia.org/r/792974 (https://phabricator.wikimedia.org/T307339) [08:01:14] hi, I will roll out to group1 in the next 10 mins or so [08:01:37] RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [08:02:09] RECOVERY - HTTP releases-jenkins.wikimedia.org on releases1002 is OK: HTTP OK: HTTP/1.1 200 OK - 2988 bytes in 0.974 second response time https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org%23Jenkins [08:02:15] (03PS1) 10Slyngshede: Modernize aptrepo module. 
[puppet] - 10https://gerrit.wikimedia.org/r/792975 [08:02:29] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2056.codfw.wmnet with OS bullseye [08:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:56] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2056.codfw.wmnet with OS bullseye [08:03:09] page [08:03:12] sorry [08:03:17] (03PS4) 10Elukey: Add kubelet labels used in ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) [08:03:20] forgot to strg+f [08:03:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:03:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298555)', diff saved to https://phabricator.wikimedia.org/P27902 and previous config saved to /var/cache/conftool/dbconfig/20220518-080339-ladsgroup.json [08:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T303603)', diff saved to 
https://phabricator.wikimedia.org/P27903 and previous config saved to /var/cache/conftool/dbconfig/20220518-080347-ladsgroup.json [08:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:52] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:04:15] (03CR) 10Elukey: "Tobias: I removed the ml-staging config since we are building the cluster and I am not sure if kube_env etc.. is already set up to add the" [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [08:06:37] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Doing): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10Aklapper) Updated https://wikitech.wikimedia.org/w/index.php?title=Deployments%2FBlocking_tasks&type=revision&diff=1981165&oldid=1896865 [08:09:47] (03PS1) 10Jaime Nuche: group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792976 [08:09:51] (03CR) 10Jaime Nuche: [C: 03+2] group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792976 (owner: 10Jaime Nuche) [08:11:13] !log upgrading ganeti packages in eqsin to Ganeti 3.0 T308211 [08:11:15] !log Jenkins CI is down, can't connect to the agents [08:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:19] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 [08:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:12:00] I am digging into it :( [08:12:07] [05/18/22 08:06:32] [SSH] Copying latest remoting.jar... 
[08:12:07] java.io.IOException: Could not copy remoting.jar into '/srv/jenkins/workspace' on agent [08:12:43] hashar: just saw the messages about Jenkins, I'll cancel the promote for now [08:12:45] !log jnuche@deploy1002 deploy-promote aborted: (duration: 03m 02s) [08:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:15] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:16:01] somehow Jenkins is unable to copy the `remoting.jar` file to the hosts :-( [08:16:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4003.ulsfo.wmnet with reason: host reimage [08:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:32] gotta delete them manually [08:16:39] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:07] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2056.codfw.wmnet with reason: host reimage [08:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:24] (03CR) 10Jbond: [C: 03+1] "LGTM, minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:18:04] hashar: no indication why it can't copy over? 
that's annoying :( [08:18:27] (03CR) 10Majavah: [V: 03+1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27904 and previous config saved to /var/cache/conftool/dbconfig/20220518-081852-ladsgroup.json [08:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:13] (03CR) 10David Caro: [C: 03+1] "Just questions, I'll leave for others to give the final +2" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:19:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4003.ulsfo.wmnet with reason: host reimage [08:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:26] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Delay bacula run for es dbs from 24 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/792972 (https://phabricator.wikimedia.org/T298120) (owner: 10Jcrespo) [08:20:29] (03CR) 10jerkins-bot: [V: 04-1] Modernize aptrepo module. 
[puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [08:20:39] (03PS2) 10Ladsgroup: wmfmariadbpy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:20:45] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792976 (owner: 10Jaime Nuche) [08:20:49] (03CR) 10Ladsgroup: [C: 03+2] wmfmariadbpy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:20:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmfmariadbpy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:21:56] jnuche: some of CI is back up, sorry for the mess [08:22:04] (03PS2) 10Slyngshede: Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 [08:22:07] (03CR) 10David Caro: "Got a question there, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:22:30] hashar: no worries at all, it happens. 
Thanks :) [08:22:42] (03PS2) 10Jcrespo: dbbackups: Delay bacula run for es dbs from 24 to 48 hours [puppet] - 10https://gerrit.wikimedia.org/r/792972 (https://phabricator.wikimedia.org/T298120) [08:22:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2056.codfw.wmnet with reason: host reimage [08:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:59] continuing with the promote now [08:23:29] (03CR) 10Majavah: [V: 03+1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:23:31] (03PS1) 10Jaime Nuche: group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792978 [08:23:33] (03CR) 10Jaime Nuche: [C: 03+2] group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792978 (owner: 10Jaime Nuche) [08:24:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35322/console" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [08:24:22] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792978 (owner: 10Jaime Nuche) [08:24:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [08:24:55] !log CI Jenkins hosts are all back and operational [08:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:25:15] (03CR) 10Jbond: [C: 03+2] "going to be bold and merge seems questions are addresses" [puppet] - 10https://gerrit.wikimedia.org/r/792700 
(https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:32] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:25:43] !log sudo gnt-cluster upgrade --to 3.0 for ganeti/eqsin T308211 [08:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:48] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 [08:26:06] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.12 refs T305218 [08:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:11] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218 [08:26:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:26:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:57] !log drain ganeti5002 T308211 [08:26:59] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.12 refs T305218 (duration: 00m 53s) [08:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:59] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) 05Resolved→03Open Re-opening because if there is no technical blocker for having the AAAA records on those hosts a... 
[08:29:05] jnuche: hashar the error " InvalidArgumentException: Missing field: page_restrictions" is sorta me, the backward compatibility of cache values between mw versions is missing but it should recover on its own, do you need an action from me? [08:29:59] no idea, I am still fixing up the CI Jenkins :\ [08:30:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298555)', diff saved to https://phabricator.wikimedia.org/P27905 and previous config saved to /var/cache/conftool/dbconfig/20220518-083022-ladsgroup.json [08:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:28] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:30:42] amir1: thanks for the headsup, I guess let's see if they do go down [08:31:23] (03CR) 10Jcrespo: [C: 03+1] "This is ready to deploy https://puppet-compiler.wmflabs.org/pcc-worker1002/35323/backup1001.eqiad.wmnet/index.html , but waiting until tom" [puppet] - 10https://gerrit.wikimedia.org/r/792972 (https://phabricator.wikimedia.org/T298120) (owner: 10Jcrespo) [08:31:50] (03CR) 10Majavah: [V: 03+1] base::firewall: migrate to nrpe::plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:31:57] (03CR) 10Hashar: "That caused our custom systemd unit to fail miserably after the jenkins.deb package moved the .war to a different path." 
[puppet] - 10https://gerrit.wikimedia.org/r/792974 (https://phabricator.wikimedia.org/T307339) (owner: 10Hashar) [08:33:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4003.ulsfo.wmnet with OS bullseye [08:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:50] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4003.ulsfo.wmnet with OS bullseye completed: - ganeti4003 (**PASS**) - Downtimed on Icinga/Aler... [08:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27906 and previous config saved to /var/cache/conftool/dbconfig/20220518-083357-ladsgroup.json [08:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:51] hashar: we have a big spike of this: https://logstash.wikimedia.org/goto/da90353f798d875a53863ee53d58b897 [08:34:56] I think I'll roll back [08:35:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:04] jnuche: this is not me but let me take a look and see if I can fix it [08:36:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:36:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [08:36:27] (03CR) 10Vgutierrez: [C: 03+1] varnish: set retry-after based 
on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto) [08:36:27] jnuche: if you create a ticket in the meantime, it'd be amazing [08:37:05] amir1: will wait a bit for the rollback then and create the ticket, thanks! [08:38:13] Jenkins back up. I am catching up with the train stuff above [08:38:41] (03CR) 10David Caro: [C: 03+1] base::firewall: migrate to nrpe::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:38:43] (03PS2) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [08:39:31] (03CR) 10David Caro: [C: 03+1] base::firewall: migrate to nrpe::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:39:44] (03PS3) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [08:39:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:38] dcausse: we have hit ` Error: Call to undefined method CirrusSearch\Connection::getPageType()` :\ https://logstash.wikimedia.org/goto/da90353f798d875a53863ee53d58b897 [08:40:54] dcausse: it's caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/790734/7/includes/Connection.php [08:40:54] jnuche: +1 on rolling back [08:41:03] hm looking [08:41:20] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia [08:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:40] It seems GeoData is relying on 
that [08:41:51] dcausse: /srv/mediawiki/php-1.39.0-wmf.12/extensions/GeoData/includes/Searcher.php(51) [08:41:56] (03CR) 10David Caro: [C: 03+1] base::firewall: migrate to nrpe::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:42:00] hashar: creating the ticket, will roll back after that [08:42:17] (03PS1) 10Giuseppe Lavagetto: scap: do not restart jobrunners on deployment [puppet] - 10https://gerrit.wikimedia.org/r/792980 (https://phabricator.wikimedia.org/T266055) [08:42:19] (03PS1) 10Giuseppe Lavagetto: scap: enable restarting php-fpm on deployment [puppet] - 10https://gerrit.wikimedia.org/r/792981 (https://phabricator.wikimedia.org/T266055) [08:42:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: check opcache revalidation in restart script [puppet] - 10https://gerrit.wikimedia.org/r/792982 (https://phabricator.wikimedia.org/T266055) [08:42:23] (03PS1) 10Giuseppe Lavagetto: mediawiki_canaries: disable opcache revalidation [puppet] - 10https://gerrit.wikimedia.org/r/792983 (https://phabricator.wikimedia.org/T266055) [08:42:25] (03PS1) 10Giuseppe Lavagetto: mediawiki: disable revalidation everywhere [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) [08:42:40] I am wondering how phan did not catch it :D [08:42:53] I assume CI dependency is missing [08:43:00] oh [08:43:02] should have been handled by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/790732 [08:43:06] hm... 
[08:43:21] dcausse: merged after branch cut I think :D [08:43:22] so yeah the method got removed from CirrusSearch but the phan job does not test other extensions which might be using it [08:43:24] like GeoData [08:43:33] (03PS1) 10Ladsgroup: Remove reference to Elastica\Type [extensions/GeoData] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792652 (https://phabricator.wikimedia.org/T308044) [08:43:34] we would need some kind of full analysis job [08:43:37] dependencies are hard [08:43:39] there was a depends-on on the cirrus patch [08:43:40] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/792652 [08:43:52] This should fix the issue [08:43:57] sure [08:44:02] Shall I backport hashar jnuche ? [08:44:07] (03CR) 10DCausse: [C: 03+1] Remove reference to Elastica\Type [extensions/GeoData] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792652 (https://phabricator.wikimedia.org/T308044) (owner: 10Ladsgroup) [08:44:12] (03CR) 10Ladsgroup: [C: 03+2] Remove reference to Elastica\Type [extensions/GeoData] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792652 (https://phabricator.wikimedia.org/T308044) (owner: 10Ladsgroup) [08:44:18] and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/790732 did not make it in the wmf branch :-\ [08:44:34] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:44:37] the backport is being deployed now [08:44:38] amir1: I think it's worth a try, hashar? 
[08:44:45] +1 [08:45:12] there is definitely a breakage in the patches lifecycle :D [08:45:27] (03CR) 10Muehlenhoff: "LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/792974 (https://phabricator.wikimedia.org/T307339) (owner: 10Hashar) [08:45:29] (03CR) 10Muehlenhoff: [C: 03+2] jenkins: update path to war file [puppet] - 10https://gerrit.wikimedia.org/r/792974 (https://phabricator.wikimedia.org/T307339) (owner: 10Hashar) [08:45:29] as to how we could potentially have caught it with CI, I am not sure really [08:45:42] trying to understand how it could have happened so that I don't make the mistake again [08:46:13] I am guessing patches made to CirrusSearch should be tested with all extensions that depend on it [08:46:32] dcausse: one random note is to make sure patches don't fall between branch cut (Tuesday early morning UTC) [08:46:59] when I'm planning to merge a chain, I usually merge them all before or after that time [08:47:05] which apparently is the case (CirrusSearch depends on Geodata and Geodata depends on CirrusSearch). So I am guessing that specific code does not have a Phpunit test [08:47:10] or that worked on master [08:47:14] ticket created, we can close it right away if the backport works: https://phabricator.wikimedia.org/T308640 [08:47:16] but we have cut the wmf branch in between the two patches [08:47:48] (03PS3) 10Slyngshede: Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 [08:47:54] oh I forgot Depends-On on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/790734 [08:48:10] so yes I might include GeoData in cirrus phan analysis at least [08:48:18] (03PS1) 10Stang: zhwikiversity: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792985 (https://phabricator.wikimedia.org/T308620) [08:48:19] sorry about that!
[08:49:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27907 and previous config saved to /var/cache/conftool/dbconfig/20220518-084902-ladsgroup.json [08:49:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:49:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:49:06] it happens :D [08:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:09] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27908 and previous config saved to /var/cache/conftool/dbconfig/20220518-084910-ladsgroup.json [08:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:23] thanks for fixing my mess :) [08:49:42] (03PS1) 10Ayounsi: Network report: MTU check [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792986 [08:50:40] (03CR) 10jerkins-bot: [V: 04-1] Modernize aptrepo module. 
[puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [08:50:47] (03PS1) 10Jcrespo: dbbackups: Copy database check password in anticipation of role move [labs/private] - 10https://gerrit.wikimedia.org/r/792987 (https://phabricator.wikimedia.org/T283017) [08:51:00] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 826 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:51:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4003.ulsfo.wmnet [08:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:28] nine minutes left [08:51:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [08:51:34] I am seeing stuff going on on s7 [08:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:49] marostegui: I'm not doing anything on s7 [08:51:52] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Copy database check password in anticipation of role move [labs/private] - 10https://gerrit.wikimedia.org/r/792987 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [08:51:56] did you get paged for mw errors? 
[08:52:00] nop [08:52:06] but I am seeing lots of errors on logstash [08:52:15] (03CR) 10Jbond: [C: 03+1] "LGTM but i need to leave soon so will need to get someone else to merge (or i can do this afternoon)" [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:52:15] might have been a spike [08:52:17] I am still checking [08:52:42] https://logstash.wikimedia.org/goto/ede19894b5537e826318ec2c90646145 is clean [08:53:08] marostegui: you might be seeing T308640 which I'm deploying the fix right now [08:53:09] T308640: Error: Call to undefined method CirrusSearch\Connection::getPageType() - https://phabricator.wikimedia.org/T308640 [08:53:52] no, saw this https://logstash.wikimedia.org/goto/d11d630abb237950ad6a832d162638bb [08:53:54] which is now gone [08:54:34] aah, it can be because of templatelinks backfill [08:54:46] it's mostly done on s7 [08:55:47] marostegui: oh and the new mw version also completely changes how making connection to dbs work https://gerrit.wikimedia.org/r/c/mediawiki/core/+/767769 [08:55:49] (03CR) 10David Caro: [C: 03+2] base::firewall: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [08:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27909 and previous config saved to /var/cache/conftool/dbconfig/20220518-085551-ladsgroup.json [08:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:57] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:56:57] (03CR) 10Jcrespo: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [08:57:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [08:57:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4003.ulsfo.wmnet [08:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:04] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [08:59:15] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:01:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:01:40] (03PS1) 10Btullis: Update DataHub version to 0.8.34 [deployment-charts] - 10https://gerrit.wikimedia.org/r/792988 (https://phabricator.wikimedia.org/T308052) [09:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4003.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [09:01:53] (03PS7) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [09:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4003.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:38] (03PS1) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 
(https://phabricator.wikimedia.org/T308639) [09:04:27] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:04:28] PROBLEM - jenkins_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [09:04:30] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:49] (03Merged) 10jenkins-bot: Remove reference to Elastica\Type [extensions/GeoData] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792652 (https://phabricator.wikimedia.org/T308044) (owner: 10Ladsgroup) [09:05:05] (03PS1) 10Majavah: confd: Use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792990 (https://phabricator.wikimedia.org/T308601) [09:05:18] !log rolling upgrade to HAProxy 2.4.17 in ulsfo [09:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:21] jnuche: hashar the backport is being deployed now [09:06:25] (03PS2) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [09:06:45] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/GeoData/includes/Searcher.php: Backport: [[gerrit:792652|Remove reference to Elastica\Type (T308044)]] (duration: 00m 52s) [09:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:50] T308044: Remove reference to Elastica\Type from CirrusSearch and related extensions and upgrade to Elastica 7.1.5 - https://phabricator.wikimedia.org/T308044 [09:06:58] Amir1: <3 [09:07:15] ^^ [09:07:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35325/console" [puppet] - 10https://gerrit.wikimedia.org/r/792990 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [09:07:54] zeroed now https://logstash.wikimedia.org/goto/3411e2f0aabe02d95af4ab1c07bf54cf [09:07:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10jcrespo) [09:08:00] !log Restarting CI Jenkins once more [09:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:21] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10jcrespo) [09:08:36] (03PS3) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [09:08:58] RECOVERY - jenkins_service_running on contint2001 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [09:09:27] amir1 hashar: getPageType errors have stopped :) [09:09:31] dcausse: I have no idea how the issue managed to pass CI though :-\ [09:09:53] amir1: thank you so much [09:10:00] nada [09:10:03] (03PS1) 10David Caro: varnish: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) [09:10:06] :D [09:10:10] I'm guessing phan is not analyzing GeoData on cirrus patches, filing a task about that [09:10:14] (03CR) 10Btullis: [C: 03+2] Update DataHub version to 0.8.34 [deployment-charts] - 10https://gerrit.wikimedia.org/r/792988 (https://phabricator.wikimedia.org/T308052) (owner: 10Btullis) [09:10:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:23] there was a patch made to CirrusSearch earlier 
today https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/792649 which does include Geodata as a dependency and passed just fine [09:10:34] so I am guessing GeoData does not have any test covering that area of the code [09:10:34] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10MoritzMuehlenhoff) LGTM, let's not use less than 20G for the disk, smaller disks than that tend to amass disk space alerts when new kernels gets installed etc. [09:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27910 and previous config saved to /var/cache/conftool/dbconfig/20220518-091056-ladsgroup.json [09:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:09] (03PS1) 10Majavah: burrow: move to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792992 (https://phabricator.wikimedia.org/T308601) [09:11:11] and indeed Phan running against CirrusSearch only analyzes CirrusSearch and would not catch an issue to Geodata [09:11:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:11:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:20] so I guess Geodata lacks some test [09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:31] hashar: GeoData is certainly lacking unit tests but I would hope that phan could catch this kind of mistake? 
[09:11:33] and maybe one day we will need a phan job which tests all extensions together [09:12:03] dcausse: I think you haven't set GeoData as dependency of CirrusSearch [09:12:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:13] Amir1: yes that's my guess too [09:12:19] I think a patch sent to Geodata wmf branch would have caused phan to fail [09:12:19] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35328/console" [puppet] - 10https://gerrit.wikimedia.org/r/792992 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [09:12:28] (03PS2) 10Jelto: aptrepo: add gitlab package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) [09:12:38] in catching stuff like this the dependency should be the opposite so you get to run phan tests of geodata when making cirrussearch patches [09:12:40] (03PS4) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [09:13:26] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:14:04] (03PS4) 10Slyngshede: Modernize aptrepo module.
[puppet] - 10https://gerrit.wikimedia.org/r/792975 [09:14:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2056.codfw.wmnet with OS bullseye [09:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:23] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2056.codfw.wmnet with OS bullseye completed: - ms-be2056 (**PASS**) - Downtim... [09:14:59] (03Merged) 10jenkins-bot: Update DataHub version to 0.8.34 [deployment-charts] - 10https://gerrit.wikimedia.org/r/792988 (https://phabricator.wikimedia.org/T308052) (owner: 10Btullis) [09:15:35] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) [09:15:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:15:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27911 and previous config saved to /var/cache/conftool/dbconfig/20220518-091544-ladsgroup.json [09:15:47] (03CR) 10Jelto: aptrepo: add gitlab package for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:50] T298555: Fix 
mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [09:16:24] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792990 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [09:17:14] filed T308645 [09:17:14] T308645: CirrusSearch should include GeoData in its phan analysis - https://phabricator.wikimedia.org/T308645 [09:17:27] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:36] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:18:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35330/console" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [09:18:43] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:49] (03PS1) 10Majavah: eventlogging: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792996 (https://phabricator.wikimedia.org/T308601) [09:20:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:23:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1005.eqiad.wmnet [09:23:50] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-codfw.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:07] looking ^ [09:24:15] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35332/console" [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [09:25:32] (03PS8) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [09:26:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27912 and previous config saved to /var/cache/conftool/dbconfig/20220518-092601-ladsgroup.json [09:26:02] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [09:26:04] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 3 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10dcaro) [09:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:12] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 3 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10dcaro) 05Open→03In progress [09:26:20] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 3 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10dcaro) a:03dcaro [09:26:31] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 5 others: Puppet fails on new 
cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10dcaro) [09:26:36] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 5 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10dcaro) [09:27:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1005.eqiad.wmnet [09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:48] !log depooling elastic2054 seeing hardware errors (Hardware error from APEI Generic Hardware Error Source: 65534) [09:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:03] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:28:41] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10akosiaris) deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002 as it is the canonical deployment host. the mw* hosts can be done at any time I guess. What... [09:28:41] moritzm: kubestagetcd1004 above down is because of ganeti1005 reboot? 
[09:28:48] (03PS1) 10David Caro: redis: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) [09:29:24] (03CR) 10jerkins-bot: [V: 04-1] redis: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [09:30:28] RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [09:31:37] (03PS2) 10David Caro: redis: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) [09:31:59] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [09:32:03] is DC-Ops the right phab tag for H/W related problems? [09:32:21] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10jcrespo) my intention is to run: ` sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 10 eqiad_D --network private backupmon ` [09:33:14] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) > What's the timeline for this? Current step is to gather limitations that can impact scheduling. Then it will depend on DCops. I'd like `D5` to happen in the next 3 months. Less ur...
[09:34:29] disregard my question, found https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ [09:34:52] (03PS1) 10JMeybohm: Add debian directory [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 [09:34:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:35:32] (03PS2) 10JMeybohm: Add debian directory [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) [09:36:09] (03PS9) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [09:36:33] 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elasticsearch2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10dcausse) [09:36:58] 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10dcausse) [09:36:58] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:39:35] (03PS1) 10Volans: drmrs: add missing Netbox include for PTRs [dns] - 10https://gerrit.wikimedia.org/r/793001 (https://phabricator.wikimedia.org/T155761) [09:39:37] (03PS1) 10Volans: pfw3-codfw: remove manual record managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793002 (https://phabricator.wikimedia.org/T155761) [09:39:39] (03PS1) 10Volans: cloud codfw1dev: remove records managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793003 (https://phabricator.wikimedia.org/T155761) [09:39:41] (03PS1) 10Volans: iwiki-mail-codfw: remove records managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793004 
(https://phabricator.wikimedia.org/T155761) [09:39:43] (03PS3) 10David Caro: redis: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) [09:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27913 and previous config saved to /var/cache/conftool/dbconfig/20220518-094106-ladsgroup.json [09:41:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:41:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:13] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] (03PS2) 10Volans: wiki-mail-codfw: remove records managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793004 (https://phabricator.wikimedia.org/T155761) [09:41:53] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [09:42:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10jcrespo) @MoritzMuehlenhoff Thank you a lot for the quick review. My intention, on the second iteration (first I will just split the icinga-npre checks) is to (temporarily) dupli... [09:42:35] (03CR) 10Volans: "Moritz, I'm adding you because of git blame, feel free to add any other relevant people." 
[dns] - 10https://gerrit.wikimedia.org/r/793004 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [09:43:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM for backupmon1001 - https://phabricator.wikimedia.org/T308643 (10jcrespo) [09:45:18] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:45:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:45:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:45:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [09:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [09:45:52] !log rolling upgrade to HAProxy 2.4.17 in eqsin - T307444 [09:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [09:46:43] !log T308647: banning elastic2054 from production-search-psi-codfw and elastic2054-production-search-codfw [09:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] T308647: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 [09:47:23] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/793005 [09:48:14] (03PS10) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [09:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:30] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:50:18] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:54:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27914 and previous config saved to 
/var/cache/conftool/dbconfig/20220518-095442-ladsgroup.json [09:54:47] (03CR) 10David Caro: [C: 03+1] "> Before merging this change the DNS name for 208.80.153.50 in Netbox" [dns] - 10https://gerrit.wikimedia.org/r/793003 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [09:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:47] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:54:50] !log root@cumin1001 START - Cookbook sre.ganeti.makevm for new host backupmon1001.eqiad.wmnet [09:54:51] !log root@cumin1001 START - Cookbook sre.dns.netbox [09:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:58] (03PS5) 10Slyngshede: Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 [09:56:56] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:58:25] (03PS4) 10David Caro: redis: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) [09:59:54] (03PS7) 10Filippo Giunchedi: netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [09:59:56] (03PS9) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [09:59:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35334/console" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [10:00:36] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35335/console" [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [10:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27915 and previous config saved to /var/cache/conftool/dbconfig/20220518-100105-ladsgroup.json [10:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:12] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:01:31] (03PS6) 10Slyngshede: Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 [10:01:43] (03PS1) 10Marostegui: dbproxy2*: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/793009 (https://phabricator.wikimedia.org/T307673) [10:03:45] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/793009 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [10:04:21] !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:37] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10Marostegui) [10:04:41] (03CR) 10Marostegui: [C: 03+2] dbproxy2*: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/793009 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [10:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27917 and previous config saved to /var/cache/conftool/dbconfig/20220518-100531-ladsgroup.json [10:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:36] T298555: Fix mismatching field 
type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:06:16] !log Reboot dbproxy2* for kernel upgrade T307673 [10:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35336/console" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [10:07:31] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35337/console" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [10:08:04] (03PS2) 10Majavah: eventlogging: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792996 (https://phabricator.wikimedia.org/T308601) [10:08:50] (03CR) 10Volans: [C: 03+1] "makes sense to me." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792986 (owner: 10Ayounsi) [10:09:22] (03PS1) 10Majavah: gdnsd: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793010 (https://phabricator.wikimedia.org/T308601) [10:09:49] (03CR) 10Slyngshede: Modernize aptrepo module. 
[puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [10:09:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:10:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35338/console" [puppet] - 10https://gerrit.wikimedia.org/r/792996 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:13:09] (03PS5) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [10:13:43] (03PS11) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [10:14:02] (03PS1) 10Marostegui: Revert "dbproxy2*: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/792653 [10:14:04] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host backupmon1001.eqiad.wmnet [10:14:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35340/console" [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [10:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:15] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [10:15:10] (03PS1) 10Majavah: Add dummy authdns keys to fix PCC [labs/private] - 10https://gerrit.wikimedia.org/r/793013 [10:15:17] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy2*: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/792653 (owner: 10Marostegui) [10:15:44] (03PS2) 10Majavah: Add dummy authdns keys to fix PCC [labs/private] - 10https://gerrit.wikimedia.org/r/793013 [10:15:59] (03CR) 10Filippo Giunchedi: sre: port mx queue high page (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27918 and previous config saved to /var/cache/conftool/dbconfig/20220518-101610-ladsgroup.json [10:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] (03PS1) 10Marostegui: dbproxy(1012,1015,1016,1021): Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/793014 (https://phabricator.wikimedia.org/T307673) [10:18:21] (03CR) 10Marostegui: [C: 03+2] dbproxy(1012,1015,1016,1021): Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/793014 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [10:18:47] (03CR) 10Ayounsi: [C: 03+1] drmrs: add missing Netbox include for PTRs [dns] - 10https://gerrit.wikimedia.org/r/793001 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:19:44] (03PS1) 10Majavah: haproxy: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793015 (https://phabricator.wikimedia.org/T308601) [10:19:50] (03CR) 10Ayounsi: [C: 03+1] pfw3-codfw: remove manual record managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793002 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:20:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27919 and previous config saved to /var/cache/conftool/dbconfig/20220518-102036-ladsgroup.json [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:34] 
(03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35342/console" [puppet] - 10https://gerrit.wikimedia.org/r/793015 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:22:36] (03CR) 10Majavah: haproxy: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793015 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:23:12] (03PS12) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [10:25:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35343/console" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:29:20] (03PS8) 10Filippo Giunchedi: netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [10:29:22] (03PS10) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [10:29:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2059.codfw.wmnet with OS bullseye [10:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:45] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2059.codfw.wmnet with OS bullseye [10:30:43] (03CR) 10jerkins-bot: [V: 04-1] netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) 
[10:31:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27920 and previous config saved to /var/cache/conftool/dbconfig/20220518-103115-ladsgroup.json [10:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35345/console" [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [10:32:25] (03PS9) 10Filippo Giunchedi: netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [10:32:27] (03PS11) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [10:33:13] (03CR) 10Majavah: [C: 03+1] varnish: move to nagios::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [10:33:52] (03CR) 10Majavah: [C: 03+1] varnish: move to nagios::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [10:34:37] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [10:35:32] (03CR) 10jerkins-bot: [V: 04-1] netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:35:34] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35346/console" [puppet] - 
10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:35:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27921 and previous config saved to /var/cache/conftool/dbconfig/20220518-103541-ladsgroup.json [10:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:41:04] (03CR) 10Muehlenhoff: [C: 03+1] "All great minds think alike :-) See https://gerrit.wikimedia.org/r/c/operations/dns/+/723432 where I added you as reviewer back then." [dns] - 10https://gerrit.wikimedia.org/r/793004 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:41:37] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35347/console" [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:42:28] 10SRE, 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10Marostegui) p:05Triage→03Medium a:03Papaul The idrac doesn't show anything: ` /admin1/system1/logs1/log1-> show record1 associations targets verbs... 
[10:42:38] (03CR) 10Volans: [C: 03+2] drmrs: add missing Netbox include for PTRs [dns] - 10https://gerrit.wikimedia.org/r/793001 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:43:01] (03CR) 10Volans: [C: 03+2] pfw3-codfw: remove manual record managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793002 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:43:18] (03CR) 10Volans: [C: 03+2] cloud codfw1dev: remove records managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793003 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:43:37] (03CR) 10Volans: [C: 03+2] wiki-mail-codfw: remove records managed by Netbox (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/793004 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [10:45:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5002.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage [10:45:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5002.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage [10:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2059.codfw.wmnet with reason: host reimage [10:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T303603)', diff saved to https://phabricator.wikimedia.org/P27922 and previous config saved to /var/cache/conftool/dbconfig/20220518-104620-ladsgroup.json [10:46:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:46:23] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:26] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:46:27] (03PS1) 10Marostegui: Revert "dbproxy(1012,1015,1016,1021): Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/792654 [10:46:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P27923 and previous config saved to /var/cache/conftool/dbconfig/20220518-104628-ladsgroup.json [10:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:46] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) ganeti5002 is removed from the cluster and needs the same firmware/NIC updates as ganeti4* to enable the reimage to Bullseye. 
[10:47:24] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy(1012,1015,1016,1021): Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/792654 (owner: 10Marostegui) [10:47:31] (03CR) 10Majavah: [C: 04-1] redis: move to nagios::plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/792998 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [10:48:22] (03CR) 10Vgutierrez: [C: 04-1] varnish: annotate X-Analytics header with matching requestctl actions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [10:48:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2059.codfw.wmnet with reason: host reimage [10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27924 and previous config saved to /var/cache/conftool/dbconfig/20220518-105046-ladsgroup.json [10:50:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:50:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:51] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:47] (03PS1) 10Volans: EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 [10:54:03] (03CR) 10Jelto: [C: 03+2] aptrepo: add gitlab package for 
bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:54:52] (03CR) 10Ayounsi: "Thanks, closing some comments, opening others but it's pretty much ready to be merged according to me :)" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:56:22] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [10:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:13] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [10:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:55] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [10:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:37] !log rolling upgrade to HAProxy 2.4.17 in drmrs - T307444 [11:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:42] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [11:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2059.codfw.wmnet with OS bullseye [11:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:47] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs 
options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2059.codfw.wmnet with OS bullseye completed: - ms-be2059 (**PASS**) - Downtim... [11:05:45] (03Abandoned) 10Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [11:08:21] (03PS1) 10Cathal Mooney: Refactor routing-instances template for mgmt-vrf [homer/public] - 10https://gerrit.wikimedia.org/r/793021 (https://phabricator.wikimedia.org/T304989) [11:08:36] (03CR) 10Hnowlan: WIP: enable cassandra encryption (aqs cluster) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [11:08:48] (03PS1) 10Jcrespo: dbbackups: Update definition to use Floats for type [puppet] - 10https://gerrit.wikimedia.org/r/793022 (https://phabricator.wikimedia.org/T283017) [11:11:14] (03PS2) 10Jcrespo: dbbackups: Update definition to use Floats for type [puppet] - 10https://gerrit.wikimedia.org/r/793022 (https://phabricator.wikimedia.org/T283017) [11:14:17] (03PS1) 10Jcrespo: dbbackups: Temporarely disable notifications on backupmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/793023 (https://phabricator.wikimedia.org/T283017) [11:14:47] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update definition to use Floats for type [puppet] - 10https://gerrit.wikimedia.org/r/793022 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:15:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:17:03] (03PS2) 10Jcrespo: dbbackups: Temporarely disable notifications on backupmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/793023 (https://phabricator.wikimedia.org/T283017) [11:17:24] (03PS3) 10Jcrespo: dbbackups: 
Temporarely disable notifications on backupmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/793023 (https://phabricator.wikimedia.org/T283017) [11:18:59] (03PS4) 10Jcrespo: dbbackups: Temporarely disable notifications on backupmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/793023 (https://phabricator.wikimedia.org/T283017) [11:20:12] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Temporarely disable notifications on backupmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/793023 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:30:35] l [11:30:57] er typo [11:34:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:34:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:34:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [11:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:01] (03PS1) 10Jcrespo: dbbackups: Reinstall backupmon1001 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793026 (https://phabricator.wikimedia.org/T283017) [11:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [11:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:40] (03PS2) 10Jcrespo: dbbackups: Reinstall backupmon1001 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793026 (https://phabricator.wikimedia.org/T283017) [11:36:22] (03CR) 10Joal: [C: 03+1] "Ok for me (with the correction already spotted)" [puppet] - 10https://gerrit.wikimedia.org/r/791372 
(https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [11:36:51] (03PS1) 10Stang: zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 (https://phabricator.wikimedia.org/T308620) [11:38:29] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:38:59] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reinstall backupmon1001 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793026 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:42:08] (03PS1) 10Mainframe98: Remove ElementTiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) [11:45:50] (03PS6) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [11:45:52] (03PS2) 10Stang: zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 (https://phabricator.wikimedia.org/T308620) [11:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P27925 and previous config saved to /var/cache/conftool/dbconfig/20220518-114643-ladsgroup.json [11:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:49] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:47:10] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [11:57:08] (03PS1) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) 
[11:57:23] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:31] (03CR) 10jerkins-bot: [V: 04-1] admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:01:46] (03PS2) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [12:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27927 and previous config saved to /var/cache/conftool/dbconfig/20220518-120148-ladsgroup.json [12:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:02:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:02:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298555)', diff saved to https://phabricator.wikimedia.org/P27928 and previous config saved to 
/var/cache/conftool/dbconfig/20220518-120209-ladsgroup.json [12:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:17] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:02:35] (03PS1) 10Stang: zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) [12:02:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35349/console" [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:03:10] (03CR) 10jerkins-bot: [V: 04-1] admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:04:37] (03CR) 10Jbond: [C: 03+1] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/792996 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:04:40] (03CR) 10Jbond: [C: 03+2] eventlogging: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792996 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:05:32] (03PS1) 10Majavah: base: remove nrpe old plugin files [puppet] - 10https://gerrit.wikimedia.org/r/793034 (https://phabricator.wikimedia.org/T308601) [12:06:32] (03PS2) 10Majavah: base: remove nrpe old plugin files [puppet] - 10https://gerrit.wikimedia.org/r/793034 (https://phabricator.wikimedia.org/T308601) [12:06:34] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793010 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:07:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] Add dummy authdns keys to fix PCC [labs/private] - 10https://gerrit.wikimedia.org/r/793013 (owner: 10Majavah) 
[12:08:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35352/console" [puppet] - 10https://gerrit.wikimedia.org/r/793034 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:08:53] (03CR) 10Jbond: [C: 03+2] haproxy: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793015 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:10:21] taavi: thanks, I think I have merged all the outstanding nrpe changes [12:12:25] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793034 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:13:19] jbond: thanks [12:13:45] (03PS7) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [12:13:51] np thank you :) [12:14:12] (03PS3) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [12:14:25] (03PS4) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [12:14:33] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:14:57] (03PS1) 10Slyngshede: Private APT repository, for internal use only. [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) [12:15:46] (03CR) 10jerkins-bot: [V: 04-1] Private APT repository, for internal use only. 
[puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [12:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27929 and previous config saved to /var/cache/conftool/dbconfig/20220518-121653-ladsgroup.json [12:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:23] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 5 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10jbond) This issue is also affecting production reimages see P27926 [12:21:05] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:20] (03PS1) 10Majavah: impi: update to use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793037 (https://phabricator.wikimedia.org/T308601) [12:22:22] (03PS10) 10Filippo Giunchedi: netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [12:22:24] (03PS12) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [12:22:41] (03PS1) 10Jcrespo: dbbackups: Reenable checks now that they are working as intended [puppet] - 10https://gerrit.wikimedia.org/r/793038 (https://phabricator.wikimedia.org/T283017) [12:23:29] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35353/console" [puppet] - 10https://gerrit.wikimedia.org/r/793037 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:23:34] (03CR) 10Jbond: "done thanks" [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/792644 (owner: 10Jbond) [12:23:38] (03PS2) 10Jbond: hiera_export: add unmanaged (mostly) network devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644 [12:24:00] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable checks now that they are working as intended [puppet] - 10https://gerrit.wikimedia.org/r/793038 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:26:22] (03CR) 10Volans: [C: 03+1] "I didn't tested it but LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644 (owner: 10Jbond) [12:26:47] (03PS1) 10Gergő Tisza: Campaign templates: show legal footer on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792655 (https://phabricator.wikimedia.org/T307521) [12:28:03] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35354/console" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:29:41] (03PS3) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) [12:31:27] (03CR) 10Jbond: "PCC SUCCESS (NOOP 202 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35351/console" [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:31:53] (03PS1) 10Slyngshede: Remove old cron calls. 
[puppet] - 10https://gerrit.wikimedia.org/r/793040 [12:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P27930 and previous config saved to /var/cache/conftool/dbconfig/20220518-123158-ladsgroup.json [12:32:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:32:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:32:04] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:32:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T303603)', diff saved to https://phabricator.wikimedia.org/P27931 and previous config saved to /var/cache/conftool/dbconfig/20220518-123211-ladsgroup.json [12:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:25] (03CR) 10Jcrespo: "@Filippo:" [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:33:47] (03CR) 10Filippo Giunchedi: [V: 03+1] 
"Thank you for the review/assistance!" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:34:29] (03CR) 10jerkins-bot: [V: 04-1] Remove old cron calls. [puppet] - 10https://gerrit.wikimedia.org/r/793040 (owner: 10Slyngshede) [12:34:31] (03PS2) 10Stang: zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) [12:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T303603)', diff saved to https://phabricator.wikimedia.org/P27932 and previous config saved to /var/cache/conftool/dbconfig/20220518-123447-ladsgroup.json [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:01] (03PS4) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) [12:35:08] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35355/console" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:35:18] (03PS2) 10Slyngshede: Remove old cron calls. 
[puppet] - 10https://gerrit.wikimedia.org/r/793040 (https://phabricator.wikimedia.org/T790325) [12:38:09] (03PS1) 10Majavah: pybal: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793041 (https://phabricator.wikimedia.org/T308601) [12:38:36] (03PS1) 10Jcrespo: ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - 10https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017) [12:39:15] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35356/console" [puppet] - 10https://gerrit.wikimedia.org/r/793041 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:39:37] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:40:20] (03CR) 10Jcrespo: "Adding Amir to ok the grant changes (I can deploy those)." [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:40:22] (03PS1) 10Majavah: profile: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793044 [12:40:44] (03PS2) 10Majavah: profile: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793044 (https://phabricator.wikimedia.org/T308601) [12:44:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/793040 (https://phabricator.wikimedia.org/T790325) (owner: 10Slyngshede) [12:45:49] (03PS3) 10Majavah: profile: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793044 (https://phabricator.wikimedia.org/T308601) [12:46:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2060.codfw.wmnet with OS bullseye [12:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:22] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - 
https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2060.codfw.wmnet with OS bullseye [12:47:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298555)', diff saved to https://phabricator.wikimedia.org/P27933 and previous config saved to /var/cache/conftool/dbconfig/20220518-124708-ladsgroup.json [12:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:13] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:47:48] (03PS8) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [12:47:50] (03CR) 10Ladsgroup: alerting_host: Remove references to dbbackups monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:48:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35359/console" [puppet] - 10https://gerrit.wikimedia.org/r/793044 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:48:28] (03CR) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:48:32] (03PS5) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [12:48:35] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:48:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35360/console" [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [12:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27934 and previous config saved to /var/cache/conftool/dbconfig/20220518-124952-ladsgroup.json [12:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:11] (03CR) 10jerkins-bot: [V: 04-1] Campaign templates: show legal footer on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792655 (https://phabricator.wikimedia.org/T307521) (owner: 10Gergő Tisza) [12:50:17] (03PS3) 10Stang: zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) [12:51:36] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:52:38] I'll take the deploy window given it's been a while and I've filled the window with patches of my own. :-) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T1300). [13:00:05] James_F and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:14] tgr_: OK for me to merge your patch? [13:00:14] James_F: go ahead :) [13:00:20] Lucas_WMDE: Ta! 
[13:00:26] (03CR) 10Jforrester: [C: 03+2] [shnwiki] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792737 (https://phabricator.wikimedia.org/T308623) (owner: 10Jforrester) [13:01:11] (03Merged) 10jenkins-bot: [shnwiki] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792737 (https://phabricator.wikimedia.org/T308623) (owner: 10Jforrester) [13:01:25] (03PS9) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [13:02:11] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27935 and previous config saved to /var/cache/conftool/dbconfig/20220518-130213-ladsgroup.json [13:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35361/console" [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:02:52] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792737|[shnwiki] Enable the SandboxLink extension (T308623)]] (duration: 00m 53s) [13:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:57] T308623: Add sandbox menu to Shan Wikipedia - https://phabricator.wikimedia.org/T308623 [13:03:16] (03PS10) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [13:03:18] (03PS5) 10Jforrester: Disable LocalisationUpdate, part II 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) [13:03:20] (03PS6) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [13:03:27] (03CR) 10Jforrester: [C: 03+2] Disable LocalisationUpdate, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [13:03:47] (03PS1) 10Jelto: gitlab: fix gitlab-ce apt component on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/793046 (https://phabricator.wikimedia.org/T307142) [13:04:09] (03CR) 10jerkins-bot: [V: 04-1] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:04:27] (03Merged) 10jenkins-bot: Disable LocalisationUpdate, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [13:04:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27936 and previous config saved to /var/cache/conftool/dbconfig/20220518-130457-ladsgroup.json [13:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:07] James_F: yes, thanks. sorry was afk. 
[13:05:43] (03CR) 10Jbond: [C: 03+1] impi: update to use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793037 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [13:05:45] (03CR) 10Jbond: [C: 03+2] impi: update to use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793037 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [13:05:49] tgr: No worries. :-) [13:05:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:05:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35363/console" [puppet] - 10https://gerrit.wikimedia.org/r/793046 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:06:18] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:677326|Disable LocalisationUpdate, part II (T158360)]] (duration: 00m 52s) [13:06:18] tgr: Is the V-1 fixed in a subsequent patch? 
[13:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:23] T158360: RFC: Reevaluate LocalisationUpdate extension for WMF - https://phabricator.wikimedia.org/T158360 [13:06:38] (03PS4) 10Jforrester: Disable LocalisationUpdate, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) [13:06:41] (03CR) 10Jforrester: [C: 03+2] Disable LocalisationUpdate, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [13:06:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:11] James_F: it's a side effect of Parsoid being loaded in an unusual way. It happens occasionally, can be ignored. [13:07:16] tgr: Ack. [13:07:20] (03CR) 10Jforrester: [C: 03+2] Campaign templates: show legal footer on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792655 (https://phabricator.wikimedia.org/T307521) (owner: 10Gergő Tisza) [13:07:35] (03Merged) 10jenkins-bot: Disable LocalisationUpdate, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [13:07:35] (you'll probably have to force merge though.) [13:07:40] Meh. [13:07:48] (03CR) 10Jbond: [C: 03+2] pybal: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793041 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [13:07:52] OK. [13:07:55] (03PS1) 10Volans: Remove unused PTRs from old experiment [dns] - 10https://gerrit.wikimedia.org/r/793047 (https://phabricator.wikimedia.org/T155761) [13:08:13] (03CR) 10Jforrester: [V: 03+2 C: 03+2] "Per tgr." 
[extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792655 (https://phabricator.wikimedia.org/T307521) (owner: 10Gergő Tisza) [13:08:31] (03CR) 10Ayounsi: [C: 03+1] Remove unused PTRs from old experiment [dns] - 10https://gerrit.wikimedia.org/r/793047 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:08:56] !log jforrester@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:677327|Disable LocalisationUpdate, part III (T158360)]] (duration: 00m 53s) [13:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] AIUI Parsoid doesn't use a matching branching scheme so wmf branches can be affected when a parser test is changed in master. [13:10:31] tgr: Live on mwdebug1002. [13:10:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793044 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [13:10:45] (03CR) 10Jbond: [C: 03+2] profile: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793044 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [13:11:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:54] tgr: OK to proceed? [13:13:56] James_F: works, thanks! 
[13:14:15] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:14:21] Excellent. [13:14:32] (03CR) 10Volans: [C: 03+2] Remove unused PTRs from old experiment [dns] - 10https://gerrit.wikimedia.org/r/793047 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:15:47] !log jforrester@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/GrowthExperiments: Backport: [[gerrit:792655|Campaign templates: show legal footer on mobile (T307521)]] (duration: 00m 53s) [13:15:50] (03PS3) 10Jforrester: Allow wikifunctions.org URLs to be used in the URL Shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771620 [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:52] T307521: Support templating for Growth campaign landing pages - https://phabricator.wikimedia.org/T307521 [13:16:00] (03CR) 10Jforrester: [C: 03+2] Allow wikifunctions.org URLs to be used in the URL Shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771620 (owner: 10Jforrester) [13:16:49] (03Merged) 10jenkins-bot: Allow wikifunctions.org URLs to be used in the URL Shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771620 (owner: 10Jforrester) [13:17:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27937 and previous config saved to /var/cache/conftool/dbconfig/20220518-131718-ladsgroup.json [13:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:08] (03PS3) 10Jforrester: Allow wikifunctions.org to use the CAPTCHA system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771621 [13:18:19] Anyone have anything they'd like deployed, whilst I'm here? 
[13:18:44] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771620|Allow wikifunctions.org URLs to be used in the URL Shortener]] (duration: 00m 54s) [13:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T303603)', diff saved to https://phabricator.wikimedia.org/P27938 and previous config saved to /var/cache/conftool/dbconfig/20220518-132002-ladsgroup.json [13:20:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:20:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:20:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T303603)', diff saved to https://phabricator.wikimedia.org/P27939 and previous config saved to /var/cache/conftool/dbconfig/20220518-132011-ladsgroup.json [13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] (03PS4) 10Jforrester: InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) (owner: 10Samtar) [13:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (16) node(s) change every puppet run: gitlab1003, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, 
thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:20:22] (03CR) 10Jforrester: [C: 03+2] InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) (owner: 10Samtar) [13:20:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2060.codfw.wmnet with reason: host reimage [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] (03Merged) 10jenkins-bot: InitialiseSettings: Enable SandboxLink for uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791787 (https://phabricator.wikimedia.org/T308399) (owner: 10Samtar) [13:22:12] (03PS1) 10Btullis: Deploy updated datahub containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/793048 (https://phabricator.wikimedia.org/T308052) [13:22:31] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791787|InitialiseSettings: Enable SandboxLink for uzwiki (T308399)]] (duration: 00m 53s) [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:36] T308399: Add "SandboxLink" to Uzbek Wikipedia - https://phabricator.wikimedia.org/T308399 [13:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T303603)', diff saved to https://phabricator.wikimedia.org/P27940 and previous config saved to /var/cache/conftool/dbconfig/20220518-132248-ladsgroup.json [13:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:08] (03CR) 10Jforrester: [C: 03+2] Allow wikifunctions.org to use the CAPTCHA system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771621 (owner: 10Jforrester) [13:23:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:10] (03Merged) 10jenkins-bot: Allow wikifunctions.org to use the CAPTCHA system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771621 (owner: 10Jforrester) [13:24:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2060.codfw.wmnet with reason: host reimage [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 204): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35362/console" [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:27:02] (03PS11) 10Jbond: admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) [13:27:12] (03PS7) 10Jbond: admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) [13:27:24] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771621|Allow wikifunctions.org to use the CAPTCHA system]] (duration: 00m 52s) [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] (03CR) 10Btullis: [C: 03+2] Deploy updated datahub containers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/793048 (https://phabricator.wikimedia.org/T308052) (owner: 10Btullis) [13:29:42] (03CR) 10Herron: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:30:28] (03CR) 10Jbond: [C: 03+2] admin: convert add_all_users to puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/792989 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:30:31] (03CR) 10Jbond: [C: 03+2] admin: convert unique_users function to puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793031 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:30:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] (03PS2) 10Volans: EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 [13:31:00] (03PS1) 10Volans: cloud codfw1dev: remove records managed by Netbox [dns] - 10https://gerrit.wikimedia.org/r/793050 (https://phabricator.wikimedia.org/T155761) [13:31:39] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [13:31:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] (03PS1) 10Majavah: P:openstack: make pdns-recusor listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/793051 [13:32:24] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298555)', diff saved to https://phabricator.wikimedia.org/P27941 and previous config saved to /var/cache/conftool/dbconfig/20220518-133223-ladsgroup.json [13:32:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:32:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [13:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27942 and previous config saved to /var/cache/conftool/dbconfig/20220518-133231-ladsgroup.json [13:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:54] (03PS5) 10Jforrester: Make use of the ?? operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 (owner: 10Thiemo Kreuz (WMDE)) [13:33:47] (03Merged) 10jenkins-bot: Deploy updated datahub containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/793048 (https://phabricator.wikimedia.org/T308052) (owner: 10Btullis) [13:34:01] (03PS2) 10Majavah: P:openstack: make pdns-recusor listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/793051 [13:34:12] (03CR) 10Jforrester: [C: 03+2] Make use of the ?? 
operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 (owner: 10Thiemo Kreuz (WMDE)) [13:34:39] !log rolling upgrade to HAProxy 2.4.17 in codfw - T307444 [13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:45] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [13:35:05] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35365/console" [puppet] - 10https://gerrit.wikimedia.org/r/793051 (owner: 10Majavah) [13:35:19] (03Merged) 10jenkins-bot: Make use of the ?? operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 (owner: 10Thiemo Kreuz (WMDE)) [13:36:30] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27943 and previous config saved to /var/cache/conftool/dbconfig/20220518-133753-ladsgroup.json [13:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:32] !log jforrester@deploy1002 Synchronized docroot/wwwportal/w/search-redirect.php: Config: [[gerrit:740304|Make use of the ?? 
operator in more trivial situations]] (duration: 00m 51s) [13:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:38:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:04] !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache ns-recursor0.openstack.codfw1dev.wikimediacloud.org on all recursors [13:39:07] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ns-recursor0.openstack.codfw1dev.wikimediacloud.org on all recursors [13:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:15] !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache ns-recursor1.openstack.codfw1dev.wikimediacloud.org on all recursors [13:39:19] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ns-recursor1.openstack.codfw1dev.wikimediacloud.org on all recursors [13:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:41] !log jforrester@deploy1002 Synchronized docroot/noc/conf/highlight.php: Config: [[gerrit:740304|Make use of the ?? 
operator in more trivial situations]] (duration: 00m 51s) [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2060.codfw.wmnet with OS bullseye [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:48] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2060.codfw.wmnet with OS bullseye completed: - ms-be2060 (**PASS**) - Downtim... [13:40:58] !log jforrester@deploy1002 Synchronized rpc/RunJobs.php: Config: [[gerrit:740304|Make use of the ?? operator in more trivial situations]] (duration: 00m 51s) [13:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:16] (03PS7) 10Slyngshede: Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 [13:42:00] !log jforrester@deploy1002 Synchronized w/health-check.php: Config: [[gerrit:740304|Make use of the ?? operator in more trivial situations]] (duration: 00m 52s) [13:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:13] !log jforrester@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:740304|Make use of the ?? operator in more trivial situations]] (duration: 00m 52s) [13:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:37] !log jforrester@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:740304|Make use of the ?? 
operator in more trivial situations]] (duration: 00m 53s) [13:44:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:44] And done, finally. [13:47:37] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add kubelet labels used in ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/792970 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [13:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:50:41] (03PS1) 10Jforrester: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 [13:51:32] (03CR) 10jerkins-bot: [V: 04-1] build: Upgrade symfony/yaml to 5.4.3, the version we use in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 (owner: 10Jforrester) [13:51:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:51:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27944 and previous config saved to /var/cache/conftool/dbconfig/20220518-135259-ladsgroup.json [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:41] (03CR) 10Ayounsi: [C: 03+1] "Alright, thanks! 
It's good to be merged, and we can iron out the remaining details later on if needed (eg. oob, etc.)" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:53:43] (03PS1) 10Jbond: rake_tasks: skip files with out an extension or with .original.py [puppet] - 10https://gerrit.wikimedia.org/r/793055 [13:53:56] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:54:16] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:20] (03PS2) 10Jbond: rake_tasks: skip files with out an extension or with .original.py [puppet] - 10https://gerrit.wikimedia.org/r/793055 [13:54:22] (03PS2) 10Volans: cloud codfw1dev: fix recursor records [dns] - 10https://gerrit.wikimedia.org/r/793050 (https://phabricator.wikimedia.org/T155761) [13:54:24] (03PS1) 10Volans: tox.ini: remove older Python versions, add 3.9 [dns] - 10https://gerrit.wikimedia.org/r/793056 (https://phabricator.wikimedia.org/T155761) [13:54:26] (03PS1) 10Volans: zone_validator: include Netbox data in the check [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) [13:54:28] PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:32] PROBLEM - Check systemd state on ml-serve-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:54] (03CR) 10Ayounsi: "1 comment then LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:55:26] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Mazevedo) @Marostegui maybe not, is there some way I can check it? [13:55:46] (03CR) 10Majavah: [C: 03+1] cloud codfw1dev: fix recursor records [dns] - 10https://gerrit.wikimedia.org/r/793050 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:55:49] (03CR) 10Jbond: [C: 03+2] rake_tasks: skip files with out an extension or with .original.py [puppet] - 10https://gerrit.wikimedia.org/r/793055 (owner: 10Jbond) [13:56:40] (03CR) 10Eevans: [C: 04-1] WIP: enable cassandra encryption (aqs cluster) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [13:57:02] (03PS3) 10Majavah: P:openstack: make pdns-recusor listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/793051 [13:57:10] (03CR) 10Volans: [C: 03+2] cloud codfw1dev: fix recursor records [dns] - 10https://gerrit.wikimedia.org/r/793050 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:57:17] (03PS3) 10Volans: cloud codfw1dev: fix recursor records [dns] - 10https://gerrit.wikimedia.org/r/793050 (https://phabricator.wikimedia.org/T155761) [13:57:27] (03CR) 10jerkins-bot: [V: 04-1] P:openstack: make pdns-recusor listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/793051 (owner: 10Majavah) [13:57:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:42] (03CR) 10Giuseppe Lavagetto: varnish: annotate X-Analytics header with matching requestctl actions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [13:59:01] (03PS3) 10Giuseppe Lavagetto: 
varnish: annotate X-Analytics header with matching requestctl actions [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) [13:59:05] (03PS3) 10Giuseppe Lavagetto: varnish: set retry-after based on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) [14:01:26] (03CR) 10Ayounsi: [C: 03+1] EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 (owner: 10Volans) [14:02:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] (03PS1) 10Elukey: role::ml_k8s::master: remove node labels for kubelet [puppet] - 10https://gerrit.wikimedia.org/r/793059 (https://phabricator.wikimedia.org/T308418) [14:04:05] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Marostegui) @Mazevedo I am asking cause you answered "yes" to "Do you currently have shell access" but I am not sure you do as I cannot find you on our shell access file (nor your ssh key).... 
[14:05:02] (03CR) 10Cathal Mooney: [C: 03+1] EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 (owner: 10Volans) [14:05:25] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::master: remove node labels for kubelet [puppet] - 10https://gerrit.wikimedia.org/r/793059 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [14:07:14] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:34] RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:38] RECOVERY - Check systemd state on ml-serve-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T303603)', diff saved to https://phabricator.wikimedia.org/P27945 and previous config saved to /var/cache/conftool/dbconfig/20220518-140804-ladsgroup.json [14:08:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:08:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:10] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:08:11] (03PS3) 10Volans: EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 [14:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T303603)', diff saved to https://phabricator.wikimedia.org/P27946 and previous config saved to /var/cache/conftool/dbconfig/20220518-140812-ladsgroup.json [14:08:13] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:23] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Marostegui) @Ottomata can you approve this too? thanks [14:08:56] (03CR) 10Volans: [C: 03+2] EVPN: add missing reverse zonefile includes [dns] - 10https://gerrit.wikimedia.org/r/793020 (owner: 10Volans) [14:09:14] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) [14:09:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:30] !log rolling upgrade to HAProxy 2.4.17 in esams - T307444 [14:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [14:10:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T303603)', diff saved to https://phabricator.wikimedia.org/P27947 and previous config saved to /var/cache/conftool/dbconfig/20220518-141048-ladsgroup.json [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] (03CR) 10Vgutierrez: [C: 03+1] varnish: annotate X-Analytics header with matching requestctl actions [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) 
[14:12:39] (03PS1) 10Marostegui: data.yaml: Add Marina Azevedo to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) [14:12:55] (03CR) 10Marostegui: [C: 04-2] "Waiting for Otto's approval for Superset access" [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) (owner: 10Marostegui) [14:14:35] (03PS1) 10Jbond: wmflib::configparser_format: Replace legacy function with puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793064 (https://phabricator.wikimedia.org/T308639) [14:15:09] (03CR) 10jerkins-bot: [V: 04-1] wmflib::configparser_format: Replace legacy function with puppet function [puppet] - 10https://gerrit.wikimedia.org/r/793064 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [14:15:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27948 and previous config saved to /var/cache/conftool/dbconfig/20220518-142158-ladsgroup.json [14:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:05] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:22:12] (03PS1) 10Jbond: C:graphite: Drop configparser_function in favour of wmflib::ini [puppet] - 10https://gerrit.wikimedia.org/r/793065 [14:22:21] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Ottomata) Approved. Just checking, @Mazevedo, do you need analytics-privatedata-users access? I'd expect so if you are expecting to be able to see dashboards that us... 
[14:23:04] (03CR) 10Hnowlan: New service: image-suggestion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:23:06] (03CR) 10Hnowlan: [C: 03+2] New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:23:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: annotate X-Analytics header with matching requestctl actions [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [14:23:59] 10SRE, 10CirrusSearch, 10Discovery-Search: re-enable deprecation warning logger on elasticsearch once issues are solved - https://phabricator.wikimedia.org/T218995 (10dcausse) 05Open→03Resolved a:03EBernhardson [14:24:54] (03PS2) 10Jbond: C:graphite: Drop configparser_function in favour of wmflib::ini [puppet] - 10https://gerrit.wikimedia.org/r/793065 [14:25:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: set retry-after based on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto) [14:25:15] (03PS4) 10Giuseppe Lavagetto: varnish: set retry-after based on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) [14:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27949 and previous config saved to /var/cache/conftool/dbconfig/20220518-142553-ladsgroup.json [14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:22] (03PS3) 10Jbond: C:graphite: Drop configparser_function in favour of wmflib::ini [puppet] - 10https://gerrit.wikimedia.org/r/793065 [14:27:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35370/console" [puppet] - 10https://gerrit.wikimedia.org/r/793065 (owner: 10Jbond) [14:27:20] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) [14:28:19] (03Merged) 10jenkins-bot: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:29:06] (03CR) 10Ayounsi: [C: 03+1] Refactor routing-instances template for mgmt-vrf [homer/public] - 10https://gerrit.wikimedia.org/r/793021 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:31:24] (03PS2) 10Volans: tox.ini: remove older Python versions, add 3.9 [dns] - 10https://gerrit.wikimedia.org/r/793056 (https://phabricator.wikimedia.org/T155761) [14:31:55] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10calbon) I approve [14:33:30] (03PS1) 10Jelto: dns: add PTR records for gitlab-replica-new.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) [14:33:34] (03CR) 10Volans: "All the patches in the related phabricator tasks were driven by the early results of this improved check." [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:35:48] (03PS3) 10Slyngshede: Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) [14:36:38] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop eventlogs cleanup to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:37:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27951 and previous config saved to /var/cache/conftool/dbconfig/20220518-143703-ladsgroup.json [14:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] !log jnuche@deploy1002 scap failed: average error rate on 6/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:45] (03PS1) 10Jbond: rake/spdx: fix type end_with vs ends_with [puppet] - 10https://gerrit.wikimedia.org/r/793068 [14:40:58] PROBLEM - very high load average likely xfs on ms-be2060 is CRITICAL: CRITICAL - load average: 115.05, 100.65, 73.71 https://wikitech.wikimedia.org/wiki/Swift [14:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27952 and previous config saved to /var/cache/conftool/dbconfig/20220518-144058-ladsgroup.json [14:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:17] (03CR) 10Cathal Mooney: [C: 03+2] Refactor routing-instances template for mgmt-vrf [homer/public] - 10https://gerrit.wikimedia.org/r/793021 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:42:20] (03PS4) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [14:42:34] (03CR) 10Jbond: [C: 03+2] rake/spdx: fix type end_with vs ends_with [puppet] - 10https://gerrit.wikimedia.org/r/793068 (owner: 10Jbond) [14:42:50] (03PS4) 10David Caro: P:openstack: make pdns-recusor listen on ipv6 [puppet] - 
10https://gerrit.wikimedia.org/r/793051 (owner: 10Majavah) [14:43:36] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 452 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:43:47] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/793005 (owner: 10PipelineBot) [14:44:02] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10Joe) 05Open→03Resolved [14:44:03] ugh, seeing an explosion of Missing field: page_restrictions in our attempted rollback [14:44:04] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10Joe) [14:44:10] rolling back the rollback [14:44:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 205 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:44:22] 10SRE, 10conftool, 10Patch-For-Review: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (10Joe) [14:44:24] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10Joe) [14:44:26] 10SRE, 10conftool: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 (10Joe) 05Open→03Resolved a:03Joe [14:44:44] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10Joe) [14:45:16] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: Set commonswiki to 1.39.0-wmf.12 [14:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:28] RECOVERY - MediaWiki exceptions and 
fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:47:50] phew, ok [14:47:58] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:48:26] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/793005 (owner: 10PipelineBot) [14:48:28] Amir1: I saw way earlier that you knew something about the "Missing field: page_restrictions" message? [14:48:35] (03PS4) 10Slyngshede: Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) [14:48:49] thcipriani: yeah it should recover on its own [14:48:58] it's a cache's missing b/c [14:49:34] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793051 (owner: 10Majavah) [14:49:40] ah, so we're trying to rollback commons and scap canaries halted the deploy on a big spike of that, but if we force our way through that is should recover, is that right? [14:49:40] rollbacks makes me sad [14:49:45] ^ [14:50:21] (03CR) 10Ori: "_joe_ : The alternative is to query the remote registry via curl and compare checksums. I hacked together a script : https://phabricator.w" [puppet] - 10https://gerrit.wikimedia.org/r/792749 (https://phabricator.wikimedia.org/T308598) (owner: 10Ori) [14:50:30] yeah but do we want to fix wmf.12 quickly? 
[14:50:45] the reason is sorta known (I want to avoid spikes) [14:50:53] oh sure [14:51:17] if you know a fix for this that'd be awesome :D [14:51:34] <_joe_> ori: you always have the evilest solutions [14:51:58] RECOVERY - very high load average likely xfs on ms-be2060 is OK: OK - load average: 65.28, 77.54, 76.96 https://wikitech.wikimedia.org/wiki/Swift [14:52:01] heh [14:52:05] I don't love it either [14:52:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27954 and previous config saved to /var/cache/conftool/dbconfig/20220518-145208-ladsgroup.json [14:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:30] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (10Mazevedo) I believe I need this access as well > > Just checking, @Mazevedo, do you need analytics-privatedata-users access? I'd expect so if you are expecting to be... 
[14:53:53] (03PS3) 10Ori: service::docker: refresh the service when its image is updated [puppet] - 10https://gerrit.wikimedia.org/r/792749 (https://phabricator.wikimedia.org/T308598) [14:54:02] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:24] (03PS5) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [14:56:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T303603)', diff saved to https://phabricator.wikimedia.org/P27955 and previous config saved to /var/cache/conftool/dbconfig/20220518-145603-ladsgroup.json [14:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:56:13] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [14:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:33] Amir1: for my own clarity were you talking about a fix for T308663 or a fix for the explosion of log messages on rollback? [14:56:34] T308663: LogicException: This ParserOutput contains no text! - https://phabricator.wikimedia.org/T308663 [14:56:46] thcipriani: the former [14:56:58] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:14] oh! excellent! 
thanks for that clarification [15:00:18] (03PS3) 10Hnowlan: Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) [15:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:39] (03CR) 10jerkins-bot: [V: 04-1] Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:02:58] amir1: I see you have a fix patch for T308663 already -> https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CommonsMetadata/+/793073/ [15:02:59] T308663: LogicException: This ParserOutput contains no text! - https://phabricator.wikimedia.org/T308663 [15:03:00] thanks for that [15:03:04] holding the rollback [15:03:16] yeah, let's see if it fixes the issue [15:03:50] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [15:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:31] !log rolling upgrade to HAProxy 2.4.17 in eqiad - T307444 [15:04:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1006.eqiad.wmnet [15:04:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "While this is enterprise-grade puppet evil, I don't think there's a solution here that's simpler and as effective." 
[puppet] - 10https://gerrit.wikimedia.org/r/792749 (https://phabricator.wikimedia.org/T308598) (owner: 10Ori) [15:04:34] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:36] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [15:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] (03PS1) 10Jbond: C:librenms: move phpdump to the template [puppet] - 10https://gerrit.wikimedia.org/r/793075 (https://phabricator.wikimedia.org/T308639) [15:06:53] (03PS13) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [15:07:10] (03CR) 10Filippo Giunchedi: netops: ping core routers from Prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [15:07:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27956 and previous config saved to /var/cache/conftool/dbconfig/20220518-150714-ladsgroup.json [15:07:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance [15:07:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance [15:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:20] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [15:07:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298555)', diff saved 
to https://phabricator.wikimedia.org/P27957 and previous config saved to /var/cache/conftool/dbconfig/20220518-150722-ladsgroup.json [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793046 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [15:10:04] (03PS2) 10Jbond: C:librenms: move phpdump to the template [puppet] - 10https://gerrit.wikimedia.org/r/793075 (https://phabricator.wikimedia.org/T308639) [15:10:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1006.eqiad.wmnet [15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35373/console" [puppet] - 10https://gerrit.wikimedia.org/r/793075 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [15:11:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:librenms: move phpdump to the template [puppet] - 10https://gerrit.wikimedia.org/r/793075 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [15:12:36] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [15:14:32] (03PS1) 10Jforrester: Return early if the ParserOutput doesn't have any text [extensions/CommonsMetadata] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792659 (https://phabricator.wikimedia.org/T308663) [15:15:38] (03CR) 10Ori: [C: 03+2] service::docker: refresh the service when its image is updated [puppet] - 10https://gerrit.wikimedia.org/r/792749 (https://phabricator.wikimedia.org/T308598) (owner: 10Ori) [15:15:43] !log 
mforns@deploy1002 Started deploy [airflow-dags/analytics@3072d55]: (no justification provided) [15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:51] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3072d55]: (no justification provided) (duration: 00m 07s) [15:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:21] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10hashar) 05Open→03Resolved a:03jbond With https://gerrit.wikimedia.org/r/791565 deployed , the CI servers have... [15:17:10] 10SRE, 10SRE-Access-Requests, 10Phabricator, 10Release-Engineering-Team: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (10hashar) Confirmed. Thank you very much @Marostegui [15:17:19] (03PS2) 10Marostegui: data.yaml: Add Marina Azevedo to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) [15:17:34] (03CR) 10Filippo Giunchedi: "I'll need to test this in Pontoon (hence no vote yet) but LGTM overall." 
[puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [15:19:24] (03CR) 10Filippo Giunchedi: [C: 03+1] ddbackups: Remove old references to the check pass on the alert hosts [labs/private] - 10https://gerrit.wikimedia.org/r/793042 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [15:19:41] (03PS2) 10Stang: zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) [15:21:35] (03CR) 10Filippo Giunchedi: alerting_host: Remove references to dbbackups monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [15:21:50] (03PS1) 10Cathal Mooney: Add template for custom routing-instances on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) [15:23:45] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:54] (03PS1) 10Majavah: wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) [15:24:39] (03CR) 10Ladsgroup: [C: 03+2] Return early if the ParserOutput doesn't have any text [extensions/CommonsMetadata] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792659 (https://phabricator.wikimedia.org/T308663) (owner: 10Jforrester) [15:24:45] jouncebot: nowandnext [15:24:45] No deployments scheduled for the next 2 hour(s) and 35 minute(s) [15:24:45] In 2 hour(s) and 35 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T1800) [15:24:49] cooooool [15:24:50] (03CR) 10jerkins-bot: [V: 04-1] wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [15:25:09] thcipriani jnuche: 
FYI, I'm backporting the change [15:25:23] 👍 [15:25:51] (03PS2) 10Majavah: wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) [15:26:39] (03CR) 10jerkins-bot: [V: 04-1] wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [15:27:19] (03Merged) 10jenkins-bot: Return early if the ParserOutput doesn't have any text [extensions/CommonsMetadata] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792659 (https://phabricator.wikimedia.org/T308663) (owner: 10Jforrester) [15:29:46] (03PS3) 10Majavah: wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) [15:30:17] I am going to test database backups checks in production to make sure they keep working as expected, expect icinga IRC complains soon [15:30:34] (03CR) 10jerkins-bot: [V: 04-1] wmflib: port ipresolve to the new api [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [15:31:40] James_F: That extension is fun, isn't it? :D [15:32:02] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/CommonsMetadata/src: Backport: [[gerrit:792659|Return early if the ParserOutput doesn't have any text (T308663)]] (duration: 00m 52s) [15:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:09] T308663: LogicException: This ParserOutput contains no text! - https://phabricator.wikimedia.org/T308663 [15:34:42] Amir1: So, so fun. 
[15:35:31] PROBLEM - dump of s1 in codfw on alert1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than 8 days ago: Most recent backup 2022-05-10 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:35:31] PROBLEM - dump of s1 in codfw on backupmon1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than a week ago: Most recent backup 2022-05-10 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:35:46] ok, the alert looks fine [15:35:55] recovering now before doing a different test [15:36:10] !log promoted user:Ladsgroup to admin of testcommonswiki [15:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:13] (03CR) 10Ayounsi: [C: 03+1] "+1 if PCC is happy ;)" [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [15:38:49] 10SRE, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality-Constraints, and 3 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10ItamarWMDE) [15:39:32] (03CR) 10Ahmon Dancy: mediawiki::php: check opcache revalidation in restart script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792982 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:39:39] 10SRE, 10Traffic, 10Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) 05Open→03Resolved `(95) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5016].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4... 
[15:39:49] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10BTullis) [15:40:15] RECOVERY - dump of s1 in codfw on alert1001 is OK: Last dump for s1 at codfw (db2141) taken on 2022-05-17 00:00:01 (170 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:40:15] RECOVERY - dump of s1 in codfw on backupmon1001 is OK: Last dump for s1 at codfw (db2141) taken on 2022-05-17 00:00:01 (170 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:40:22] amir1: thanks again! [15:40:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:40:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:00] one last test and I am done [15:41:43] ^^ [15:42:28] (03PS2) 10David Caro: varnish: move to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) [15:42:35] (03CR) 10David Caro: varnish: move to nrpe::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [15:44:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:44:01] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10Majavah) Those are not invalid values, those are just people whose usernames contain non-ASCII characters. Our existing stack fully supports them, and I'd argue that any software that does not like... 
[15:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:46] (03CR) 10David Caro: [C: 03+2] varnish: move to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792991 (https://phabricator.wikimedia.org/T308601) (owner: 10David Caro) [15:46:18] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10Majavah) > I'm happy to fix it myself if it helps, but thought it might be best to simply create a ticket and tag it with LDAP and SRE to begin with. You're making it sound very simple :-) did you... [15:50:45] (03PS1) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [15:53:23] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10jcrespo) There is no bug- our openldap setup is fully unicode complient (see, for example, https://ldap.toolforge.org/user/tgr) . What you are seeing is ldapsearch output's ldifs, which by RFC ( ht... [15:54:49] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10Tgr) >>! In T308682#7938944, @Majavah wrote: > Those are not invalid values, those are just people whose usernames contain non-ASCII characters. Our existing stack fully supports them, and I'd argu... 
[15:56:08] (03PS4) 10Hnowlan: Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) [15:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298555)', diff saved to https://phabricator.wikimedia.org/P27959 and previous config saved to /var/cache/conftool/dbconfig/20220518-155733-ladsgroup.json [15:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:39] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [16:00:26] (03PS1) 10Andrew Bogott: Provide wmcs-roots access to ceph nodes [puppet] - 10https://gerrit.wikimedia.org/r/793086 [16:00:54] (03PS2) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:00:56] (03CR) 10Dzahn: "I can confirm this user exists, has wmf email address, the UID and email atches what is in LDAP, that all looks good. What I can't say is " [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) (owner: 10Marostegui) [16:01:17] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10BTullis) OK, thanks all. That makes perfect sense. I'll report this as an upstream bug in DataHub and decline this ticket. 
[16:01:35] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10BTullis) 05Open→03Declined [16:02:28] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793086 (owner: 10Andrew Bogott) [16:03:34] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [16:03:55] 10SRE, 10LDAP: Cleanup two LDAP users with invalid `cn` attributes - https://phabricator.wikimedia.org/T308682 (10BTullis) [16:04:15] (03CR) 10Marostegui: data.yaml: Add Marina Azevedo to ldap users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) (owner: 10Marostegui) [16:06:18] (03CR) 10Dzahn: "yea, both of them say they _suspect_ it's needed .." [puppet] - 10https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) (owner: 10Marostegui) [16:06:58] (03CR) 10Razzi: [C: 03+2] turnilo: add monitoring for node application [puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [16:08:51] (03PS3) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:10:19] (03Abandoned) 10Razzi: sre.zookeeper.reboot_workers: add cookbook to reboot zookeeper cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) (owner: 10Razzi) [16:10:31] (03CR) 10Razzi: [C: 03+2] turnilo: change an-tool1011 to use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792733 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [16:10:36] (03PS2) 10Razzi: turnilo: change an-tool1011 to use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792733 (https://phabricator.wikimedia.org/T308597) [16:11:29] (03CR) 10Jelto: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/724472 
(https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [16:12:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27960 and previous config saved to /var/cache/conftool/dbconfig/20220518-161238-ladsgroup.json [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:47] (03PS4) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:14:36] PROBLEM - dump of s1 in codfw on backupmon1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than a week ago: Most recent backup 2022-05-10 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:14:38] PROBLEM - dump of s1 in codfw on alert1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than 8 days ago: Most recent backup 2022-05-10 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:17:18] (03CR) 10Ayounsi: [C: 03+1] "some inline comments but +1 overall" [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [16:17:22] (03PS1) 10BCornwall: pws: simple grammar fix [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/793089 [16:18:06] (03CR) 10Dzahn: [C: 03+1] "yea, I noticed these were missing. looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [16:18:16] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:40] (03PS5) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:19:22] (03CR) 10Dzahn: [C: 03+1] gitlab: fix gitlab-ce apt component on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/793046 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [16:20:20] (03PS6) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:21:25] (03CR) 10Dzahn: "Should we wait just a little bit until the "SLO budget" resets?" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [16:21:42] (03PS7) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:21:54] (03CR) 10Dzahn: "(I don't know about switching certs to pki yet)" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [16:22:39] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-tool1011.eqiad.wmnet with reason: Setting up turnilo for the first time, there will be errors [16:22:41] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-tool1011.eqiad.wmnet with reason: Setting up turnilo for the first time, there will be errors [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:38] (03CR) 10Dzahn: "I don't have much to add here, if Herron and Jesse like it that's what should count." 
[alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:25:08] (03PS3) 10Dzahn: doc: add monitoring of doc.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/791684 [16:27:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27961 and previous config saved to /var/cache/conftool/dbconfig/20220518-162743-ladsgroup.json [16:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:06] (03CR) 10Dzahn: "small nitpick inline. is there a ticket to link?" [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [16:33:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35378/" [puppet] - 10https://gerrit.wikimedia.org/r/791684 (owner: 10Dzahn) [16:36:59] (03Abandoned) 10Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) (owner: 10Andrew Bogott) [16:37:28] (03CR) 10Hashar: "I don't think there is any need for it. 
doc.wikimedia.org is behind the ATS/Varnish which hold the certificate and it seems to be using th" [puppet] - 10https://gerrit.wikimedia.org/r/791684 (owner: 10Dzahn) [16:37:58] mutante: doc.wikimedia.org seems to use the Digicert wildcard cert, so I don't think there is any need to verify its expiry explicitly [16:38:05] ok, fixing the s1 dump, after the following recovery I will stop all testing- any subsequent alert would be a real one [16:38:52] hashar: there's a separate internal certificate used for the cache-->doc1001 internal flow [16:39:03] RECOVERY - dump of s1 in codfw on alert1001 is OK: Last dump for s1 at codfw (db2141) taken on 2022-05-17 00:00:01 (170 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:39:03] RECOVERY - dump of s1 in codfw on backupmon1001 is OK: Last dump for s1 at codfw (db2141) taken on 2022-05-17 00:00:01 (170 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:39:25] (03CR) 10Dzahn: [C: 03+2] "You are getting the Digicert cert because you are in Europe. Over here in the US I get a Letsencrypt cert." [puppet] - 10https://gerrit.wikimedia.org/r/791684 (owner: 10Dzahn) [16:39:36] hashar: we get different certs based on location :p [16:39:53] mutante: I think the flow is {public} --- https ---> {ats/varnish/misc cache} --- envoy ---> (clear) Apache on doc [16:40:09] this is an ongoing issue. if you monitor from eqiad you get Letsencrypt [16:40:14] * jynus finished production alert backups testing [16:40:16] and need different warning thresholds [16:40:58] mutante: you're not monitoring that, you're monitoring the internal certificate used for the cpxxxx-->doc1001 hop [16:41:14] doc.discovery.wmnet is an alias for doc1001.eqiad.wmnet. 
[16:41:49] well, we have been through this a couple times with the monitoring for the phab.wmfusercontent and that was the only one that checked the global cert [16:42:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298555)', diff saved to https://phabricator.wikimedia.org/P27962 and previous config saved to /var/cache/conftool/dbconfig/20220518-164248-ladsgroup.json [16:42:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:42:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:54] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [16:42:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27963 and previous config saved to /var/cache/conftool/dbconfig/20220518-164256-ladsgroup.json [16:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:13] (03PS1) 10Dzahn: Revert "doc: add monitoring of doc.wikimedia.org certificate" [puppet] - 10https://gerrit.wikimedia.org/r/792661 [16:43:50] will leave it to a general discussion between observability and traffic because we have been going a bit in circles when it comes to cert monitoring. 
[16:44:04] or maybe they are supposed to move to pki as well [16:44:39] (03CR) 10Dzahn: [C: 03+2] Revert "doc: add monitoring of doc.wikimedia.org certificate" [puppet] - 10https://gerrit.wikimedia.org/r/792661 (owner: 10Dzahn) [16:45:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "doc: add monitoring of doc.wikimedia.org certificate" [puppet] - 10https://gerrit.wikimedia.org/r/792661 (owner: 10Dzahn) [16:46:32] taavi: yes, I was going to monitor the cert generated by cergen that I made for doc.discovery.wmnet [16:46:56] hashar: yea, this is about ATS -> envoy and it has all the names on it [16:46:59] DNS:doc.discovery.wmnet, DNS:doc.wikimedia.org, DNS:doc1001.eqiad.wmnet, DNS:doc2001.codfw.wmnet [16:47:06] but I'm reverting regardless [16:47:12] yes [16:47:22] so I don't see how the edge digicert/le split is problematic here [16:47:50] I was merely responding to "it's using Digicert" [16:47:52] mutante: ah so it might be better to add that to ::profile::tlsproxy::envoy maybe [16:48:20] mutante: well at least the role::doc include that profile and I am guessing that is where the cert monitoring can be made [16:48:30] the external cert is also not monitored afaict though [16:48:41] but you can't really because of that split [16:48:50] and because you need a different check command for LE vs not-LE [16:49:30] maybe yea, but I wanted to add it for one service at a time and not add all the checks at once [16:49:44] because that starts a much bigger discussion [16:50:07] doesn't the text-lb.$SITE https check cover those certs? 
so you don't need to monitor them separately [16:50:15] you just need to monitor the internal cert and that is enough [16:50:19] (03PS8) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [16:50:28] no, I don't think so [16:50:50] because when it was close to expiry in the past we got the alert just from phab.wmfusercontent.org and some other random site I had added this for [16:51:06] then I removed one of them again because it was duplicate [16:53:34] (03PS2) 10Dzahn: Revert "doc: add monitoring of doc.wikimedia.org certificate" [puppet] - 10https://gerrit.wikimedia.org/r/792661 [16:54:21] I don't follow [16:54:38] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [16:54:44] if you add a check for an internal endpoint (like your patch did), there's no way any of the edge wildcard certs are getting involved [16:55:18] but if (and only if) you monitor the external endpoints, then you need to take the letsencrypt expiry weirdness into account [16:55:24] I am saying both are not monitored right now [16:55:32] 2 different issues [16:55:55] are you saying that we don't monitor the main certificate that's serving en.wikipedia.org and others at all? [16:55:56] (03CR) 10Hokwelum: Add Clarkson university host to list of dumps mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [16:57:16] and even if that's true, why does that block us from monitoring the internal certificate? 
[16:57:58] we are monitoring it as kind of an accidental side effect of checking TLS works on phab.wmfusercontent.org but that's it [16:58:01] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=phab.wmfusercontent.org [16:58:18] this will alert if it comes close to expiry [16:58:32] but people will be confused why it's phab.wmfusercontent [16:59:45] what is this per cache host check monitoring then? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1089&service=HAProxy+HTTPS+wikipedia.org+RSA [16:59:52] I am not saying that blocks monitoring the internal certificate. What blocks monitoring the internal certificate is just the request to add it for ALL of the internal certs at once. [16:59:58] (03PS9) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [17:01:04] (03PS10) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [17:01:53] I don't know, maybe it was added since the last time we had the alert [17:02:19] (03CR) 10Hokwelum: Add Clarkson university host to list of dumps mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [17:02:34] or it's the Digicert vs Letsencrypt part [17:03:51] (03PS1) 10Jcrespo: alert_host: Ensure packages and files from dbbackups check are gone [puppet] - 10https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) [17:07:09] (03CR) 10Dzahn: [C: 03+1] "looks good to me (fwiw, IPv6 is missing reverse record in DNS), Ariel needs to approve though" [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [17:09:03] (03CR) 10Jcrespo: "Waiting for your ok (not in a hurry):" [puppet] - 10https://gerrit.wikimedia.org/r/793094 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [17:13:23] (03Abandoned) 10Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - 
10https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601) (owner: 10Andrew Bogott) [17:13:53] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Observability-Alerting, and 5 others: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601 (10Andrew) 05In progress→03Resolved [17:18:21] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:20:54] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack: place VM hostnames in lables['test_hostname'] [puppet] - 10https://gerrit.wikimedia.org/r/792223 (owner: 10Andrew Bogott) [17:21:02] (03PS2) 10Andrew Bogott: nova_fullstack: place VM hostnames in lables['test_hostname'] [puppet] - 10https://gerrit.wikimedia.org/r/792223 [17:21:22] (03PS1) 10Majavah: mariadb: convert to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793096 (https://phabricator.wikimedia.org/T308601) [17:22:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35381/console" [puppet] - 10https://gerrit.wikimedia.org/r/793096 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [17:25:13] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:26:17] (03PS1) 10Stang: zhwiki: Comment amendment for restricting "flow-hide" to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793098 (https://phabricator.wikimedia.org/T264489) [17:28:32] (03PS1) 10Majavah: monitoring: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) [17:30:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35382/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [17:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27964 and previous config saved to /var/cache/conftool/dbconfig/20220518-173139-ladsgroup.json [17:31:42] (03PS2) 10Majavah: monitoring: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [17:33:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35383/console" [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [17:37:45] (03CR) 10Razzi: [V: 03+1 C: 03+2] turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [17:37:50] (03PS5) 10Razzi: turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) [17:40:09] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH [17:40:26] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@ad59116]: (no justification provided) [17:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:34] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@ad59116]: (no justification provided) (duration: 00m 07s) [17:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:51] (03PS1) 10Majavah: 
raid: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793102 (https://phabricator.wikimedia.org/T308601) [17:44:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35384/console" [puppet] - 10https://gerrit.wikimedia.org/r/793102 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [17:45:09] (03Abandoned) 10Andrew Bogott: nova_fullstack_test: abuse the cloud.instance.name field to hold the test VM [puppet] - 10https://gerrit.wikimedia.org/r/791668 (owner: 10Andrew Bogott) [17:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27965 and previous config saved to /var/cache/conftool/dbconfig/20220518-174644-ladsgroup.json [17:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:56:13] (03PS1) 10Dzahn: use port 8080 instead of 80 in virtual host config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/793104 [17:56:56] (03PS2) 10Dzahn: use port 8080 instead of 80 in virtual host config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/793104 [17:57:02] (03CR) 10Dzahn: [C: 03+2] use port 8080 instead of 80 in virtual host config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/793104 (owner: 10Dzahn) [17:58:26] (03CR) 10BryanDavis: [C: 03+1] "not sure what this fixes, but it seems fine. Debian's /etc/profile sources /etc/profile.d/*.sh for Bourne and Bourne compatible shells (sh" [puppet] - 10https://gerrit.wikimedia.org/r/792694 (owner: 10Zabe) [18:00:04] jnuche and hashar: #bothumor My software never has bugs. 
It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T1800). [18:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27966 and previous config saved to /var/cache/conftool/dbconfig/20220518-180149-ladsgroup.json [18:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:40] (03Restored) 10Ssingh: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [18:02:51] mutante: ^ it's back :P [18:05:21] (03Merged) 10jenkins-bot: use port 8080 instead of 80 in virtual host config line [container/miscweb] - 10https://gerrit.wikimedia.org/r/793104 (owner: 10Dzahn) [18:05:23] (03CR) 10Dzahn: [C: 03+1] "whenever you like. I was hoping you just restore it whenever you want :)" [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [18:05:49] sukhe: :) anytime [18:08:54] (03CR) 10Jbond: "thanks ill comment on the other issues when resolved but all sgtm" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [18:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298555)', diff saved to https://phabricator.wikimedia.org/P27967 and previous config saved to /var/cache/conftool/dbconfig/20220518-181654-ladsgroup.json [18:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:16:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:01] T298555: Fix mismatching field type of 
logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [18:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:34] (03CR) 10Jbond: "Sorry if this is still WIP, someone mentioned it to me and for some reason assumed it was an old CR" [puppet] - 10https://gerrit.wikimedia.org/r/793081 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [18:48:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:50:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48109 bytes in 4.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:51:37] (03PS2) 10Cathal Mooney: Add template for custom routing-instances on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) [18:51:55] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10RobH) a:05RobH→03MoritzMuehlenhoff ganeti5002 firmware updates completed: nic 21.85.21.92, bios 2.14.2, idrac 5.10.10.00. system booted back into OS and is online for reimage later. can r... [18:53:15] (03CR) 10Cathal Mooney: "Thanks for the review. 
On reflection the RD stuff was overkill, the primary IP will always be unique and the VRF ID keeps them different " [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [18:53:23] (03CR) 10Cathal Mooney: [C: 03+2] Add template for custom routing-instances on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [18:54:52] (03Merged) 10jenkins-bot: Add template for custom routing-instances on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793079 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [19:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:09:59] db1139 says it's not replicating, it's not me [19:10:35] ah it's not pooled [19:10:43] backup source [19:11:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1163.eqiad.wmnet with reason: Maint [19:11:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1163.eqiad.wmnet with reason: Maint [19:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] (03PS1) 10Jbond: nitcracker: remove :nutcracker_pools function as its unused [puppet] - 10https://gerrit.wikimedia.org/r/793110 (https://phabricator.wikimedia.org/T308639) [19:23:34] !log capturing debug logs on mx2001.wikimedia.org [19:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:24] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: exim debug log 
capture [19:24:25] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: exim debug log capture [19:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:54] (03PS1) 10Jbond: P:profile::redis::multidc: drop legacy function redis_get_instances [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) [19:27:59] (03PS2) 10Jbond: P:profile::redis::multidc: drop legacy function redis_get_instances [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) [19:28:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35387/console" [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:30:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:35] (03CR) 10Hashar: "recheck to trigger debian-glue after https://gerrit.wikimedia.org/r/c/integration/config/+/793028" [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [19:34:24] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] (03PS1) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [19:44:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:45:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [19:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T303603)', diff saved to https://phabricator.wikimedia.org/P27969 and previous config saved to /var/cache/conftool/dbconfig/20220518-194504-ladsgroup.json [19:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:10] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [19:46:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:46:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298555)', diff saved to https://phabricator.wikimedia.org/P27970 and previous config saved to /var/cache/conftool/dbconfig/20220518-194701-ladsgroup.json [19:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:08] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [19:48:00] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RKemper) [19:48:28] (03PS3) 10Jbond: P:profile::redis::multidc: drop legacy function redis_get_instances [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) [19:48:30] (03PS2) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 
10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [19:48:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T303603)', diff saved to https://phabricator.wikimedia.org/P27971 and previous config saved to /var/cache/conftool/dbconfig/20220518-194857-ladsgroup.json [19:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35389/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:50:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35390/console" [puppet] - 10https://gerrit.wikimedia.org/r/793111 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:53:27] (03PS3) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [19:53:31] (03CR) 10Hashar: "Debian glue success! 
:)" [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [19:54:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35391/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:58:43] (03PS2) 10Cathal Mooney: Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 [19:58:50] (03CR) 10jerkins-bot: [V: 04-1] Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 (owner: 10Cathal Mooney) [19:59:13] (03PS4) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [20:00:05] RoanKattouw, Urbanecm, and cjming: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T2000). [20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:31] here [20:00:43] hi koi: i can deploy [20:01:00] (03CR) 10Clare Ming: [C: 03+2] zhwiki: Comment amendment for restricting "flow-hide" to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793098 (https://phabricator.wikimedia.org/T264489) (owner: 10Stang) [20:02:22] (03Merged) 10jenkins-bot: zhwiki: Comment amendment for restricting "flow-hide" to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793098 (https://phabricator.wikimedia.org/T264489) (owner: 10Stang) [20:03:19] (03PS3) 10Cathal Mooney: Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 [20:03:27] koi: syncing your 1st patch since it's just a comment change [20:04:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27972 and previous config saved to /var/cache/conftool/dbconfig/20220518-200402-ladsgroup.json [20:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:16] (03PS4) 10Clare Ming: zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:04:23] (03PS5) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [20:04:36] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793098|zhwiki: Comment amendment for restricting "flow-hide" to autoconfirmed (T264489)]] (duration: 00m 52s) [20:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:04] (03CR) 10Clare Ming: [C: 03+2] zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:06:41] (03PS6) 10Jbond: C:redis::multidc::ipsec: migrate 
legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [20:06:59] (03Merged) 10jenkins-bot: zhwiktionary: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793033 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35394/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [20:07:53] koi: can you verify 2nd patch on mwdebug1001? [20:07:58] looking [20:08:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:38] LGTM! 
[20:08:44] cool - syncing [20:10:33] !log cjming@deploy1002 Synchronized static/images/project-logos/zhwiktionary-1.5x.png: Config: [[gerrit:793033|zhwiktionary: Declare commons files for logo (T308620)]] (duration: 00m 52s) [20:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:38] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:11:31] !log cjming@deploy1002 Synchronized static/images/project-logos/zhwiktionary-2x.png: Config: [[gerrit:793033|zhwiktionary: Declare commons files for logo (T308620)]] (duration: 00m 52s) [20:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:26] !log cjming@deploy1002 Synchronized static/images/project-logos/zhwiktionary.png: Config: [[gerrit:793033|zhwiktionary: Declare commons files for logo (T308620)]] (duration: 00m 52s) [20:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:24] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:793033|zhwiktionary: Declare commons files for logo (T308620)]] (duration: 00m 51s) [20:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:28] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:793033|zhwiktionary: Declare commons files for logo (T308620)]] (duration: 00m 51s) [20:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298555)', diff saved to https://phabricator.wikimedia.org/P27973 and previous config saved to /var/cache/conftool/dbconfig/20220518-201454-ladsgroup.json [20:14:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:59] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [20:15:29] ok koi: your changes should be live - i purged the logos [20:16:10] also works from my side :) [20:16:24] yay! [20:16:52] i'll hang for a few mins -- will close B&C window here shortly [20:17:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27974 and previous config saved to /var/cache/conftool/dbconfig/20220518-201907-ladsgroup.json [20:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:18] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [20:20:44] (03CR) 10Cwhite: [C: 03+1] delete expired certs etcd.eqiad.wmnet.crt and etcd.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [20:20:50] !log end of UTC late backport window [20:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:24:40] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/mwdebug: apply [20:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27975 and previous config saved to /var/cache/conftool/dbconfig/20220518-202959-ladsgroup.json [20:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:18] (03PS4) 10Cathal Mooney: Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 [20:30:26] (03CR) 10jerkins-bot: [V: 04-1] Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 (owner: 10Cathal Mooney) [20:30:59] (03Abandoned) 10Cathal Mooney: Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 (owner: 10Cathal Mooney) [20:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T303603)', diff saved to https://phabricator.wikimedia.org/P27976 and previous config saved to /var/cache/conftool/dbconfig/20220518-203412-ladsgroup.json [20:34:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [20:34:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [20:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:18] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [20:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T303603)', diff saved to https://phabricator.wikimedia.org/P27977 and previous config saved to /var/cache/conftool/dbconfig/20220518-203420-ladsgroup.json [20:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:28] (03PS1) 10Cathal Mooney: Add policer config to switches [homer/public] - 10https://gerrit.wikimedia.org/r/793116 [20:37:40] (03CR) 10Cathal Mooney: "Self-merging as ayounsi gave a +1 in CR I545955fda6d391180cab50a749370440741bc5e4 and git has thwarted me." [homer/public] - 10https://gerrit.wikimedia.org/r/793116 (owner: 10Cathal Mooney) [20:38:22] (03CR) 10Cathal Mooney: [C: 03+2] Add policer config to switches [homer/public] - 10https://gerrit.wikimedia.org/r/793116 (owner: 10Cathal Mooney) [20:39:27] (03Merged) 10jenkins-bot: Add policer config to switches [homer/public] - 10https://gerrit.wikimedia.org/r/793116 (owner: 10Cathal Mooney) [20:40:48] (03PS2) 10Krinkle: Remove ElementTiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:41:23] (03PS3) 10Krinkle: Remove wgElementTiming setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:41:45] (03CR) 10Krinkle: [C: 03+2] Remove wgElementTiming setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:43:14] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Dzahn) >>! In T306654#7873432, @Volans wrote: > As for the `puppet-merge` on the puppetmasters, does the `datacenter-ops` have +2 on the `operations/p... 
[20:43:33] (03PS7) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [20:44:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T303603)', diff saved to https://phabricator.wikimedia.org/P27978 and previous config saved to /var/cache/conftool/dbconfig/20220518-204403-ladsgroup.json [20:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:10] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [20:44:25] (03CR) 10jerkins-bot: [V: 04-1] C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [20:45:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27979 and previous config saved to /var/cache/conftool/dbconfig/20220518-204504-ladsgroup.json [20:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:47] (03PS8) 10Jbond: C:redis::multidc::ipsec: migrate legacy redis_shard_hosts to puppet code [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) [20:46:08] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Dzahn) >>! In T306654#7872422, @MoritzMuehlenhoff wrote: > we already have an existing group "datacenter-ops" .. Yes, please use that group. Back whe... 
[20:46:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35396/console" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [20:47:31] (03CR) 10Jbond: [V: 03+1 C: 04-1] "not working" [puppet] - 10https://gerrit.wikimedia.org/r/793113 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [20:48:29] (03PS1) 10Cathal Mooney: Adding class-of-service config template for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793117 [20:55:03] (03PS4) 10Krinkle: Remove wgElementTiming setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:55:10] (03CR) 10Krinkle: [C: 03+2] Remove wgElementTiming setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:56:03] (03Merged) 10jenkins-bot: Remove wgElementTiming setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793029 (https://phabricator.wikimedia.org/T308621) (owner: 10Mainframe98) [20:59:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27980 and previous config saved to /var/cache/conftool/dbconfig/20220518-205908-ladsgroup.json [20:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298555)', diff saved to https://phabricator.wikimedia.org/P27981 and previous config saved to /var/cache/conftool/dbconfig/20220518-210009-ladsgroup.json [21:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:00:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:16] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [21:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298555)', diff saved to https://phabricator.wikimedia.org/P27982 and previous config saved to /var/cache/conftool/dbconfig/20220518-210017-ladsgroup.json [21:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:01:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:18] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:44] (03CR) 10Cathal Mooney: [C: 03+2] Adding class-of-service config template for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793117 
(owner: 10Cathal Mooney) [21:07:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:07:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:25] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Adding class-of-service config template for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/793117 (owner: 10Cathal Mooney) [21:11:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:39] (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: apply a hack to prevent repeated backup failures [puppet] - 10https://gerrit.wikimedia.org/r/790023 (owner: 10Andrew Bogott) [21:12:04] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27983 and previous config saved to /var/cache/conftool/dbconfig/20220518-211413-ladsgroup.json [21:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:12] (03PS6) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [21:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298555)', diff saved to https://phabricator.wikimedia.org/P27984 and previous config saved to /var/cache/conftool/dbconfig/20220518-212815-ladsgroup.json [21:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:28:22] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [21:28:27] (03PS7) 10Stang: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) [21:29:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T303603)', diff saved to https://phabricator.wikimedia.org/P27985 and previous config saved to /var/cache/conftool/dbconfig/20220518-212918-ladsgroup.json [21:29:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:29:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:25] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [21:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P27986 and previous config saved to /var/cache/conftool/dbconfig/20220518-212926-ladsgroup.json [21:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:36] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I0b6171b5452b (duration: 00m 55s) [21:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:01] (03PS1) 10Stang: zhwikiquote: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793119 (https://phabricator.wikimedia.org/T308620) [21:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1170:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P27987 and previous config saved to /var/cache/conftool/dbconfig/20220518-213617-ladsgroup.json [21:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:23] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [21:37:25] (03CR) 10Andrew Bogott: [C: 03+2] openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [21:38:30] (03CR) 10Andrew Bogott: [C: 03+2] "this will be correct eventually, and harmless now." [puppet] - 10https://gerrit.wikimedia.org/r/789116 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [21:40:26] (03PS1) 10Dwisehaupt: Add missing forward entries for frack nat addresses [dns] - 10https://gerrit.wikimedia.org/r/793121 (https://phabricator.wikimedia.org/T308672) [21:43:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27988 and previous config saved to /var/cache/conftool/dbconfig/20220518-214321-ladsgroup.json [21:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:19] (03PS3) 10Andrew Bogott: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [21:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:51:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27989 and previous config saved to 
/var/cache/conftool/dbconfig/20220518-215122-ladsgroup.json [21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27990 and previous config saved to /var/cache/conftool/dbconfig/20220518-215826-ladsgroup.json [21:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:43] (03PS1) 10Andrew Bogott: wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 [22:02:29] (03CR) 10jerkins-bot: [V: 04-1] wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 (owner: 10Andrew Bogott) [22:05:45] (03PS2) 10Andrew Bogott: wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 [22:06:27] (03CR) 10jerkins-bot: [V: 04-1] wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 (owner: 10Andrew Bogott) [22:06:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27991 and previous config saved to /var/cache/conftool/dbconfig/20220518-220627-ladsgroup.json [22:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:48] (03PS3) 10Andrew Bogott: wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 [22:09:06] (03PS1) 10Ladsgroup: parser: Avoid pushing the whole content to ParserObserver debug log [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792665 (https://phabricator.wikimedia.org/T305218) [22:09:18] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-wikitech-grep: update to python3 with 2to3 [puppet] - 10https://gerrit.wikimedia.org/r/793123 (owner: 10Andrew Bogott) [22:09:23] (03CR) 10Ladsgroup: [C: 03+2] parser: Avoid pushing the whole 
content to ParserObserver debug log [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792665 (https://phabricator.wikimedia.org/T305218) (owner: 10Ladsgroup) [22:12:06] (03PS3) 10Stang: zhwikivoyage: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793027 (https://phabricator.wikimedia.org/T308620) [22:12:56] (03PS4) 10Andrew Bogott: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:13:16] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298555)', diff saved to https://phabricator.wikimedia.org/P27992 and previous config saved to /var/cache/conftool/dbconfig/20220518-221331-ladsgroup.json [22:13:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:13:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:13:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:37] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [22:13:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298555)', diff saved to https://phabricator.wikimedia.org/P27993 and previous config saved to /var/cache/conftool/dbconfig/20220518-221344-ladsgroup.json [22:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:23] (03CR) 10jerkins-bot: [V: 04-1] openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:15:11] (03PS1) 10Stang: zhwikivoyage: Generate zh-hant logo variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793125 (https://phabricator.wikimedia.org/T308620) [22:15:28] (03PS5) 10Andrew Bogott: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:16:08] (03CR) 10jerkins-bot: [V: 04-1] openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:17:38] (03PS6) 10Andrew Bogott: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:19:02] (03PS1) 10Bartosz Dziewoński: mw.htmlform: Fix conditional hide/disable for non-OOUI forms [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793146 (https://phabricator.wikimedia.org/T308626) [22:19:51] (03PS7) 10Andrew Bogott: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:21:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1170:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P27994 and previous config saved to /var/cache/conftool/dbconfig/20220518-222132-ladsgroup.json [22:21:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [22:21:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [22:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:21:38] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [22:21:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T303603)', diff saved to https://phabricator.wikimedia.org/P27995 and previous config saved to /var/cache/conftool/dbconfig/20220518-222145-ladsgroup.json [22:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:08] (03PS3) 10Stang: zhwikisource: Declare commons files for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792752 (https://phabricator.wikimedia.org/T308620) [22:23:06] (03CR) 10Andrew Bogott: "@dcaro I can't get my version of Black to produce the same output as what you got in ps3 which 
might defeat the purpose of minimizing futu" [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [22:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T303603)', diff saved to https://phabricator.wikimedia.org/P27996 and previous config saved to /var/cache/conftool/dbconfig/20220518-222433-ladsgroup.json [22:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:21] (03PS1) 10Stang: zhwikisource: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793127 (https://phabricator.wikimedia.org/T308620) [22:26:43] (03Merged) 10jenkins-bot: parser: Avoid pushing the whole content to ParserObserver debug log [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792665 (https://phabricator.wikimedia.org/T305218) (owner: 10Ladsgroup) [22:26:55] Amir1: if you're planning to be deploying wmf.12 patches, can you also do https://gerrit.wikimedia.org/r/c/mediawiki/core/+/793146 for me? 
[22:27:10] MatmaRex: I accept bribe [22:27:19] (03CR) 10Ladsgroup: [C: 03+2] mw.htmlform: Fix conditional hide/disable for non-OOUI forms [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793146 (https://phabricator.wikimedia.org/T308626) (owner: 10Bartosz Dziewoński) [22:27:22] (03PS2) 10Bartosz Dziewoński: mw.htmlform: Fix conditional hide/disable for non-OOUI forms [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793146 (https://phabricator.wikimedia.org/T308626) [22:27:22] heh [22:27:35] Amir1: sorry, do it again, i updated commit message [22:27:38] :D [22:27:48] (03CR) 10Ladsgroup: [C: 03+2] mw.htmlform: Fix conditional hide/disable for non-OOUI forms [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793146 (https://phabricator.wikimedia.org/T308626) (owner: 10Bartosz Dziewoński) [22:27:54] thank you [22:30:03] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/parser/ParserObserver.php: Backport: [[gerrit:792665|parser: Avoid pushing the whole content to ParserObserver debug log (T305218)]] (duration: 00m 52s) [22:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:09] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218 [22:31:26] (03PS2) 10Stang: zhwikiversity: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792985 (https://phabricator.wikimedia.org/T308620) [22:32:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:41] (03PS1) 10Stang: zhwikiversity: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793128 (https://phabricator.wikimedia.org/T308620) [22:33:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:33:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [22:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:28] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) [22:36:43] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) Work was completed on May 4th. [22:39:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27997 and previous config saved to /var/cache/conftool/dbconfig/20220518-223938-ladsgroup.json [22:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:40] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298555)', diff saved to https://phabricator.wikimedia.org/P27998 and previous config saved to /var/cache/conftool/dbconfig/20220518-224141-ladsgroup.json [22:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:46] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [22:43:28] (03Merged) 10jenkins-bot: mw.htmlform: Fix conditional hide/disable for non-OOUI forms [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793146 (https://phabricator.wikimedia.org/T308626) (owner: 10Bartosz Dziewoński) [22:45:58] MatmaRex: deployed [22:46:11] thanks [22:46:17] !log 
ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/resources/src/mediawiki.htmlform/cond-state.js: Backport: [[gerrit:793146|mw.htmlform: Fix conditional hide/disable for non-OOUI forms (T308626)]] (duration: 00m 51s) [22:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:23] T308626: TypeError: Cannot read properties of undefined (reading 'nodeType') - https://phabricator.wikimedia.org/T308626 [22:49:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:50:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27999 and previous config saved to /var/cache/conftool/dbconfig/20220518-225443-ladsgroup.json [22:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28000 and previous config saved to /var/cache/conftool/dbconfig/20220518-225646-ladsgroup.json [22:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:34] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:24] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T303603)', diff saved to https://phabricator.wikimedia.org/P28001 and previous config saved to /var/cache/conftool/dbconfig/20220518-230948-ladsgroup.json [23:09:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [23:09:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [23:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:55] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [23:09:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T303603)', diff saved to https://phabricator.wikimedia.org/P28002 and previous config saved to /var/cache/conftool/dbconfig/20220518-230956-ladsgroup.json [23:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28003 and previous config saved to /var/cache/conftool/dbconfig/20220518-231151-ladsgroup.json 
[23:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T303603)', diff saved to https://phabricator.wikimedia.org/P28004 and previous config saved to /var/cache/conftool/dbconfig/20220518-231244-ladsgroup.json [23:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:58] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: exim debug log capture [23:17:00] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: exim debug log capture [23:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:56] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:26:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298555)', diff saved to https://phabricator.wikimedia.org/P28005 and previous config saved to /var/cache/conftool/dbconfig/20220518-232656-ladsgroup.json [23:26:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:26:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:02] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [23:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28006 and previous config 
saved to /var/cache/conftool/dbconfig/20220518-232704-ladsgroup.json [23:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P28007 and previous config saved to /var/cache/conftool/dbconfig/20220518-232749-ladsgroup.json [23:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:12] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:31:46] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:37:10] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P28008 and previous config saved to /var/cache/conftool/dbconfig/20220518-234254-ladsgroup.json [23:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:17] !log an-db1002 - broken systemd state in Icinga since 48d - systemctl reset-failed [23:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:34] !log deploy2002 - broken systemd state in Icinga since 42d - systemctl reset-failed [23:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:45:00] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:30] !log dumpsdata1002 - broken systemd state in Icinga since 23d - systemctl reset-failed [23:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:28] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:11] !log ms-be1036 - broken systemd state in Icinga since 15d - systemctl reset-failed [23:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:48] !log ms-be1054 - broken systemd state in Icinga since 19d - systemctl reset-failed [23:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:35] !log ms-be1063 - broken systemd state in Icinga since 19d - systemctl reset-failed [23:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:56] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:22] RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:30] !log seaborgium - broken systemd state in Icinga since 23d - systemctl reset-failed [23:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:44] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:22] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:08] !log 
webperf1001/webperf2001 - re-enabling notifications in icinga that were disabled without comment (please don't do this, they keep being forgotten on a regular basis) [23:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:58] !log webperf1001 - systemctl reset-failed [23:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:02] (PS1) Cathal Mooney: Remove hard-coded loopback filter for vrf loopback ints [homer/public] - https://gerrit.wikimedia.org/r/793131 (https://phabricator.wikimedia.org/T304989) [23:57:48] (CR) Cathal Mooney: [C: +2] Remove hard-coded loopback filter for vrf loopback ints [homer/public] - https://gerrit.wikimedia.org/r/793131 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney) [23:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T303603)', diff saved to https://phabricator.wikimedia.org/P28009 and previous config saved to /var/cache/conftool/dbconfig/20220518-235759-ladsgroup.json [23:58:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:58:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:05] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [23:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:25] (Merged) jenkins-bot: Remove hard-coded loopback filter for vrf loopback ints [homer/public] - https://gerrit.wikimedia.org/r/793131 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)