[00:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:53:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:53:37] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:54:51] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [01:00:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:05:19] (03PS5) 10DannyS712: phpcs: move AssignmentInControlStructures exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) [01:09:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:12:23] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [01:22:08] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802841 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [01:22:45] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:25:05] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:29:49] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [01:54:51] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:17:39] (03PS4) 10Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) [03:17:41] (03PS5) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [03:17:43] (03PS4) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [03:19:01] (03PS5) 10Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) [03:19:03] (03PS6) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [03:19:05] (03PS5) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [03:22:39] (03CR) 10Tim Starling: "In PS5 I excluded x2 from the cross-DC master connection logic, reflecting the fact that MW has x2 configured such that all reads go to th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [03:26:39] 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [03:27:31] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:29:49] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[04:02:31] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:11:50] (PS4) DannyS712: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115)
[04:12:27] (PS5) DannyS712: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115)
[04:18:49] (PS6) DannyS712: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115)
[04:19:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:20:02] (CR) DannyS712: "This change is ready for review."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [04:20:41] 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [04:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:26:36] (03CR) 10Tim Starling: [C: 03+2] Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [04:27:23] (03Merged) 10jenkins-bot: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [04:28:48] 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) p:05Triage→03Medium [04:29:06] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) p:05Triage→03Medium [04:31:32] !log tstarling@deploy1002 Synchronized wmf-config/db-production.php: enable SSL for cross-DC master connections (duration: 03m 10s) [04:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:33:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:36:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:04] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [04:40:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:45:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:42] (03PS1) 10Marostegui: db1128: Enanble 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/802945 (https://phabricator.wikimedia.org/T309303) [05:02:15] (03CR) 10Marostegui: [C: 03+2] db1128: Enanble notifications [puppet] - 10https://gerrit.wikimedia.org/r/802945 (https://phabricator.wikimedia.org/T309303) (owner: 10Marostegui) [05:03:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1128 to dbctl T309303', diff saved to https://phabricator.wikimedia.org/P29418 and previous config saved to /var/cache/conftool/dbconfig/20220606-050616-marostegui.json [05:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:21] T309303: Move db1128 from m1 (misc) to s1 (mediawiki) - https://phabricator.wikimedia.org/T309303 [05:07:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1128 on s1 with small weight after DIMM replacement T309303', diff saved to https://phabricator.wikimedia.org/P29419 and previous config saved to /var/cache/conftool/dbconfig/20220606-050707-root.json [05:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:39] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [05:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29420 and previous config saved to /var/cache/conftool/dbconfig/20220606-051205-marostegui.json [05:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:11] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [05:17:55] (03CR) 10Marostegui: [C: 03+1] switchover-tmpl: Add commands for the heartbeat and 
zarcillo (031 comment) [software] - 10https://gerrit.wikimedia.org/r/802778 (owner: 10Ladsgroup) [05:18:00] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl: Add commands for the heartbeat and zarcillo [software] - 10https://gerrit.wikimedia.org/r/802778 (owner: 10Ladsgroup) [05:18:33] (03Merged) 10jenkins-bot: switchover-tmpl: Add commands for the heartbeat and zarcillo [software] - 10https://gerrit.wikimedia.org/r/802778 (owner: 10Ladsgroup) [05:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool db1137 in x1 to with 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29421 and previous config saved to /var/cache/conftool/dbconfig/20220606-052546-marostegui.json [05:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:52] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [06:01:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 2%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29422 and previous config saved to /var/cache/conftool/dbconfig/20220606-060110-root.json [06:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 5%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29423 and previous config saved to /var/cache/conftool/dbconfig/20220606-061614-root.json [06:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29424 and previous config saved to /var/cache/conftool/dbconfig/20220606-063118-root.json [06:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:19] RECOVERY - Memcached on an-tool1005 is OK: TCP OK - 0.001 second response time on 
10.64.36.117 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [06:38:57] !log Migrate pc1014 to mariadb 10.6.8 T309612 [06:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:01] T309612: Migrate an active DC parsercache host to MariaDB 10.6 - https://phabricator.wikimedia.org/T309612 [06:40:04] (03PS1) 10Marostegui: pc1014: Install MariaDB 10.6.8 [puppet] - 10https://gerrit.wikimedia.org/r/803084 (https://phabricator.wikimedia.org/T309612) [06:41:15] (03CR) 10Marostegui: [C: 03+2] pc1014: Install MariaDB 10.6.8 [puppet] - 10https://gerrit.wikimedia.org/r/803084 (https://phabricator.wikimedia.org/T309612) (owner: 10Marostegui) [06:46:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 20%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29425 and previous config saved to /var/cache/conftool/dbconfig/20220606-064622-root.json [06:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:52] (03PS7) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [06:50:30] (03CR) 10Elukey: Add BGP configuration for the new ML staging codfw cluster (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [06:52:04] (03CR) 10Elukey: [C: 03+2] ml-services: add svwiki & trwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802500 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [06:57:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ayounsi) 05Resolved→03Open Feel free to close the task if expected, but the latest diffscan report shows that SSH is open to... 
[07:00:05] Amir1 and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:22] (CR) Ayounsi: Add role::netmon to the netmon1003 instance. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[07:01:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29426 and previous config saved to /var/cache/conftool/dbconfig/20220606-070126-root.json
[07:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 40%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29427 and previous config saved to /var/cache/conftool/dbconfig/20220606-071630-root.json
[07:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:01] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - https://gerrit.wikimedia.org/r/803088 (https://phabricator.wikimedia.org/T309612)
[07:24:33] SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) Sorry they're giving you the runaround, that sounds very annoying :( Thanks for the update!
[07:31:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29428 and previous config saved to /var/cache/conftool/dbconfig/20220606-073134-root.json [07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:01] (03CR) 10Ayounsi: [C: 03+1] "10.64.48.89 is pc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803088 (https://phabricator.wikimedia.org/T309612) (owner: 10Marostegui) [07:35:20] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803088 (https://phabricator.wikimedia.org/T309612) (owner: 10Marostegui) [07:36:10] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803088 (https://phabricator.wikimedia.org/T309612) (owner: 10Marostegui) [07:37:11] (03PS1) 10Marostegui: pc1011,pc1014: Promote pc1014 to pc1 master [puppet] - 10https://gerrit.wikimedia.org/r/803229 (https://phabricator.wikimedia.org/T309612) [07:38:17] (03CR) 10Marostegui: [C: 03+2] pc1011,pc1014: Promote pc1014 to pc1 master [puppet] - 10https://gerrit.wikimedia.org/r/803229 (https://phabricator.wikimedia.org/T309612) (owner: 10Marostegui) [07:39:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:24] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc1 master T309612 (duration: 02m 53s) [07:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:27] T309612: Migrate an active DC parsercache host to MariaDB 10.6 - https://phabricator.wikimedia.org/T309612 [07:44:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:44:10] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 60%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29429 and previous config saved to /var/cache/conftool/dbconfig/20220606-074638-root.json [07:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1048.eqiad.wmnet with OS bullseye [07:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:42] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1048.eqiad.wmnet with OS bullseye [08:00:38] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:01:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29430 and previous config saved to /var/cache/conftool/dbconfig/20220606-080142-root.json [08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1048.eqiad.wmnet with reason: host reimage [08:08:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1048.eqiad.wmnet with reason: host reimage [08:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P29431 and previous config saved to /var/cache/conftool/dbconfig/20220606-081647-root.json [08:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10fgiunchedi) re: instances, a bit of historical context in case it is useful. The main reason Thumbor was deployed that way is because of concurrency limits (i.e. one instance =... [08:20:35] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10fgiunchedi) p:05High→03Medium >>! In T309447#7976123, @MoritzMuehlenhoff wrote: > Severity is unclear to me from just rea... [08:25:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1048.eqiad.wmnet with OS bullseye [08:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:58] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1048.eqiad.wmnet with OS bullseye completed: - ms-be1048 (**PASS**) - Downtim... 
[08:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29432 and previous config saved to /var/cache/conftool/dbconfig/20220606-083153-ladsgroup.json [08:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:57] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [08:31:59] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:33:00] (03PS8) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [08:39:39] !log maintenance: trigger full planet re-import for maps codfw [08:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:52] (03PS1) 10Filippo Giunchedi: Deprecate 'monitoring_setup' service state [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) [08:41:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1049.eqiad.wmnet with OS bullseye [08:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:28] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host 
ms-be1049.eqiad.wmnet with OS bullseye [08:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P29433 and previous config saved to /var/cache/conftool/dbconfig/20220606-084658-ladsgroup.json [08:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:33] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) >>! In T303049#7976511, @JMeybohm wrote: > > Sorry for nudging @BTullis - do you miss any information or need any assistance regarding the remaining s... [08:58:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1049.eqiad.wmnet with reason: host reimage [08:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:23] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35734/console" [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) (owner: 10Filippo Giunchedi) [09:01:18] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1049.eqiad.wmnet with reason: host reimage [09:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:33] (03CR) 10Filippo Giunchedi: [V: 03+1] "+ Janis and Ben re: datasearchhub (nothing functionally will change, JFYI)" [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) (owner: 10Filippo Giunchedi) [09:02:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to 
https://phabricator.wikimedia.org/P29434 and previous config saved to /var/cache/conftool/dbconfig/20220606-090203-ladsgroup.json [09:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:13:29] (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks for the heads-up." [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) (owner: 10Filippo Giunchedi) [09:14:06] (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch: add support for managing opensearch 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802862 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [09:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29435 and previous config saved to /var/cache/conftool/dbconfig/20220606-091709-ladsgroup.json [09:17:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:17:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:14] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1049.eqiad.wmnet with OS bullseye [09:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:24] 
10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1049.eqiad.wmnet with OS bullseye completed: - ms-be1049 (**PASS**) - Downtim... [09:18:39] (03CR) 10Filippo Giunchedi: "LGTM modulo the comments already made" [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [09:20:40] (03PS8) 10MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) [09:23:04] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - 10https://gerrit.wikimedia.org/r/802489 (owner: 10David Caro) [09:25:35] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [09:29:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1050.eqiad.wmnet with OS bullseye [09:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:15] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1050.eqiad.wmnet with OS bullseye [09:34:00] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35735/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [09:34:26] (03CR) 10Filippo Giunchedi: [V: 03+1] "Idea itself LGTM, see inline and what David said" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) 
[09:35:13] (03CR) 10MVernon: sre.swift.convert-ssds: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [09:36:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [09:39:01] (03PS3) 10Volans: sre.swift.convert-ssds: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) [09:42:10] (03CR) 10Volans: [C: 03+2] sre.swift.convert-ssds: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [09:44:39] jouncebot: nowandnext [09:44:39] No deployments scheduled for the next 3 hour(s) and 15 minute(s) [09:44:39] In 3 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1300) [09:45:12] (03CR) 10Filippo Giunchedi: "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:45:13] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10BTullis) 05Open→03Resolved a:03BTullis I have downtimed the MegaRAID service on analytics1068 until 2022-08-30 - Apologies for the oversight @RhinosF1 [09:45:27] (03Merged) 10jenkins-bot: sre.swift.convert-ssds: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/801693 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [09:45:59] (03PS1) 10Urbanecm: Revoke ipinfo-view-log from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803236 (https://phabricator.wikimedia.org/T309411) [09:46:11] (03PS2) 10Urbanecm: Revoke ipinfo-view-log from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803236 (https://phabricator.wikimedia.org/T309411) [09:46:14] (03CR) 10Urbanecm: [C: 03+2] Revoke 
ipinfo-view-log from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803236 (https://phabricator.wikimedia.org/T309411) (owner: 10Urbanecm) [09:47:00] (03Merged) 10jenkins-bot: Revoke ipinfo-view-log from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803236 (https://phabricator.wikimedia.org/T309411) (owner: 10Urbanecm) [09:47:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1050.eqiad.wmnet with reason: host reimage [09:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:24] (03CR) 10Jbond: [C: 03+2] ipmi: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802757 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:48:36] (03CR) 10Jbond: [C: 03+2] webperf: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:49:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1050.eqiad.wmnet with reason: host reimage [09:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] (03PS1) 10Volans: sre.swift.convert-ssds: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/803238 (https://phabricator.wikimedia.org/T309027) [09:51:36] (03CR) 10Volans: [C: 03+2] "Trivial typo, self-merging" [cookbooks] - 
10https://gerrit.wikimedia.org/r/803238 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [09:51:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b35c217163fc621bf68b982580dd68f317b08a55: Revoke ipinfo-view-log from sysop (T309411) (duration: 03m 04s) [09:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:39] (03Merged) 10jenkins-bot: sre.swift.convert-ssds: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/803238 (https://phabricator.wikimedia.org/T309027) (owner: 10Volans) [09:57:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 (owner: 10JMeybohm) [09:58:57] (03CR) 10Volans: [C: 04-1] black format cookbooks/sre/__init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 (owner: 10JMeybohm) [10:00:34] (03CR) 10Volans: [C: 04-1] "The cookbook repository does not currently use black. Applying black to a single file doesn't seem wise to me because it mixes different s" [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 (owner: 10JMeybohm) [10:04:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1050.eqiad.wmnet with OS bullseye [10:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:03] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1050.eqiad.wmnet with OS bullseye completed: - ms-be1050 (**PASS**) - Downtim... 
[10:08:30] (03PS10) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [10:13:04] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:18:07] (03CR) 10Jbond: [C: 04-1] "LGTM couple of minor nits/issues inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [10:29:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1051.eqiad.wmnet with OS bullseye [10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:54] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1051.eqiad.wmnet with OS bullseye [10:31:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [10:35:27] (03PS1) 10Btullis: Use latest image version in all remaining eventgate services [deployment-charts] - 10https://gerrit.wikimedia.org/r/803242 (https://phabricator.wikimedia.org/T306181) [10:41:03] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1051.eqiad.wmnet with reason: host reimage [10:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:13] (03PS1) 10Jbond: remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 [10:44:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:44:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1051.eqiad.wmnet with reason: host reimage [10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:35] (03CR) 10Volans: "I don't have problems adding it. I'm just wondering if it could be confusing and/or incentivate re-implementing things already available v" [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [10:52:07] (03CR) 10CI reject: [V: 04-1] remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [10:58:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1051.eqiad.wmnet with OS bullseye [10:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1051.eqiad.wmnet with OS bullseye completed: - ms-be1051 (**PASS**) - Downtim... [11:04:02] (03CR) 10Jbond: "thanks for the patch lgtm but cls=important dosn't seem to have an affect" [puppet] - 10https://gerrit.wikimedia.org/r/802897 (owner: 10Ladsgroup) [11:05:13] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I'm a bit confused by the state of things now. 1) Has the update to service-runner 3.1.0 be... 
[11:05:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [11:07:49] jouncebot: now [11:07:49] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [11:08:31] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) (owner: 10Filippo Giunchedi) [11:11:10] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: b35c217163fc621bf68b982580dd68f317b08a55: Revoke ipinfo-view-log from sysop (T309411) (duration: 03m 18s) [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:15] * urbanecm done [11:11:36] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) >>! In T306181#7982366, @BTullis wrote: > I'm a bit confused by the state of things now. >... [11:13:27] (03PS1) 10Jbond: CONTRIBUTORS: add additional contributors [puppet] - 10https://gerrit.wikimedia.org/r/803247 [11:15:13] (03CR) 10CI reject: [V: 04-1] CONTRIBUTORS: add additional contributors [puppet] - 10https://gerrit.wikimedia.org/r/803247 (owner: 10Jbond) [11:21:00] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/803253 (owner: 10L10n-bot) [11:22:31] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1052.eqiad.wmnet with OS bullseye [11:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:35] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1052.eqiad.wmnet with OS bullseye [11:25:36] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Great, thanks for the summary @akosiaris - So the reduction in replicas alone explains the s... [11:33:58] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10CDanis) a:03KFrancis [11:38:26] Is there anyone interested in T309974? [11:38:27] T309974: https://codesearch.wmcloud.org/ does not load - https://phabricator.wikimedia.org/T309974 [11:44:25] koi: WFM [11:44:37] Same [11:47:16] back to normal now [11:57:10] (03PS1) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [11:57:12] (03PS1) 10Ayounsi: [WIP] Decom cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 [11:59:57] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [12:01:16] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Cmjohnson) You have successfully submitted request SR1096030919. 
[12:05:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1052.eqiad.wmnet with reason: host reimage [12:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [12:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye [12:06:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye [12:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye execut... 
[12:07:01] (03PS1) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [12:08:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1052.eqiad.wmnet with reason: host reimage [12:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [12:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye [12:11:02] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye [12:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye execut... 
[12:11:15] (03PS1) 10Jforrester: Partial revert "TextHandler::getTextTracksFromRows(): Remove unused code" [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802952 (https://phabricator.wikimedia.org/T309873) [12:11:36] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:08] (03PS11) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [12:21:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jbond) >>! In T308013#7980024, @Dzahn wrote: >> bundle exec rake 'spdx:convert:module[MODULENAME]' > > Is there any way to install the ruby gem "puppet" from a Debian pac... [12:24:30] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:24:58] (03CR) 10Volans: [C: 04-1] "This fails the unit tests with AttributeError: 'PosixPath' object has no attribute 'startswith'" [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:27:21] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Deprecate 'monitoring_setup' service state [puppet] - 10https://gerrit.wikimedia.org/r/803231 (https://phabricator.wikimedia.org/T309774) (owner: 10Filippo Giunchedi) [12:27:55] !log btullis@cumin1001 START - Cookbook 
sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [12:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:00] (03PS2) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [12:30:01] (03PS2) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [12:32:57] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [12:34:04] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:40:48] PROBLEM - Host es2031 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:58] RECOVERY - Host es2031 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [12:42:28] PROBLEM - MariaDB read only es2 on es2031 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:42:35] marostegui: es2031 got rebooted [12:42:42] probably crashed, I'm having a look [12:43:28] PROBLEM - mysqld processes on es2031 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:43:43] cc godog, jayme ^^^ [12:45:23] volans: thank you for the heads up [12:45:43] volans: what assistance would you like ? 
[12:46:28] godog: me personally nothing, I'm not sure if there is any DBA around today though to have a look [12:46:31] it seems to be a hardware failure [12:47:01] ack [12:51:54] 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Volans) p:05Triage→03High [12:52:02] godog: I've created this task ^^^ [12:53:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1052.eqiad.wmnet with OS bullseye [12:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1052.eqiad.wmnet with OS bullseye completed: - ms-be1052 (**PASS**) - Downtim... [12:54:06] ack, not sure if we need any depool action at this time? I see cp2031 in dbconfig-instance in puppet [12:54:48] volans: godog: es2031 https://netbox.wikimedia.org/extras/reports/network.Network/ [12:55:13] papaul: ? [12:55:16] wrong link? [12:55:45] not that one; I have "a bus fatal error was detected on a component at slot 4" on es2031 [12:55:52] yes that was the wrong link sorry [12:56:04] godog: yes I can depool it from dbctl [12:56:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:31] volans: SGTM [12:58:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[12:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:24] !log volans@cumin1001 dbctl commit (dc=all): 'es2031 crashed T309977', diff saved to https://phabricator.wikimedia.org/P29436 and previous config saved to /var/cache/conftool/dbconfig/20220606-125923-volans.json [12:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:27] T309977: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1300). [13:00:04] hauskatze: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:01:19] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:00] (03PS12) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:03:08] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:04:09] volans: thanks I will check. 
was having lunch [13:04:41] marostegui: thanks no prob, doesn't seem crazy urgent [13:04:50] I've put hw logs in the task [13:04:52] yeah it is not uses [13:04:54] used [13:04:55] doesn't seem it was the first time [13:05:04] thanks - I'll follow up [13:05:17] https://sal.toolforge.org/ is down now [13:06:34] and back to normal [13:07:00] ...no, still 500 here [13:08:58] koi: probably better to ask in #wikimedia-cloud-admin or #wikimedia-cloud [13:09:30] happens regularly and yes, you should ask in -cloud for someone to restart the webservice [13:10:29] thanks for reply, asked [13:11:48] (03PS3) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [13:14:44] (03PS1) 10Andrew Bogott: Openstack nova vendordata: more fixes to metadata timeouts [puppet] - 10https://gerrit.wikimedia.org/r/803269 (https://phabricator.wikimedia.org/T309930) [13:14:48] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [13:17:55] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:18:02] (03CR) 10Andrew Bogott: [C: 03+2] Openstack nova vendordata: more fixes to metadata timeouts [puppet] - 10https://gerrit.wikimedia.org/r/803269 (https://phabricator.wikimedia.org/T309930) (owner: 10Andrew Bogott) [13:24:27] (03CR) 10Ayounsi: "Almost good to merge" [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [13:25:13] (03PS13) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:25:30] (03PS9) 10Elukey: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 
10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) [13:26:10] (03CR) 10Elukey: Add BGP configuration for the new ML staging codfw cluster (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [13:27:53] No deployers around for this window? :) [13:28:05] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:28:40] 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Marostegui) a:03Papaul @Papaul can we contact Dell about this and get some advise? Checking the disk controller logs I haven't found anything relevant. [13:29:13] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [13:29:55] (03PS1) 10Marostegui: es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803271 (https://phabricator.wikimedia.org/T309977) [13:30:48] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) [13:30:51] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Marostegui) As these hosts do not have replication, I am leaving MySQL stopped for now in case Papaul needs some reboots/firmware upgrade. @Papaul if you need to power off or reboot this host... 
[13:31:00] (03CR) 10Marostegui: [C: 03+2] es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803271 (https://phabricator.wikimedia.org/T309977) (owner: 10Marostegui) [13:34:15] (03PS14) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:35:30] (03PS1) 10Ayounsi: Disable alert notifications on new netbox frontends [puppet] - 10https://gerrit.wikimedia.org/r/803274 (https://phabricator.wikimedia.org/T296452) [13:36:17] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:36:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:07] (03CR) 10CI reject: [V: 04-1] Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:37:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/pcc-worker1001/35737/cp2038.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:37:50] (03PS1) 10Zabe: netbase: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803275 (https://phabricator.wikimedia.org/T308013) [13:37:52] (03PS1) 10Zabe: ncredir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803276 (https://phabricator.wikimedia.org/T308013) [13:37:54] (03PS1) 10Zabe: mtail: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803277 (https://phabricator.wikimedia.org/T308013) [13:37:56] (03PS1) 10Zabe: mjolnir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803278 (https://phabricator.wikimedia.org/T308013) [13:37:58] (03PS1) 10Zabe: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) [13:38:00] (03PS1) 10Zabe: lxc: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803280 (https://phabricator.wikimedia.org/T308013) [13:38:02] (03PS1) 10Zabe: logster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803281 (https://phabricator.wikimedia.org/T308013) [13:38:04] (03PS1) 10Zabe: logrotate: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803282 (https://phabricator.wikimedia.org/T308013) [13:39:17] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[13:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:16] jouncebot: nowandnext [13:40:16] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1300) [13:40:16] In 1 hour(s) and 49 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1530) [13:40:22] (03PS9) 10Urbanecm: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [13:40:28] (03CR) 10Urbanecm: [C: 03+2] "let's try it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [13:40:31] (03PS15) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:41:01] hauskatze: deploying your patch. I don't think I need your presence for that, since it's a private wiki, which makes it hard for you to test :) [13:41:07] (03CR) 10CI reject: [V: 04-1] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [13:41:10] urbanecm: kind of :) [13:41:18] ? 
[13:41:30] kind of hard to test in a wiki I don't have an account [13:41:35] I mean :) [13:42:20] yeah, exactly :) [13:43:24] (03PS2) 10Zabe: mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) [13:44:19] (03CR) 10Ladsgroup: os_reports: Make the reports look better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802897 (owner: 10Ladsgroup) [13:44:30] (03Merged) 10jenkins-bot: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [13:45:03] (03PS1) 10MVernon: Thanos: add search_platform user [puppet] - 10https://gerrit.wikimedia.org/r/803284 (https://phabricator.wikimedia.org/T309715) [13:45:52] (03CR) 10CI reject: [V: 04-1] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [13:46:32] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) [13:47:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:34] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35738/console" [puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:48:28] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) [13:49:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35739/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:49:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:16] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: b7ca9fb268d59a3c2262733df247fb514b97f8b7: Enable $wgFixDoubleRedirects on officewiki (T305782) (duration: 03m 10s) [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] T305782: Enable $wgFixDoubleRedirects on officewiki - https://phabricator.wikimedia.org/T305782 [13:50:28] (03PS1) 10MVernon: profile::thanos::swift: fake creds for search_platform [labs/private] - 10https://gerrit.wikimedia.org/r/803287 (https://phabricator.wikimedia.org/T309715) [13:50:37] (03CR) 10Filippo Giunchedi: [C: 03+1] Thanos: add search_platform user [puppet] - 10https://gerrit.wikimedia.org/r/803284 (https://phabricator.wikimedia.org/T309715) (owner: 10MVernon) [13:50:44] hauskatze: and let's see what happens :) [13:51:14] (03CR) 10Andrea Denisse: [C: 03+1] Rewrite logster::job to use systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:52:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:57] urbanecm: thanks :) [13:54:13] np [13:54:26] hmm, you synced commonsettings, not IS? [13:54:33] ... 
[13:54:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:45] syncing again [13:55:08] thanks for noticing that [13:55:12] (03CR) 10MVernon: [C: 03+2] Thanos: add search_platform user [puppet] - 10https://gerrit.wikimedia.org/r/803284 (https://phabricator.wikimedia.org/T309715) (owner: 10MVernon) [13:55:33] (03CR) 10MVernon: [V: 03+2 C: 03+2] profile::thanos::swift: fake creds for search_platform [labs/private] - 10https://gerrit.wikimedia.org/r/803287 (https://phabricator.wikimedia.org/T309715) (owner: 10MVernon) [13:56:28] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) [13:56:39] (03PS16) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [13:56:58] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:57:26] np :) [13:58:20] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b7ca9fb268d59a3c2262733df247fb514b97f8b7: Enable $wgFixDoubleRedirects on officewiki (T305782) (duration: 03m 27s) [13:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:24] T305782: Enable $wgFixDoubleRedirects on officewiki - https://phabricator.wikimedia.org/T305782 [13:58:30] I was unsure if it is 'wg' or 'wmg'; MediaWiki docs say wg [14:00:02] it's wg. wmg are WM-specific variables. 
[14:00:43] 10SRE-swift-storage, 10Discovery-Search (Current work), 10Patch-For-Review: Create swift thanos account for Search platform team - https://phabricator.wikimedia.org/T309715 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This should all be done now, and I've restarted all the thanos swift frontends. [14:02:43] I'll be leaving shortly if there are no errors or a revert is needed [14:03:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803274 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [14:09:37] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35740/" [puppet] - 10https://gerrit.wikimedia.org/r/803274 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [14:10:14] (03CR) 10Herron: [C: 03+1] opensearch: add support for managing opensearch 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802862 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [14:11:35] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [14:16:18] (03PS2) 10Filippo Giunchedi: hieradata: TCP probe for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/802071 (https://phabricator.wikimedia.org/T305847) [14:16:20] (03PS19) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:18:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:19:20] (03CR) 10MVernon: "Hi," [labs/private] - 10https://gerrit.wikimedia.org/r/802631 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [14:20:19] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:21:21] (03CR) 10Jbond: [C: 03+2] netbase: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803275 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:21:52] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Papaul) @Marostegui thanks [14:22:04] (03CR) 10Jbond: [C: 03+2] ncredir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803276 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:22:44] (03CR) 10Jbond: [C: 03+2] mtail: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803277 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:23:06] (03CR) 10Jbond: [C: 03+2] mjolnir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803278 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:24:26] (03CR) 10Jbond: [C: 03+2] "thanks again will merge upto here the test in mcrouter need investigating" [puppet] - 10https://gerrit.wikimedia.org/r/803278 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:25:48] (03CR) 10Jbond: [C: 03+2] lxc: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803280 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:25:55] (03PS2) 10Jbond: lxc: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803280 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:26:03] (03PS2) 10Jbond: logster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803281 (https://phabricator.wikimedia.org/T308013) 
(owner: 10Zabe) [14:26:34] (03PS2) 10Jbond: logrotate: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803282 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:26:58] (03CR) 10Jbond: [C: 03+2] logster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803281 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:27:04] (03CR) 10Jbond: [C: 03+2] logrotate: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803282 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:30:03] (03CR) 10Filippo Giunchedi: "Thank you for the followup, I've tested the change in Pontoon and reworked/adjusted a few bits and overall LGTM! See inline too" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:30:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] logrotate: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803282 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:31:02] (03PS1) 10Elukey: role::prometheus: enable settings for k8s ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/803295 (https://phabricator.wikimedia.org/T302195) [14:32:03] (03CR) 10AOkoth: [C: 03+1] vrts: rename exim4 templates from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [14:33:14] (03PS2) 10Krinkle: hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [14:33:57] (03CR) 10Jbond: "LGTM thanks will deploy" [puppet] - 10https://gerrit.wikimedia.org/r/802897 (owner: 10Ladsgroup) [14:33:59] (03CR) 10Jbond: [C: 03+2] os_reports: Make the reports look better [puppet] - 10https://gerrit.wikimedia.org/r/802897 (owner: 10Ladsgroup) [14:35:04] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) [14:36:19] (03PS3) 10Jbond: 
mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:36:53] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35741/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:37:32] (03CR) 10CI reject: [V: 04-1] mcrouter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803279 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:38:45] (03PS2) 10Elukey: role::prometheus: enable settings for k8s ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/803295 (https://phabricator.wikimedia.org/T302195) [14:41:05] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: remove redundant metrics [puppet] - 10https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) [14:41:52] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) I propose the following rollout: 1. [change 744763 (pup... 
[14:42:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35742/console" [puppet] - 10https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:45:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:45:56] (03PS1) 10Ladsgroup: os-reports: Push the ul elements inside [puppet] - 10https://gerrit.wikimedia.org/r/803299 [14:47:04] (03PS1) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [14:47:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:27] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [14:49:00] (03PS1) 10Ssingh: trafficserver: 9.x upgrade: update logging field for HTTP version [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) [14:50:04] 10SRE, 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Papaul) ` 2022-06-06 12:36:35 PCI1360 A bus fatal error was detected on a component at slot 4. Log Sequence Number: 323 Detailed Description: System performance may be degraded, or system may fail to operate.... [14:50:16] (03PS2) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [14:50:34] (03CR) 10Ssingh: "Not very happy with this one but let's discuss that during the reviews." 
[puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:51:33] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [14:52:04] 10SRE, 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Papaul) https://www.dell.com/support/manuals/en-us/integrated-dell-remote-access-cntrllr-8-with-lifecycle-controller-v2.00.00.00/eemi_13g-v1/pci-event-messages?guid=guid-b22e470e-adc2-4ef4-ac82-98df81dc1dff&lang=en... [14:52:16] (03CR) 10Elukey: [C: 03+2] Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [14:52:39] (03CR) 10Elukey: [C: 03+2] Add BGP configuration for the new ML staging codfw cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [14:52:49] (03PS3) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [14:52:54] (03CR) 10Ladsgroup: [C: 03+2] os-reports: Push the ul elements inside [puppet] - 10https://gerrit.wikimedia.org/r/803299 (owner: 10Ladsgroup) [14:53:27] (03Merged) 10jenkins-bot: Add BGP configuration for the new ML staging codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [14:53:48] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [14:54:57] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:35] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:10] (03PS4) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [14:56:23] !log add BGP config for the k8s ml-staging cluster on cr{1,2}-codfw via homer - T302198 [14:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:29] T302198: Create ml-serve-staging k8s's control plane VMs - https://phabricator.wikimedia.org/T302198 [14:56:58] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:57:07] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [14:57:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:33] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:54] (03PS5) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [15:01:47] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [15:02:21] (03CR) 10Elukey: [C: 03+2] role::prometheus: enable settings for k8s ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/803295 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [15:04:09] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/803305 [15:11:11] PROBLEM - Host es2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:16:53] (03PS6) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [15:17:07] RECOVERY - Host es2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [15:17:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:17:50] (03CR) 10CI reject: [V: 04-1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [15:18:05] (03CR) 10Ahmon Dancy: "Thanks Alex!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:18:26] 10SRE, 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Papaul) a:05Papaul→03Marostegui Firmware upgrade done for : - BIOS - IDRAC - Backplan1 Power drain on the server @Marostegui we can repool the server for now after all the firmware upgrade according to Dell... [15:18:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] mediawiki 0.2.2: Run test job as uid 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/802799 (owner: 10Ahmon Dancy) [15:20:52] 10SRE, 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Marostegui) Sounds good Papaul, I will start MySQL again then [15:21:00] (03Merged) 10jenkins-bot: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:22:15] (03Merged) 10jenkins-bot: mediawiki 0.2.2: Run test job as uid 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/802799 (owner: 10Ahmon Dancy) [15:23:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:24:53] XioNoX: didn't you merge the disable notification earlier for netbox1002? [15:25:15] volans: yep [15:25:29] 10SRE, 10ops-codfw, 10DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (10Marostegui) Upgraded and started mysql [15:25:55] so why did it alert? 
:D [15:26:13] jbond: ack re: stripping '---' from the output, yes that's what I meant [15:26:31] (03CR) 10Ahmon Dancy: [C: 03+1] tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802869 (owner: 10Brennen Bearnes) [15:29:55] (03CR) 10Ahmon Dancy: [C: 03+1] GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [15:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1530). [15:31:42] (03PS2) 10Daimona Eaytoy: Remove references to $wgEnableLocalTimedText [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802894 [15:35:08] (03CR) 10Dom Walden: "I am afraid I don't know enough about this to comment. But, happy to +2." [deployment-charts] - 10https://gerrit.wikimedia.org/r/803305 (owner: 10PipelineBot) [15:36:34] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [15:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org w... [15:38:29] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:39:25] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:39:26] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [15:40:24] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [15:40:53] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/802749 (owner: 10Muehlenhoff) [15:41:29] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:42:07] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10dpifke) [15:42:39] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10dpifke) [15:43:18] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10dpifke) [15:43:32] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10MPhamWMF) [15:44:40] (03PS1) 10Jbond: wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 [15:45:47] (03CR) 10CI reject: [V: 04-1] wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 (owner: 10Jbond) [15:45:48] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10bking) This is complete...closing! 
[15:47:03] (03PS2) 10Jbond: wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 [15:48:06] 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10ssingh) [15:50:40] (03CR) 10CI reject: [V: 04-1] wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 (owner: 10Jbond) [15:53:20] (03PS3) 10Jbond: wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 [15:56:23] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] (03CR) 10Jbond: [C: 03+2] wmflib: add to_yaml function which allows striping yaml header [puppet] - 10https://gerrit.wikimedia.org/r/803311 (owner: 10Jbond) [15:59:27] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [15:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:32] (03PS7) 10Bking: Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) [16:03:18] 10SRE, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10JArguello-WMF) [16:03:44] (03PS20) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 [16:03:46] 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10JArguello-WMF) [16:05:11] (03PS21) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 
10https://gerrit.wikimedia.org/r/787067 [16:06:16] (03PS4) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [16:08:13] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) Instead of adding a quick check in the downtime cookbook only I preferred to add the feature to spicerack directly so... [16:08:46] (03PS1) 10Volans: pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 [16:08:48] (03PS1) 10Volans: icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) [16:10:55] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [16:10:58] (03CR) 10Ebernhardson: [C: 03+1] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [16:12:00] (03CR) 10Bking: [C: 03+2] Elastic: Add elastic bindir to root's path [puppet] - 10https://gerrit.wikimedia.org/r/803300 (https://phabricator.wikimedia.org/T309720) (owner: 10Bking) [16:12:45] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [16:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with... 
[16:15:38] legoktm: here's the nudge you asked for ref merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/800855 if no one else had :) T309449 [16:15:39] T309449: Package 'cgroup-bin' has no installation candidate on Debian 11 (modules/mediawiki/manifests/cgroup.pp) - https://phabricator.wikimedia.org/T309449 [16:16:59] (03CR) 10CI reject: [V: 04-1] pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 (owner: 10Volans) [16:20:09] (03PS2) 10Volans: pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 [16:20:11] (03PS2) 10Volans: icinga: ensure that the downtime was applied [software/spicerack] - 10https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) [16:23:28] (03CR) 10Ayounsi: "Example output for an interface rename: https://phabricator.wikimedia.org/P29440" [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [16:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:30:15] (03PS1) 10Papaul: Testing new partman recipe for clouddumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/803318 (https://phabricator.wikimedia.org/T302981) [16:31:55] (03CR) 10Papaul: [C: 03+2] Testing new partman recipe for clouddumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/803318 (https://phabricator.wikimedia.org/T302981) (owner: 10Papaul) [16:33:36] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [16:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: 
Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org w... [16:34:23] (03PS5) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [16:42:06] (03PS1) 10Ebernhardson: Revert "Revert "Upgrade to elasticsearch 7.10.2"" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803321 (https://phabricator.wikimedia.org/T309720) [16:43:30] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) Testing out the "Move devices attributes" script before using it on the new PDUs move all configuration from ps1-a2-codfw to ps1-a2-codfw-new give the output below ` [success] [dst] Sett... [16:45:39] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [16:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:44] volans: puppet is disabled on netbox1002 [16:46:54] that's why, I'll follow up on that [16:47:06] XioNoX: ahhh thx [16:48:21] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [16:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:06] 10SRE, 10SRE-Access-Requests: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10nettrom_WMF) [16:53:00] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10nettrom_WMF) I filed this task and am notifying @KStoller-WMF about it so she can fill out the necessary information. I don't think SSH access is needed at th... 
[16:59:29] (03CR) 10Thcipriani: [C: 03+2] tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802869 (owner: 10Brennen Bearnes) [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T1700). [17:01:16] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with... [17:18:36] (03PS1) 10Jbond: C:puppetmaster: Add requestctl validate to the private repo pre-commit [puppet] - 10https://gerrit.wikimedia.org/r/803324 [17:20:30] (03PS2) 10Jbond: C:puppetmaster: Add requestctl validate to the private repo pre-commit [puppet] - 10https://gerrit.wikimedia.org/r/803324 [17:21:33] (03PS3) 10Jbond: C:puppetmaster: Add requestctl validate to the private repo pre-commit [puppet] - 10https://gerrit.wikimedia.org/r/803324 [17:21:54] 10SRE, 10Data-Engineering, 10Discovery, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10JArguello-WMF) [17:24:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35745/console" [puppet] - 10https://gerrit.wikimedia.org/r/803324 (owner: 10Jbond) [17:35:52] (03CR) 10Herron: [C: 03+1] "LGTM! 
📈📉" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [17:36:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:37:04] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10KFrancis) @MoritzMuehlenhoff @CDanis The NDA has been completed. Please proceed with the access request. Thanks! [17:39:31] (03Abandoned) 10Reedy: Bump default cache epochs from 20130601 to 20160101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443866 (owner: 10Reedy) [17:39:52] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [17:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org w... 
[17:40:45] (03PS2) 10Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [17:41:15] (03CR) 10RLazarus: [C: 03+1] C:puppetmaster: Add requestctl validate to the private repo pre-commit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803324 (owner: 10Jbond) [17:44:03] (03CR) 10Brennen Bearnes: [V: 03+2] tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802869 (owner: 10Brennen Bearnes) [17:45:58] (03PS1) 10Herron: logstash-slo: update plugin id label from elasticsearch to logstash [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/803328 [17:46:29] (03PS2) 10Herron: logstash-slo: update plugin id label from elasticsearch to opensearch [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/803328 [17:49:06] (03CR) 10Herron: [V: 03+2 C: 03+2] logstash-slo: update plugin id label from elasticsearch to opensearch [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/803328 (owner: 10Herron) [17:50:43] 10SRE, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10RLazarus) p:05Triage→03Medium [17:57:09] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [17:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:20] (03PS3) 10Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [17:58:47] (03CR) 10Volans: "addressed comment" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [18:00:18] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
clouddumps1001.wikimedia.org with reason: host reimage [18:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [18:10:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [18:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298560)', diff saved to https://phabricator.wikimedia.org/P29442 and previous config saved to /var/cache/conftool/dbconfig/20220606-181103-ladsgroup.json [18:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:07] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [18:12:31] (03CR) 10RLazarus: slo: Correct queries for error budget remaining (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [18:14:30] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with... 
[18:14:55] (03PS5) 10RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) [18:15:50] (03CR) 10RLazarus: [V: 03+2 C: 03+2] slo: Correct queries for error budget remaining [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [18:17:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:19:23] (03PS2) 10Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 [18:19:31] (03CR) 10Volans: "addressed comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [18:19:51] (03CR) 10Volans: "addressed comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [18:21:01] (03CR) 10Volans: "Sorry, I messed up with the rebase, I squashed the 2 CR into this latest PS, I'll fix it later re-splitting the two." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [18:27:33] (03CR) 10RLazarus: "Just for clarity: would you like me to go ahead and deploy this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [18:28:00] (03PS1) 10Andrew Bogott: magnum: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/803332 [18:30:11] (03CR) 10Andrew Bogott: [C: 03+2] magnum: update policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/803332 (owner: 10Andrew Bogott) [18:36:00] (03PS1) 10Andrew Bogott: magnum policy.yaml: close a string [puppet] - 10https://gerrit.wikimedia.org/r/803333 [18:37:13] (03CR) 10Andrew Bogott: [C: 03+2] magnum policy.yaml: close a string [puppet] - 10https://gerrit.wikimedia.org/r/803333 (owner: 10Andrew Bogott) [18:37:52] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:39:34] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:39:55] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10CDanis) 05Open→03Resolved a:05KFrancis→03CDanis Completed! [18:56:33] (03PS1) 10Andrew Bogott: Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m" [puppet] - 10https://gerrit.wikimedia.org/r/802956 [18:58:36] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) @Cmjohnson Alright, gotcha! Thanks for the updates and Dell request. 
[18:58:47] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) 05Open→03In progress [19:00:15] (03CR) 10Andrew Bogott: [C: 03+2] Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m" [puppet] - 10https://gerrit.wikimedia.org/r/802956 (owner: 10Andrew Bogott) [19:02:02] (03CR) 10Catrope: doc.wikimedia.org CSP: Also allow form submissions to enwiki/wikidata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [19:13:34] jouncebot: nowandnext [19:13:34] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [19:13:34] In 0 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T2000) [19:13:53] !log disabling puppet on appservers to deploy https://gerrit.wikimedia.org/r/801776 [19:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:46] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:49] (03CR) 10RLazarus: [C: 03+2] doc.wikimedia.org CSP: Also allow form submissions to enwiki/wikidata [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [19:19:05] !log enabled puppet on appservers [19:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:43] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) This is a better link since it's directly upstream and latest docs from 2022: https://www.haproxy.org/download/2.7/doc/configuration.txt ^ it's th... 
[19:24:17] 10SRE, 10SRE-OnFire, 10Observability-Logging, 10Sustainability (Incident Followup), 10Wikimedia-Incident: create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10Krinkle) We have something like this for POST requests to `api.php` on appservers, which we log (unsampled) to `api.lo... [19:25:58] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query service: port cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [19:29:51] (03PS1) 10Ryan Kemper: query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) [19:34:56] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:37:05] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) 05In progress→03Open [19:38:04] (03PS2) 10Ryan Kemper: query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) [19:38:17] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [19:46:29] (03PS1) 10AntiCompositeNumber: SpecialDeletedContributions: Hide date headers [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802957 [19:54:59] (03PS1) 10Ryan Kemper: query_service: load categories daily, not weekly [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) [19:55:15] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) 
[19:58:46] (03PS2) 10Ryan Kemper: query_service: load categories daily, not weekly [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T2000). [20:00:05] AntiComposite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:22] o/ [20:00:23] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:00:24] hello AntiComposite! [20:00:33] i can deploy today [20:01:43] (03PS1) 10Nskaggs: Add tenacity lib and retry logic [puppet] - 10https://gerrit.wikimedia.org/r/803340 [20:02:15] AntiComposite: just to double check, we're removing the dates from https://test.wikipedia.org/wiki/Special:DeletedContributions/Martin_Urbanec, right? 
[20:02:19] (the headlines i mean) [20:02:30] yes, that's correct [20:02:34] ok, thanks [20:02:36] (03CR) 10Urbanecm: [C: 03+2] SpecialDeletedContributions: Hide date headers [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802957 (owner: 10AntiCompositeNumber) [20:02:39] same as Special:Contribs [20:03:05] okay [20:03:19] I'll let you know once this is ready for testing -- will take a while to merge [20:03:49] alright [20:05:45] (03PS3) 10Ryan Kemper: query_service: load categories daily, not weekly [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) [20:05:55] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:08:05] (03PS1) 10Jbond: utils: Add small script to set up bundler [puppet] - 10https://gerrit.wikimedia.org/r/803341 [20:08:59] (03CR) 10CI reject: [V: 04-1] utils: Add small script to set up bundler [puppet] - 10https://gerrit.wikimedia.org/r/803341 (owner: 10Jbond) [20:09:09] (03PS4) 10Ryan Kemper: query_service: load categories daily, not weekly [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) [20:10:44] (03CR) 10Ryan Kemper: [C: 03+2] query_service: load categories daily, not weekly [puppet] - 10https://gerrit.wikimedia.org/r/803339 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:13:09] (03CR) 10Dzahn: utils: Add small script to set up bundler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803341 (owner: 10Jbond) [20:13:51] (03PS3) 10Ryan Kemper: query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) [20:17:41] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:18:17] (03CR) 10Eevans: WIP: Configure AQS Cassandra 
hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [20:21:43] (03Merged) 10jenkins-bot: SpecialDeletedContributions: Hide date headers [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802957 (owner: 10AntiCompositeNumber) [20:21:55] (03PS4) 10Ryan Kemper: query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) [20:22:09] here we go :) [20:22:14] (03CR) 10Ryan Kemper: [C: 03+2] query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:22:16] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query service: clean up absented resources [puppet] - 10https://gerrit.wikimedia.org/r/803336 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:23:48] AntiComposite: should be ready at mwdebug1001. can you check please? [20:24:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:25] urbanecm, looks good to me, thanks [20:24:40] thanks, syncing [20:25:42] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10wiki_willy) Initial quote received back is $26,642.00 for the equipment, minus $3,325.25 for the drive shredding and $3,716.76 for freight charges. I'm seeing if they can lower the freight costs, before... 
[20:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:27:06] (03CR) 10Eevans: [C: 03+1] Dummy keys and certificates for cassandra (aqs) (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/802631 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [20:27:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:34] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.14/includes/specials/SpecialDeletedContributions.php: a15c11e72d766fa45aee690d3dffb17b186a35e0: SpecialDeletedContributions: Hide date headers (duration: 03m 09s) [20:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:39] AntiComposite: and live [20:28:41] anything else? [20:28:56] nope, looks good to me, thanks for your help! [20:29:39] no problem, thanks for the patch! 
[20:29:47] !log UTC late B&C window deploy [20:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:51] !log UTC late B&C window deploy completed [20:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:10] (03CR) 10Cwhite: [C: 03+2] opensearch: add support for managing opensearch 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802862 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [20:31:11] (03PS1) 10Ryan Kemper: query_service: we don't use cron here anymore [puppet] - 10https://gerrit.wikimedia.org/r/803344 (https://phabricator.wikimedia.org/T273673) [20:33:20] (03CR) 10Cwhite: [C: 03+2] add new index pattern format [software/ecs] - 10https://gerrit.wikimedia.org/r/802873 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:33:59] (03Merged) 10jenkins-bot: add new index pattern format [software/ecs] - 10https://gerrit.wikimedia.org/r/802873 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:39:07] (03CR) 10Andrew Bogott: "This might help with openstack instability. Is it possible to add a sleep between attempts? Or is that the default?" 
[puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [20:39:49] (03CR) 10Andrew Bogott: Add tenacity lib and retry logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [20:43:12] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:44:21] (03PS1) 10Cwhite: templates: replace all version instances [software/ecs] - 10https://gerrit.wikimedia.org/r/803345 (https://phabricator.wikimedia.org/T305175) [20:47:20] (03CR) 10Ryan Kemper: [C: 03+2] query_service: we don't use cron here anymore [puppet] - 10https://gerrit.wikimedia.org/r/803344 (https://phabricator.wikimedia.org/T273673) (owner: 10Ryan Kemper) [20:48:16] (03CR) 10Cwhite: [C: 03+2] templates: replace all version instances [software/ecs] - 10https://gerrit.wikimedia.org/r/803345 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:48:49] (03Merged) 10jenkins-bot: templates: replace all version instances [software/ecs] - 10https://gerrit.wikimedia.org/r/803345 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:57:56] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:00:05] Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220606T2100). Please do the needful. 
[21:03:45] (03PS1) 10Cwhite: add new index pattern to ecs templates [puppet] - 10https://gerrit.wikimedia.org/r/803350 (https://phabricator.wikimedia.org/T305175) [21:04:15] (03PS2) 10Cwhite: logstash: add new index pattern to ecs templates [puppet] - 10https://gerrit.wikimedia.org/r/803350 (https://phabricator.wikimedia.org/T305175) [21:07:42] (03CR) 10Dduvall: [C: 03+1] Turn mw_releases into a list [puppet] - 10https://gerrit.wikimedia.org/r/800758 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [21:09:34] (03PS1) 10Nskaggs: Fix spelling [puppet] - 10https://gerrit.wikimedia.org/r/803353 [21:10:45] (03Abandoned) 10Nskaggs: Fix spelling [puppet] - 10https://gerrit.wikimedia.org/r/803353 (owner: 10Nskaggs) [21:17:56] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:04] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10wiki_willy) a:03Cmjohnson [21:24:54] (03PS3) 10Volans: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 [21:24:56] (03PS4) 10Volans: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) [21:26:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10wiki_willy) [21:26:44] (03CR) 10Volans: "un-squashed commits" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [21:27:08] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (10wiki_willy) 05Open→03Resolved a:03wiki_willy [21:29:39] (03CR) 10Volans: "FYI, if it might be useful we have also @retry in the wmflib package. 
See the related documentation in:" [puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [21:29:43] (03PS2) 10Nskaggs: Add tenacity lib and retry logic [puppet] - 10https://gerrit.wikimedia.org/r/803340 [21:31:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35746/" [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [21:33:16] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on otrs1001 confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [21:33:49] (03PS2) 10Dzahn: vrts: rename daemon resource and template from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) [21:34:34] * Krinkle testing on mwdebug1002 [21:35:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35747/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [21:41:19] !log otrs1001 - stopped otrs-daemon, started vrts-daemon - after renaming it gerrit:802849 (T293942) [21:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:25] T293942: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 [21:48:24] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "manually stopped otrs-daemon, started vrts-daemon" [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [22:01:23] (03CR) 10Cwhite: [C: 03+2] logstash: add new index pattern to ecs templates [puppet] - 10https://gerrit.wikimedia.org/r/803350 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:03:00] * Krinkle done testing [22:09:33] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - 
https://phabricator.wikimedia.org/T310002 (10KStoller-WMF) [22:12:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:12:32] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:16:32] (03CR) 10Nskaggs: Add tenacity lib and retry logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [22:18:15] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10KStoller-WMF) I've added my info to the task and signed the "Acknowledgement of Wikimedia Server Access Responsibilities". I think I now need @MMiller_WMF 's... 
[22:19:06] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:28] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:21:54] !log upgrade prometheus-es-exporter on logstash2026 T304440 [22:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:58] T304440: Test and upgrade OpenSearch to 2.0.0 - https://phabricator.wikimedia.org/T304440 [22:24:06] 10SRE, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) a:03Dzahn ok, thank you IF team! assigning back to me for the moment to follow-up. Yes, there was a specific person. I will readd this with a speci... [22:33:07] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, 10SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) 05Open→03Resolved [22:39:36] !log upgrade prometheus-es-exporter on logstash1026 T304440 [22:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:42] T304440: Test and upgrade OpenSearch to 2.0.0 - https://phabricator.wikimedia.org/T304440 [22:45:34] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:10:01] (03PS1) 10Papaul: Testing partman recipe for couddumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/803366 (https://phabricator.wikimedia.org/T302981) [23:11:58] (03CR) 10Papaul: [C: 03+2] Testing partman recipe for couddumps nodes [puppet] - 10https://gerrit.wikimedia.org/r/803366 (https://phabricator.wikimedia.org/T302981) (owner: 10Papaul) [23:14:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 
others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Papaul) @Andrew it looks like the way partman is seeing disks in a raid configuration and disk in a no raid configuration is dif... [23:14:15] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [23:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org w... [23:17:27] !log removing one file for legal compliance [23:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:42] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:21:12] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:27:52] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:39:13] (03PS1) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [23:44:38] (03PS2) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [23:49:19] (03CR) 10BCornwall: "Not sure if "warning" is the appropriate severity for this; I suspect it may require a more urgent severity." [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [23:54:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) That's similar to what I was seeing -- I don't understand why partman can tell the difference unless it's just the diffe...