[00:03:14] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 17.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:03:22] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:03:43] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [00:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:51] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [00:04:54] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 33.65 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:07:46] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 87.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:08:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:08:54] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:11] (03CR) 10Cwhite: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [00:10:45] (03CR) 10Jdlrobson: [C: 03+1] "To be backported tomorrow and run." [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [00:12:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS buster [00:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:15] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster [00:19:20] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:20:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:21:06] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:58] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Papaul) I was able to pxe boot with 
1024 but got ` Failed to load ldlinux.c32 ` [00:33:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [00:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage [00:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:18] (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire - https://alerts.wikimedia.org [00:44:55] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:46:53] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:58:41] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:01:21] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:28:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS buster [01:28:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:34] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**) -... [01:29:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [01:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [01:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [01:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:08] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [01:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed... 
[01:37:51] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye [01:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:59] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye [01:43:20] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1026.eqiad.wmnet with OS bullseye [01:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:28] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye executed... [01:44:20] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [01:54:50] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [02:00:50] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [02:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22637 and previous config saved to /var/cache/conftool/dbconfig/20220316-020831-marostegui.json [02:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:36] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [02:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22638 and previous config saved to /var/cache/conftool/dbconfig/20220316-022336-marostegui.json [02:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:02] PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100% [02:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22639 and previous config saved to /var/cache/conftool/dbconfig/20220316-023842-marostegui.json [02:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1105:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22640 and previous config saved to /var/cache/conftool/dbconfig/20220316-025347-marostegui.json [02:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:52] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [03:02:24] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:58:22] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:41:33] (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire - https://alerts.wikimedia.org [05:01:43] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.106`. Pre-deploy tests passing on canary `wdqs1003` [05:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:41] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@38de611]: 0.3.106 [05:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:13] !log [WDQS Deploy] Tests passing following deploy of `0.3.106` on canary `wdqs1003`; proceeding to rest of fleet [05:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:17] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@38de611]: 0.3.106 (duration: 06m 36s) [05:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:10] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [05:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:13] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart 
wdqs-categories'` [05:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:27] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [05:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:45] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@38de611] (wcqs): Deploy 0.3.106 to WCQS [05:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:06] !log [WCQS Deploy] Tests look good following deploy of `0.3.106` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet [05:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:38] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@38de611] (wcqs): Deploy 0.3.106 to WCQS (duration: 01m 53s) [05:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:53] !log [WCQS Deploy] Test query passed on commons-query.wikimedia.org ; WCQS deploy complete [05:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1068.eqiad.wmnet with OS stretch [05:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch executed with errors:... [05:36:35] !log [WDQS Deploy] Deploy complete. 
Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [05:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:20] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:57:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [05:57:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [05:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22641 and previous config saved to /var/cache/conftool/dbconfig/20220316-055805-marostegui.json [05:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:09] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:58:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [05:58:59] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [05:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22642 and previous config saved to /var/cache/conftool/dbconfig/20220316-055903-marostegui.json [05:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:07] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:59:30] (03PS1) 10Muehlenhoff: Fix typo in role name [puppet] - 10https://gerrit.wikimedia.org/r/771243 [06:00:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:00:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22643 and previous config saved to /var/cache/conftool/dbconfig/20220316-060008-marostegui.json [06:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:12] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [06:03:45] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in role name [puppet] - 10https://gerrit.wikimedia.org/r/771243 (owner: 10Muehlenhoff) [06:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [06:08:00] (03CR) 10Marostegui: auto_schema: Add abaility to skip replicas (031 comment) [software] - 10https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: 10Ladsgroup) [06:08:22] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:43] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) Thanks for working on this @Ladsgro... [06:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22644 and previous config saved to /var/cache/conftool/dbconfig/20220316-063344-marostegui.json [06:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:49] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:44:23] qchris: o/ thanks for the istio repo! 
[06:48:24] SRE, Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (RhinosF1) [06:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22646 and previous config saved to /var/cache/conftool/dbconfig/20220316-064849-marostegui.json [06:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:24] PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100% [06:49:28] PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100% [06:50:44] PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100% [06:52:06] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:55:35] (PS1) Marostegui: switchover-tmpl.sh: Add orchestrator tag notes [software] - https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869) [06:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22647 and previous config saved to /var/cache/conftool/dbconfig/20220316-065918-marostegui.json [06:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:23] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:00:05] Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:13] 'morning [07:00:33] i can deploy kart_ (unless you want to self-service?) [07:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3312', diff saved to https://phabricator.wikimedia.org/P22648 and previous config saved to /var/cache/conftool/dbconfig/20220316-070033-marostegui.json [07:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:51] urbanecm: Thanks. Please go ahead :) [07:01:25] urbanecm: specially, new table creation on testwiki. I don't recall I've done it earlier or maybe it was too long back :) [07:01:41] kart_: i do recall doing it for you in the past :D [07:01:48] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:16] the tables should be only on testwiki now? [07:02:31] (PS2) Urbanecm: Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry) [07:02:35] (CR) Urbanecm: [C: +2] Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry) [07:02:41] urbanecm: yes. as wmf.26 yet to deploy on Group1 and 2. [07:02:55] kart_: well we can create the table everywhere now if that's the goal [07:03:17] (Merged) jenkins-bot: Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry) [07:03:17] urbanecm: let's wait. We need to do some testing on testwiki too. 
[07:03:21] in my understanding, it's usually better to do that (even if it stays empty on many wikis), as it's easier to keep track of which table exist where that way [07:03:50] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:03:50] urbanecm: oh, if that's possible - it requires to be create on x1 cluster. [07:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22649 and previous config saved to /var/cache/conftool/dbconfig/20220316-070354-marostegui.json [07:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22650 and previous config saved to /var/cache/conftool/dbconfig/20220316-070452-marostegui.json [07:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:56] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:05:01] kart_: in the meanwhile, pulled the config patch to mwdebug1001. please test. [07:05:40] sure. Testing. [07:06:52] kart_: so, you want me to create the tables where exactly? in wikishared on x1? in the per-wiki DB for testwiki on x1? in testwiki's main database? a combination of those [07:06:53] ACKNOWLEDGEMENT - MegaRAID on db1158 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T303910 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:06:58] SRE, ops-eqiad: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (ops-monitoring-bot) [07:07:26] urbanecm: OK. Works. Shows expected msg. 
[07:07:30] great, syncing [07:07:41] https://phabricator.wikimedia.org/T302371#7756524 looks to say testwiki's main database and wikishared, but I'd like to confirm that before i do it, as table creation is hard to undo [07:07:47] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (elukey) @cmooney hi! I tried on ms-be1068 and the arp cache looks broken, lldpi shows me that lsw1-e1-eqiad is the top of rack, maybe the same that happened... [07:08:00] SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) p:Triage→Medium The RAID is indeed degraded: ` Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID... [07:08:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 455895168ab266813ae499e8fc353c66e6d5b450: Disable ContentTranslation for non-extended confirmed users on viwiki (T299636) (duration: 00m 51s) [07:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:00] T299636: Disable ContentTranslation for non-extended confirmed users on viwiki - https://phabricator.wikimedia.org/T299636 [07:09:04] PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:09:11] kart_: config patch live. 
waiting for your answer re table creation before i proceed with that :) [07:10:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:20] SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) a:Cmjohnson Disk #2 is gone: ` root@db1158:~# megacli -PDList -aALL | grep Slot Slot Number: 0 Slot Number: 1 Slot Number: 3 Slot Number: 4 Slot Number: 5 Slot Number: 6 Slot Number: 7 Slot Numbe... [07:10:57] urbanecm: Let's do that only for testwiki and with wmf.26 all Wikis, I'll schedule it on Monday. testwiki and other Wikipedias for CX uses different DBs (s3 v/s x1). [07:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:31] urbanecm: Please log command also, so I'll remember that :) [07:11:35] kart_: sounds good to me. last confirmation: I create the tables, using the SQL files specified in T302371's description, in testwiki's s3 DB only [07:11:36] T302371: Create new tables: cx_significant_edits and cx_section_translation - https://phabricator.wikimedia.org/T302371 [07:11:53] urbanecm: Yes. Confirmed. 
[07:11:56] doing [07:15:22] !log Create `testwiki.cx_significant_edits` and `testwiki.cx_section_translation` at s3 (T302371; `mwscript sql.php --wiki=testwiki /srv/mediawiki-staging/php-1.38.0-wmf.26/extensions/ContentTranslation/sql/{section-translations,significant-edits}.sql)`) [07:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:48] kart_: should be done now, see https://www.irccloud.com/pastebin/yUotxgfS/ [07:16:10] urbanecm: looks good! [07:16:22] kart_: I'm not sure how much the command is useful though. for x1, it'll look differently [07:17:00] urbanecm: No problem. Let's do that on Monday :) [07:17:21] sounds good :) [07:17:25] anything else i can do for you today? [07:17:31] urbanecm: Thanks a lot :) [07:17:37] urbanecm: Done for now :) [07:17:44] okay! see you later then [07:18:04] !log UTC morning B&C window done [07:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22651 and previous config saved to /var/cache/conftool/dbconfig/20220316-071859-marostegui.json [07:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:04] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:19:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [07:19:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [07:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [07:19:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [07:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22652 and previous config saved to /var/cache/conftool/dbconfig/20220316-071957-marostegui.json [07:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:16] what happened to metawiki at UTC midnight today? [07:27:52] Nikerabbit: can you be a bit more specific? [07:28:27] urbanecm: check Language-Team dashboard in Logstash for past 12 hours [07:28:35] looking [07:29:15] (03CR) 10Elukey: [C: 03+2] Set simpler partman recipe for kubernetes200[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [07:34:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] admin: add releng to docker group on deployment [puppet] - 10https://gerrit.wikimedia.org/r/770976 (https://phabricator.wikimedia.org/T303450) (owner: 10Giuseppe Lavagetto) [07:35:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22653 and previous config saved to /var/cache/conftool/dbconfig/20220316-073502-marostegui.json [07:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:22] (03CR) 10Ladsgroup: "Ping" [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup) [07:45:15] (03CR) 10Ladsgroup: "Ping. We have had another case of this last week. It was auto_schema otherwise I would have killed the dump." 
[dumps] - 10https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup) [07:49:11] !log dbmaint on master of s4@eqiad (T298743) [07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:15] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [07:50:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22654 and previous config saved to /var/cache/conftool/dbconfig/20220316-075007-marostegui.json [07:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:12] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:51:26] (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [07:51:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:51:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:52:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:52:42] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:52:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22655 and previous config saved to /var/cache/conftool/dbconfig/20220316-075248-marostegui.json [07:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:52] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:54:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22656 and previous config saved to /var/cache/conftool/dbconfig/20220316-075448-marostegui.json [07:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:54:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22657 and previous config saved to /var/cache/conftool/dbconfig/20220316-075502-marostegui.json [07:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:06] T297189: Schema change 
for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:56:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22658 and previous config saved to /var/cache/conftool/dbconfig/20220316-075612-marostegui.json [07:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:25] (03PS2) 10Ladsgroup: auto_schema: Add ability to skip replicas [software] - 10https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) [08:00:09] (03CR) 10Ladsgroup: [C: 03+1] switchover-tmpl.sh: Add orchestrator tag notes [software] - 10https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869) (owner: 10Marostegui) [08:00:33] jouncebot: nowandnext [08:00:33] No deployments scheduled for the next 4 hour(s) and 59 minute(s) [08:00:34] In 4 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1300) [08:00:40] noice [08:00:57] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10Joe) 05Open→03Resolved [08:02:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: get blocked-nets from etcd [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [08:02:52] (03CR) 10Marostegui: [C: 03+1] auto_schema: Add ability to skip replicas [software] - 10https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: 10Ladsgroup) [08:03:06] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add ability to skip replicas [software] - 10https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: 10Ladsgroup) [08:03:35] (03Merged) 10jenkins-bot: auto_schema: Add ability to skip replicas [software] - 
10https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: 10Ladsgroup) [08:04:29] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Add orchestrator tag notes [software] - 10https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869) (owner: 10Marostegui) [08:07:54] 10SRE-OnFire, 10DBA, 10Platform Engineering, 10Performance-Team (Radar), and 2 others: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Marostegui) [08:08:07] 10SRE-OnFire, 10Data-Persistence (Consultation), 10Platform Engineering, 10Performance-Team (Radar), and 2 others: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10M... [08:08:47] (03PS2) 10Ladsgroup: Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418) [08:09:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22659 and previous config saved to /var/cache/conftool/dbconfig/20220316-080953-marostegui.json [08:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:06] (03CR) 10Ladsgroup: [C: 03+2] Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [08:10:47] (03Merged) 10jenkins-bot: Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [08:10:50] RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:11:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22660 and previous config saved to /var/cache/conftool/dbconfig/20220316-081117-marostegui.json [08:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:57] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:770130|Change A/V player to videojs in the first batch of production wiki (T248418)]] (duration: 00m 49s) [08:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:00] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [08:12:24] marostegui: heads up, this change of a/v player in wikis will lead to ParserCache fragmentation, we tried to avoid it as much as possible but lmk if you see any issues [08:12:32] wilco [08:13:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [08:14:42] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4028 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:14:45] (03PS1) 10Elukey: install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - 10https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744) [08:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:48] PROBLEM - Confd vcl based reload on cp4028 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:17:30] (03PS1) 10Giuseppe Lavagetto: varnish: add ACLs even if empty [puppet] - 10https://gerrit.wikimedia.org/r/771320 [08:17:56] (03PS2) 10Elukey: install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - 10https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744) [08:18:02] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] varnish: add ACLs even if empty [puppet] - 10https://gerrit.wikimedia.org/r/771320 (owner: 10Giuseppe Lavagetto) [08:20:08] (03CR) 10Elukey: [C: 03+2] install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - 10https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:21:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:21:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22661 and previous config saved to /var/cache/conftool/dbconfig/20220316-082458-marostegui.json [08:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22662 and previous config saved to /var/cache/conftool/dbconfig/20220316-082622-marostegui.json [08:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:21] PROBLEM - Confd 
template for /var/netmapper/abuse_networks.json on cp4021 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:22] PROBLEM - Confd vcl based reload on cp4021 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:30:26] _joe_: ^ looks like your patch [08:30:52] <_joe_> RhinosF1: it's a temporary problem with icinga yes [08:31:01] <_joe_> I think [08:31:12] <_joe_> let me try to run puppet on the alert host [08:31:54] <_joe_> basically I removed that file and it's ok, it was not used directly [08:32:08] <_joe_> but it should also not be checked anymore [08:32:18] Makes sense [08:32:40] <_joe_> uhhh no I think I know what the problem is [08:32:56] <_joe_> some resources are not properly absented via confd::file I guess [08:33:53] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6002 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:33:53] PROBLEM - Confd vcl based reload on cp6003 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:33:54] <_joe_> so yes, it needs icinga to run puppet [08:34:01] <_joe_> so it will happen on more servers :/ [08:35:27] !log Restarting CI Jenkins [08:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:33] PROBLEM - Confd vcl based reload on cp6006 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:35:35] <_joe_> but these are not actual issues [08:35:46] <_joe_> uh wait [08:36:25] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2039 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:36:39] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp5012 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:36:39] PROBLEM - Confd vcl based reload on cp5006 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:36:40] PROBLEM - Confd vcl based reload on cp5015 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:37:08] <_joe_> ok not sure about these reload fails, stopping puppet on all cp servers [08:37:29] PROBLEM - Confd vcl based reload on cp3058 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:37:33] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:37:33] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6009 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:38:11] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp1077 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:38:17] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2037 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:38:23] PROBLEM - Confd vcl based reload on cp5003 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:38:42] <_joe_> I can only run puppet on the alert server to make these errors go away [08:38:44] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Volans) >>! In T303776#7780384, @Papaul wrote: > ` > Failed to load ldlinux.c32 > ` At first sight this might be an occurrence of this issue: htt... [08:38:51] RECOVERY - Confd vcl based reload on cp6006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:39:23] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3060 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:39:41] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6004 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:39:55] PROBLEM - Confd vcl based reload on cp6004 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:40:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22663 and previous config saved to /var/cache/conftool/dbconfig/20220316-084003-marostegui.json [08:40:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:40:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22664 and previous config saved to /var/cache/conftool/dbconfig/20220316-084011-marostegui.json [08:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22665 and previous config saved to /var/cache/conftool/dbconfig/20220316-084127-marostegui.json [08:41:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:41:31] T297189: Schema change for dropping ft_title and ft_namesapce - 
https://phabricator.wikimedia.org/T297189 [08:41:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:33] (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire - https://alerts.wikimedia.org [08:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:39] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [08:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22666 and previous config saved to /var/cache/conftool/dbconfig/20220316-084140-marostegui.json [08:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22667 and previous config saved to /var/cache/conftool/dbconfig/20220316-084219-marostegui.json [08:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:26] PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:44:50] PROBLEM - Confd vcl based reload on cp6002 is CRITICAL: reload-vcl failed to run since 0h, 14 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:44:50] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:47:00] (03PS1) 10Elukey: install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - 10https://gerrit.wikimedia.org/r/771321 [08:47:26] PROBLEM - Confd vcl based reload on cp2034 is CRITICAL: reload-vcl failed to run since 0h, 16 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:47:35] <_joe_> please ignore those vcl based reload alerts, I'm not even sure why they're happening [08:47:43] <_joe_> I'm going to clean them up soon [08:47:50] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 10 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:48:46] 10SRE, 10Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Peachey88) Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36/ | private paste ]]? For more information... [08:50:00] (03PS2) 10Elukey: install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - 10https://gerrit.wikimedia.org/r/771321 [08:50:36] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [08:51:28] (03CR) 10Elukey: [C: 03+2] install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - 10https://gerrit.wikimedia.org/r/771321 (owner: 10Elukey) [08:52:22] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye [08:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:11] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [08:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:15] RECOVERY - Confd vcl based reload on cp5003 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:55:15] RECOVERY - Confd vcl based reload on cp5006 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:55:51] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4034 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:56:19] RECOVERY - Confd vcl based reload on cp6003 is OK: reload-vcl successfully ran 0h, 4 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:56:19] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 4 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:56:33] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3056 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:56:41] PROBLEM - Confd vcl based reload on cp3063 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [08:56:49] <_joe_> again sorry for the noise, please disregard [08:57:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22668 and previous config saved to /var/cache/conftool/dbconfig/20220316-085724-marostegui.json [08:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:11] RECOVERY - Confd vcl based reload on cp4028 is OK: reload-vcl successfully ran 0h, 6 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [08:58:19] PROBLEM - Confd vcl based reload on cp4022 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [08:58:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:58:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:25] RECOVERY - Confd vcl based reload on cp2034 is OK: reload-vcl successfully ran 0h, 7 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:00:20] (03PS1) 10Elukey: install_server: add more options to kubernetes-node-virtual-overlay [puppet] - 10https://gerrit.wikimedia.org/r/771323 [09:00:27] PROBLEM - Confd vcl based reload on cp5013 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:01:19] (03CR) 10Elukey: [C: 03+2] install_server: add more options to kubernetes-node-virtual-overlay [puppet] - 10https://gerrit.wikimedia.org/r/771323 (owner: 10Elukey) [09:02:59] RECOVERY - Confd vcl based reload on cp3058 is OK: reload-vcl successfully ran 0h, 11 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [09:04:11] PROBLEM - Confd vcl based reload on cp3050 is CRITICAL: reload-vcl failed to run since 0h, 10 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:04:13] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:04:21] RECOVERY - Confd vcl based reload on cp4021 is OK: reload-vcl successfully ran 0h, 12 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:05:17] RECOVERY - Confd vcl based reload on cp5015 is OK: reload-vcl successfully ran 0h, 13 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:09:17] PROBLEM - Confd vcl based reload on cp2040 is CRITICAL: reload-vcl failed to run since 0h, 17 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:09:20] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (10Marostegui) I have done quite a bunch of testing and so far I have not been able to reproduce the crashes when doing 10.4... [09:09:28] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] PROBLEM - Confd vcl based reload on cp3056 is CRITICAL: reload-vcl failed to run since 0h, 15 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:10:15] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2028 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:11:30] PROBLEM - Confd vcl based reload on cp2028 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [09:12:14] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22669 and previous config saved to /var/cache/conftool/dbconfig/20220316-091229-marostegui.json [09:12:32] PROBLEM - Confd vcl based reload on cp4034 is CRITICAL: reload-vcl failed to run since 0h, 19 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:12:32] PROBLEM - Confd vcl based reload on cp4036 is CRITICAL: reload-vcl failed to run since 0h, 19 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:34] PROBLEM - Confd vcl based reload on cp5014 is CRITICAL: reload-vcl failed to run since 0h, 13 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:12:52] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3065 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:13:46] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4026 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:13:48] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3054 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:15:23] PROBLEM - Confd vcl based reload on cp3055 is CRITICAL: reload-vcl failed to run since 0h, 16 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [09:15:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 T303498', diff saved to https://phabricator.wikimedia.org/P22670 and previous config saved to /var/cache/conftool/dbconfig/20220316-091533-marostegui.json [09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:38] T303498: Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 [09:15:57] RECOVERY - Confd vcl based reload on cp6004 is OK: reload-vcl successfully ran 0h, 24 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:16:15] !log revert mx1001/mx2001 to the Bullseye version of Exim T303738 [09:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:29] (03PS1) 10Marostegui: db1099: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/771324 (https://phabricator.wikimedia.org/T303498) [09:17:01] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2033 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:17:09] PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4033 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:17:09] PROBLEM - Confd vcl based reload on cp3065 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:17:46] (03CR) 10Marostegui: [C: 03+2] db1099: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/771324 (https://phabricator.wikimedia.org/T303498) (owner: 10Marostegui) [09:18:21] PROBLEM - Confd vcl based reload on cp3059 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [09:18:21] PROBLEM - Confd vcl based reload on cp1080 is CRITICAL: reload-vcl failed to run since 0h, 22 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:18:39] PROBLEM - Confd vcl based reload on cp4035 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:18:41] PROBLEM - Confd vcl based reload on cp4033 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:19:05] PROBLEM - Confd vcl based reload on cp5008 is CRITICAL: reload-vcl failed to run since 0h, 21 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:19:05] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 27 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:19:29] PROBLEM - Confd vcl based reload on cp4025 is CRITICAL: reload-vcl failed to run since 0h, 21 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:19:33] PROBLEM - Confd vcl based reload on cp1086 is CRITICAL: reload-vcl failed to run since 0h, 26 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:20:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22671 and previous config saved to /var/cache/conftool/dbconfig/20220316-092004-marostegui.json [09:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:10] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [09:20:35] PROBLEM - Confd vcl based reload on cp1081 is CRITICAL: reload-vcl failed to run since 0h, 8 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:20:59] PROBLEM - Confd vcl based reload on cp3051 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [09:21:01] RECOVERY - Confd vcl based reload on cp6002 is OK: reload-vcl successfully ran 0h, 29 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:21:23] PROBLEM - Confd vcl based reload on cp2038 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:21:23] PROBLEM - Confd vcl based reload on cp3052 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:21:49] PROBLEM - Confd vcl based reload on cp4026 is CRITICAL: reload-vcl failed to run since 0h, 11 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:21:55] PROBLEM - Confd vcl based reload on cp5007 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:23:23] PROBLEM - Confd vcl based reload on cp3061 is CRITICAL: reload-vcl failed to run since 0h, 30 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:24:11] RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 32 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:13] RECOVERY - Confd vcl based reload on cp4035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:13] RECOVERY - Confd vcl based reload on cp3065 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:15] RECOVERY - Confd vcl based reload on cp3051 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:17] RECOVERY - Confd vcl based reload on cp4033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [09:25:38] (03PS3) 10DCausse: Replace Swift native API with S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski) [09:25:41] RECOVERY - Confd vcl based reload on cp5013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:49] RECOVERY - Confd vcl based reload on cp2038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:51] RECOVERY - Confd vcl based reload on cp3052 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:25:53] RECOVERY - Confd vcl based reload on cp5008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:19] RECOVERY - Confd vcl based reload on cp1080 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:21] RECOVERY - Confd vcl based reload on cp3059 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:21] RECOVERY - Confd vcl based reload on cp1081 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:23] RECOVERY - Confd vcl based reload on cp4025 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:31] RECOVERY - Confd vcl based reload on cp4026 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:33] RECOVERY - Confd vcl based reload on cp1086 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:26:43] RECOVERY - Confd vcl based reload on cp5007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [09:26:43] RECOVERY - Confd vcl based reload on cp3061 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:27:11] RECOVERY - Confd vcl based reload on cp3055 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22672 and previous config saved to /var/cache/conftool/dbconfig/20220316-092735-marostegui.json [09:27:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:27:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:39] (03CR) 10DCausse: [C: 03+1] "Must be deployed with care and I think a safe approach is to simply completely delete the deployment and the corresponding data on swift (" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski) [09:27:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [09:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22673 and previous config saved to /var/cache/conftool/dbconfig/20220316-092742-marostegui.json [09:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:59] RECOVERY - Confd vcl based reload on cp3050 is OK: reload-vcl successfully ran 0h, 2 
minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:29:25] PROBLEM - TFTP service on install1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [09:29:29] RECOVERY - Confd vcl based reload on cp3063 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:29:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22674 and previous config saved to /var/cache/conftool/dbconfig/20220316-092947-marostegui.json [09:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:03] PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_varnish-frontend-hospital.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:13] (03PS1) 10Elukey: install_server: add missing 'echo' for kubernetes vms in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/771325 [09:33:19] RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [09:34:06] (03CR) 10Ladsgroup: "don't merge it, I need to review it 😄" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [09:34:08] (03PS1) 10Vgutierrez: aptrepo:update-keys: Refresh gitlab key [puppet] - 10https://gerrit.wikimedia.org/r/771326 [09:35:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22675 and previous config saved to /var/cache/conftool/dbconfig/20220316-093509-marostegui.json [09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:29] (03CR) 10Elukey: [C: 03+2] install_server: add missing 'echo' for kubernetes 
vms in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/771325 (owner: 10Elukey) [09:36:05] !log T293862: manually restarted blazegraph on wdqs1010 with "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake" [09:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:09] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [09:36:13] RECOVERY - Confd vcl based reload on cp2028 is OK: reload-vcl successfully ran 0h, 10 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:36:36] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10jcrespo) > @jcrespo Did you test the POC I ment... [09:38:37] RECOVERY - Confd vcl based reload on cp3056 is OK: reload-vcl successfully ran 0h, 12 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:39:35] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 13 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:39:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/771326 (owner: 10Vgutierrez) [09:40:53] RECOVERY - Confd vcl based reload on cp4034 is OK: reload-vcl successfully ran 0h, 15 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:41:39] RECOVERY - Confd vcl based reload on cp4022 is OK: reload-vcl successfully ran 0h, 15 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:42:18] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/771326 (owner: 10Vgutierrez) [09:42:23] RECOVERY - Confd vcl based reload on cp2040 is OK: reload-vcl successfully ran 0h, 16 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [09:42:59] RECOVERY - Confd vcl based reload on cp4036 is OK: reload-vcl successfully ran 0h, 17 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:43:01] RECOVERY - Confd vcl based reload on cp5014 is OK: reload-vcl successfully ran 0h, 17 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22676 and previous config saved to /var/cache/conftool/dbconfig/20220316-094452-marostegui.json [09:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:10] (03CR) 10Vgutierrez: [C: 03+2] aptrepo:update-keys: Refresh gitlab key [puppet] - 10https://gerrit.wikimedia.org/r/771326 (owner: 10Vgutierrez) [09:46:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:46:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [09:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:29] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) >>! In T281249#7780918, @jcrespo wr... 
[09:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22677 and previous config saved to /var/cache/conftool/dbconfig/20220316-095014-marostegui.json [09:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:31] (03PS1) 10Marostegui: db1149: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/771329 (https://phabricator.wikimedia.org/T266869) [09:55:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS buster [09:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS buster executed with errors: -... [09:55:53] (03CR) 10Marostegui: [C: 03+2] db1149: Remove candidate master [puppet] - 10https://gerrit.wikimedia.org/r/771329 (https://phabricator.wikimedia.org/T266869) (owner: 10Marostegui) [09:55:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1070.eqiad.wmnet with OS stretch [09:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch executed with errors:... 
[09:56:28] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:56:29] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS stretch [09:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:33] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch executed with errors:... [09:59:30] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10jcrespo) > When we migrated to dbctl, we lost t... [09:59:54] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) Fixed dbctl notes for s4. Checked a... 
[09:59:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22678 and previous config saved to /var/cache/conftool/dbconfig/20220316-095957-marostegui.json [10:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:55] !log installing openssl security updates [10:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:25] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia [10:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22679 and previous config saved to /var/cache/conftool/dbconfig/20220316-100519-marostegui.json [10:05:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:05:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:23] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [10:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22680 and previous config saved to /var/cache/conftool/dbconfig/20220316-100527-marostegui.json [10:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:37] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) 
rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) @elukey thanks for the heads up. Yes this is very worrying, we have the same thing on, for instance ms-be1069, which is connected to lsw1-e2-eqiad.... [10:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [10:06:46] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:13:14] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:15:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22681 and previous config saved to /var/cache/conftool/dbconfig/20220316-101502-marostegui.json [10:15:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:15:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:17] !log rolling restart of ats-tls and ats-backend to catch up on OpenSSL updates [10:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:09] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:16:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 10 hosts with reason: Maintenance [10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 10 hosts with reason: Maintenance [10:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:17:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22682 and previous config saved to /var/cache/conftool/dbconfig/20220316-101729-marostegui.json [10:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22683 and previous config saved to /var/cache/conftool/dbconfig/20220316-101848-marostegui.json [10:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] (03CR) 10Awight: "This 
change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (owner: 10Awight) [10:28:14] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [10:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:22] RECOVERY - traffic_server backend process restarted on cp3051 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3051&var-layer=backend [10:29:59] (03PS1) 10Muehlenhoff: Remove access for ppechelko [puppet] - 10https://gerrit.wikimedia.org/r/771332 [10:30:20] (03PS8) 10Jbond: C:java: Refactor java code to work with cloud [puppet] - 10https://gerrit.wikimedia.org/r/770930 [10:31:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34340/console" [puppet] - 10https://gerrit.wikimedia.org/r/770930 (owner: 10Jbond) [10:33:42] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for ppechelko [puppet] - 10https://gerrit.wikimedia.org/r/771332 (owner: 10Muehlenhoff) [10:33:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22684 and previous config saved to /var/cache/conftool/dbconfig/20220316-103353-marostegui.json [10:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:40:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [10:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:42:26] (03CR) 10WMDE-Fisch: [C: 03+1] Deploy template features to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (owner: 10Awight) [10:42:59] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [10:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] (03PS1) 10Muehlenhoff: Remove access for accraze [puppet] - 10https://gerrit.wikimedia.org/r/771333 [10:44:09] (03PS1) 10Marostegui: switchover-tmpl.sh: Remove communication related steps [software] - 10https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) [10:46:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for accraze [puppet] - 10https://gerrit.wikimedia.org/r/771333 (owner: 10Muehlenhoff) [10:46:34] (03CR) 10Filippo Giunchedi: [C: 04-1] "Alert itself LGTM, though the alerting file will need to be deployed as a global rule (i.e. thanos)" [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [10:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22685 and previous config saved to /var/cache/conftool/dbconfig/20220316-104637-marostegui.json [10:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:42] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [10:48:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22686 and previous config saved to /var/cache/conftool/dbconfig/20220316-104858-marostegui.json [10:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [10:50:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [10:51:09] (03CR) 10Krinkle: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: 10Ahmon Dancy) [10:51:53] (03CR) 10Filippo Giunchedi: grafana ldap users sync: enable retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [10:52:11] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ayounsi) `name=initial PXE boot sequence CLIENT MAC ADDR: B0 26 28 29 5D F0 GUID: 4C4C4544-005A-5910-805A-C4C04F515032 CLIENT IP: 10.64.20.43 MA... [10:55:14] !log rolling upgrade to HAProxy 2.4.15 on cache nodes [10:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:45] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ayounsi) Is it possible to upgrade PXE? 
The current version seems quite old: 20150819 [10:58:14] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:58:22] (03PS1) 10Muehlenhoff: Remove twentyafterfour from various access groups [puppet] - 10https://gerrit.wikimedia.org/r/771337 [10:59:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove twentyafterfour from various access groups [puppet] - 10https://gerrit.wikimedia.org/r/771337 (owner: 10Muehlenhoff) [11:01:21] (03CR) 10Jbond: [C: 03+1] Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [11:01:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22687 and previous config saved to /var/cache/conftool/dbconfig/20220316-110142-marostegui.json [11:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:06] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: add coverage for work.gcalendarLink() [software] - 10https://gerrit.wikimedia.org/r/768142 (owner: 10Krinkle) [11:03:13] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: Use Date.parse() and assert.propContains() [software] - 10https://gerrit.wikimedia.org/r/768141 (owner: 10Krinkle) [11:04:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22688 and previous config saved to /var/cache/conftool/dbconfig/20220316-110403-marostegui.json [11:04:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:04:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:04:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22689 and previous config saved to /var/cache/conftool/dbconfig/20220316-110411-marostegui.json [11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:07] (03CR) 10Jcrespo: [C: 03+1] switchover-tmpl.sh: Remove communication related steps [software] - 10https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [11:08:18] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:35] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Remove communication related steps [software] - 10https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [11:09:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS buster [11:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:19] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster [11:09:52] (03Merged) 10jenkins-bot: switchover-tmpl.sh: Remove communication related steps [software] - 10https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [11:09:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS 
(DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34342/console" [puppet] - 10https://gerrit.wikimedia.org/r/770930 (owner: 10Jbond) [11:13:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:java: Refactor java code to work with cloud [puppet] - 10https://gerrit.wikimedia.org/r/770930 (owner: 10Jbond) [11:15:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22690 and previous config saved to /var/cache/conftool/dbconfig/20220316-111532-marostegui.json [11:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:37] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:16:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22691 and previous config saved to /var/cache/conftool/dbconfig/20220316-111647-marostegui.json [11:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [11:24:54] (03PS20) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [11:25:14] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [11:26:56] (03CR) 10Jbond: [C: 03+2] reposync: dont catch RepoSyncNoChangeError (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/770003 (owner: 10Jbond) [11:27:39] (03PS1) 10Ayounsi: DNS: add drmrs dcmap ressources [dns] - 
10https://gerrit.wikimedia.org/r/771342 [11:28:16] RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:56] PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:17] (03PS2) 10Awight: Deploy template features to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) [11:29:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [11:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:04] RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 32.63 ms [11:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22692 and previous config saved to /var/cache/conftool/dbconfig/20220316-113037-marostegui.json [11:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:30:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22693 and previous config saved to /var/cache/conftool/dbconfig/20220316-113057-marostegui.json [11:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:06] T298557: Fix mismatching field type of page.page_touched on wmf wikis - 
https://phabricator.wikimedia.org/T298557 [11:31:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22694 and previous config saved to /var/cache/conftool/dbconfig/20220316-113152-marostegui.json [11:31:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [11:31:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:57] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22695 and previous config saved to /var/cache/conftool/dbconfig/20220316-113200-marostegui.json [11:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage [11:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:58] (03Merged) 10jenkins-bot: reposync: dont catch RepoSyncNoChangeError [software/spicerack] - 10https://gerrit.wikimedia.org/r/770003 (owner: 10Jbond) [11:34:56] (03CR) 10Emil Chetty: [C: 03+1] "Im Happy 😊" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [11:40:40] (03PS8) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 
(https://phabricator.wikimedia.org/T229397) [11:42:53] (03PS9) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [11:43:00] (03CR) 10Jbond: "update thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [11:43:33] (03CR) 10MVernon: [V: 03+1 C: 03+2] codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [11:43:37] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [11:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22697 and previous config saved to /var/cache/conftool/dbconfig/20220316-114542-marostegui.json [11:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:53] (03CR) 10Awight: Deploy template features to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: 10Awight) [11:51:41] (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [12:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22698 and previous config saved to /var/cache/conftool/dbconfig/20220316-120047-marostegui.json [12:00:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:00:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db1158.eqiad.wmnet with reason: Maintenance [12:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:00:52] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22699 and previous config saved to /var/cache/conftool/dbconfig/20220316-120100-marostegui.json [12:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:44] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:52] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22700 and previous config saved to /var/cache/conftool/dbconfig/20220316-120219-marostegui.json 
[12:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:26] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:12:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22701 and previous config saved to /var/cache/conftool/dbconfig/20220316-121240-marostegui.json [12:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:44] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:14:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS buster [12:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:14] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**) -... [12:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22702 and previous config saved to /var/cache/conftool/dbconfig/20220316-121724-marostegui.json [12:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:00] (03CR) 10TsepoThoabala: [C: 03+1] Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [12:22:14] (03PS25) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [12:22:16] (03PS14) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [12:22:34] (03PS1) 10MVernon: codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) [12:23:40] (03CR) 10MVernon: "Hi," [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [12:25:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [12:25:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [12:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [12:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS buster [12:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22703 and previous config saved to /var/cache/conftool/dbconfig/20220316-122745-marostegui.json [12:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:54] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster [12:29:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22704 and previous config saved to /var/cache/conftool/dbconfig/20220316-122906-marostegui.json [12:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:11] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22705 and previous config saved to /var/cache/conftool/dbconfig/20220316-123229-marostegui.json [12:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:33] (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire - https://alerts.wikimedia.org [12:42:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22707 and previous config saved to /var/cache/conftool/dbconfig/20220316-124250-marostegui.json [12:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22708 and previous config saved to /var/cache/conftool/dbconfig/20220316-124411-marostegui.json [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22709 and previous config saved to /var/cache/conftool/dbconfig/20220316-124734-marostegui.json [12:47:36] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:47:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:39] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22710 and previous config saved to /var/cache/conftool/dbconfig/20220316-124742-marostegui.json [12:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [12:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22711 and previous config saved to /var/cache/conftool/dbconfig/20220316-124943-marostegui.json [12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage [12:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:46] (03CR) 10Clare Ming: [C: 03+1] Add script to update vector skin preferences (031 comment) [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [12:57:26] 
(03PS5) 10Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) [12:57:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22712 and previous config saved to /var/cache/conftool/dbconfig/20220316-125755-marostegui.json [12:57:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:57:58] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Andrew) Just a note that the task for cloudvirt1024 is T303773, this task is for 1025/1026. They are failing for different reasons AFAICT. [12:57:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [12:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:00] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22713 and previous config saved to /var/cache/conftool/dbconfig/20220316-125803-marostegui.json [12:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22714 and previous config saved to /var/cache/conftool/dbconfig/20220316-125916-marostegui.json [12:59:19] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1300). [13:00:05] awight: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] I can deploy. [13:01:15] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: 10Awight) [13:01:26] (03CR) 10Krinkle: [C: 03+1] Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:01:47] (03PS2) 10Jbond: puppet: add vendored module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 [13:01:58] (03Merged) 10jenkins-bot: Deploy template features to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: 10Awight) [13:02:22] awight: let me know when the backport(s) are done [13:02:51] Krinkle: ack [13:03:05] WMDE-Fisch: new config is on mwdebug1001 [13:03:17] I'll have a look [13:04:05] I see the new features on enwiki [13:04:15] (but not too many :-) [13:04:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22715 and previous config saved to /var/cache/conftool/dbconfig/20220316-130448-marostegui.json [13:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:31] (03CR) 10Clare Ming: [C: 03+1] "waiting for Amir's review -- hopefully this can still be deployed here soon" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [13:06:12] awight: Looks good. I could check off all things we wanted live. [13:06:21] Also the thing we want not live ;-) [13:06:25] Seems to work! [13:06:26] Thanks, syncing. [13:06:38] (03CR) 10Andrew Bogott: [C: 03+2] Fix invalid ref to last_backup_with_snapshot.valid [puppet] - 10https://gerrit.wikimedia.org/r/770999 (https://phabricator.wikimedia.org/T303870) (owner: 10Andrew Bogott) [13:07:30] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771331|Deploy template features to enwiki (T302857)]] (duration: 00m 50s) [13:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:34] T302857: Deploy first template focus-area improvements to enwiki - https://phabricator.wikimedia.org/T302857 [13:08:38] Krinkle: I'm all done, good luck! [13:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:06] awight: thans [13:10:09] Thanks! 
:) [13:10:48] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:20] (03CR) 10Krinkle: [C: 03+2] static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [13:14:09] (03Merged) 10jenkins-bot: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [13:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22716 and previous config saved to /var/cache/conftool/dbconfig/20220316-131421-marostegui.json [13:14:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:14:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:26] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22717 and previous config saved to /var/cache/conftool/dbconfig/20220316-131429-marostegui.json [13:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22718 and previous config saved to /var/cache/conftool/dbconfig/20220316-131953-marostegui.json [13:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:22:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:59] (03PS4) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 [13:24:11] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS buster [13:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:22] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS buster [13:25:07] !log krinkle@deploy1002 Synchronized w/static.php: 159dfd21d (duration: 00m 50s) [13:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:53] !log sukhe@cumin2002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS buster [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**) -... [13:26:10] (03CR) 10Volans: [C: 03+2] Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [13:27:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:28:37] (03Merged) 10jenkins-bot: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [13:31:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22720 and previous config saved to /var/cache/conftool/dbconfig/20220316-133153-marostegui.json [13:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [13:34:48] (03PS2) 10BBlack: DNS: add drmrs dcmap ressources [dns] - 10https://gerrit.wikimedia.org/r/771342 (owner: 10Ayounsi) [13:34:50] (03PS1) 10BBlack: geo-res: align whitespace (no-op) [dns] - 10https://gerrit.wikimedia.org/r/771353 [13:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22721 and previous config saved to /var/cache/conftool/dbconfig/20220316-133458-marostegui.json [13:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:35:05] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:36:17] (03CR) 10BBlack: [C: 03+2] geo-res: align whitespace (no-op) [dns] - 10https://gerrit.wikimedia.org/r/771353 (owner: 10BBlack) [13:42:55] (03PS3) 10Jbond: puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 [13:44:02] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [13:44:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [13:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [13:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22722 and previous config saved to /var/cache/conftool/dbconfig/20220316-134658-marostegui.json [13:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:45] (03CR) 10Volans: [C: 03+1] "LGTM (1 typo inline)" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:53:46] (03CR) 10Jbond: "pcc[1] shows no op" [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [13:57:02] (03PS10) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [13:57:04] (03CR) 10Jbond: "done thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769983 
(https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:57:22] (03CR) 10Filippo Giunchedi: [C: 03+1] codfw-prod: rebalance the rings (031 comment) [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [13:57:32] (03CR) 10Ladsgroup: "I couldn't check it in depth as I'm not 100% familiar with how user preferences work. That being said, here are the suggestions:" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [13:57:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS buster [13:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:53] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster [13:58:57] (03CR) 10BBlack: [C: 03+2] DNS: add drmrs dcmap ressources [dns] - 10https://gerrit.wikimedia.org/r/771342 (owner: 10Ayounsi) [14:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22723 and previous config saved to /var/cache/conftool/dbconfig/20220316-140203-marostegui.json [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:24] (03PS1) 10Ayounsi: GeoDNS Cyprus to drmrs [dns] - 10https://gerrit.wikimedia.org/r/771354 [14:02:29] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [14:03:35] (03PS1) 10Elukey: install_server: add the flat-noswap.cfg recipe/override [puppet] - 
10https://gerrit.wikimedia.org/r/771355 [14:03:37] (03PS1) 10Elukey: install_server: move kubernetes200[5,6] to the new flat-noswap recipe [puppet] - 10https://gerrit.wikimedia.org/r/771356 (https://phabricator.wikimedia.org/T300744) [14:04:40] jouncebot: nowandnext [14:04:40] No deployments scheduled for the next 3 hour(s) and 55 minute(s) [14:04:40] In 3 hour(s) and 55 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [14:04:40] In 3 hour(s) and 55 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [14:05:01] (03PS1) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) [14:05:32] (03PS1) 10Majavah: Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861) [14:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [14:06:02] (03CR) 10Majavah: [C: 03+2] Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861) (owner: 10Majavah) [14:07:22] (03PS1) 10Jbond: nagios_common: change ssle warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) [14:08:54] (03Merged) 10jenkins-bot: Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861) (owner: 10Majavah) [14:09:02] 
(03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34347/console" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [14:09:22] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34346/console" [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond) [14:10:09] !log grafana1002:~# systemctl restart grafana-ldap-users-sync.service T303064 [14:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] T303064: grafana-ldap-users-sync fails to finish intermittently - https://phabricator.wikimedia.org/T303064 [14:12:36] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:51] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/CentralAuth/includes/User/CentralAuthUser.php: Backport: [[gerrit:770942|Replace use of deprecated RecentChange::getEngine (T303861)]] (duration: 00m 51s) [14:12:53] (03CR) 10Elukey: "Alex, I know that you had questions about the priority and max partition size, but for this code review I tried to change as few items as " [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey) [14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] T303861: PHP Deprecated: Use of RecentChange::getEngine was deprecated in MediaWiki 1.29. 
[Called from MediaWiki\Extension\CentralAuth\User\CentralAuthUser::attach] - https://phabricator.wikimedia.org/T303861 [14:13:08] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) I updated that script to completely... [14:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:59] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Agreed re: tweaking sizes in a followup change" [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey) [14:15:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:15:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:02] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Volans) >>! In T281249#7781991, @Ladsgroup wrot... 
[14:16:06] (03Abandoned) 10Ssingh: Add Wikidough's /24 to bgp_out in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/757635 (owner: 10Ssingh) [14:16:38] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [14:17:02] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) >>! In T281249#7781991, @Ladsgroup... [14:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22724 and previous config saved to /var/cache/conftool/dbconfig/20220316-141708-marostegui.json [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:17:12] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [14:17:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [14:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [14:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] !log 
sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage [14:17:38] (03PS1) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 [14:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22725 and previous config saved to /var/cache/conftool/dbconfig/20220316-141918-marostegui.json [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:20:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:42] (03CR) 10Elukey: [C: 03+2] install_server: add the flat-noswap.cfg recipe/override [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey) [14:25:13] !log depooling ms-fe100[5-8] prior to decommissioning T303733 [14:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:17] T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 [14:26:38] (03CR) 10Ayounsi: "As data point I ran 2 RIPE measurements from Cyprus to esams and drmrs:" [dns] - 10https://gerrit.wikimedia.org/r/771354 (owner: 10Ayounsi) [14:30:16] (03PS1) 10MVernon: swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) [14:30:51] (03PS2) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 [14:33:46] jouncebot: nowandnext [14:33:47] No deployments scheduled for 
the next 3 hour(s) and 26 minute(s) [14:33:47] In 3 hour(s) and 26 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [14:33:47] In 3 hour(s) and 26 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [14:34:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22726 and previous config saved to /var/cache/conftool/dbconfig/20220316-143423-marostegui.json [14:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] Amir1: thanks [14:35:02] (03PS3) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 [14:35:12] cjming: there is no deployment happening it seems, the floor is yours [14:35:18] (03CR) 10MVernon: "I think I caught all the necessary changes in one CR this time :)" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon) [14:35:52] !log add anycast6 peers in drmrs [14:35:54] Amir1: cool [14:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:52] fyi for all, I'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/770937 [14:37:01] ^ in the next few [14:39:26] (03CR) 10Elukey: [C: 03+2] install_server: move kubernetes200[5,6] to the new flat-noswap recipe [puppet] - 10https://gerrit.wikimedia.org/r/771356 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:39:52] (03PS2) 10Herron: watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) [14:40:50] (03CR) 10Herron: watchrat: require 3+ sites to agree on error status before alerting (031 comment) [alerts] - 
10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [14:42:53] (03CR) 10Filippo Giunchedi: [C: 03+1] watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [14:43:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS buster [14:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**) -... [14:44:26] (03CR) 10Clare Ming: [C: 03+2] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [14:45:07] (03CR) 10Ayounsi: [C: 03+1] "Anycast neighbors manually configured on the switches." [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh) [14:45:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [14:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:15] (03CR) 10CDanis: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/770944 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [14:46:32] (03Merged) 10jenkins-bot: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [14:46:38] (03CR) 10CDanis: [C: 03+2] Cross-ref Grafana dashboard in statograph hiera [puppet] - 10https://gerrit.wikimedia.org/r/770944 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [14:46:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS buster [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:04] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster [14:47:28] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[14:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:06] (03PS1) 10JMeybohm: Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) [14:48:59] (03CR) 10JMeybohm: [C: 03+1] nagios_common: change ssle warnings from 10 days to 9 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond) [14:49:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22727 and previous config saved to /var/cache/conftool/dbconfig/20220316-144928-marostegui.json [14:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:25] (03PS3) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [14:50:42] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:51:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:53] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:52:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart 
(exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:53:04] !log rolling restart of pdns-recursor.service and dnsdist.service on doh* hosts for OpenSSL updates [14:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:24] !log restarting nginx/dhcpd on install/apt servers [14:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:23] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaMaintenance/T299104.php: Backport: [[gerrit:770937|Add script to update vector skin preferences (T299104)]] (duration: 00m 51s) [14:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:27] T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104 [14:55:45] !log rolling restart of nginx.service on durum* hosts for OpenSSL updates [14:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10cmooney) FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing rows. 
[14:56:02] (03CR) 10Jbond: [C: 03+2] puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 (owner: 10Jbond) [14:56:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:57:41] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:58:42] (03Merged) 10jenkins-bot: puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 (owner: 10Jbond) [14:59:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:59:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22728 and previous config saved to /var/cache/conftool/dbconfig/20220316-145946-marostegui.json [14:59:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:50] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:00:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) [15:02:28] (03PS5) 10MVernon: swift: deploy swift_ring_manager to one node per 
cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:03:48] (03CR) 10AOkoth: [C: 03+2] vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [15:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22729 and previous config saved to /var/cache/conftool/dbconfig/20220316-150433-marostegui.json [15:04:35] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:04:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:04:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:38] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:05:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:05:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:59] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10dancy) I verified that I can run docker commands now. Thanks @Joe! [15:07:02] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for ssh-gitlab [puppet] - 10https://gerrit.wikimedia.org/r/771362 (https://phabricator.wikimedia.org/T135991) [15:08:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [15:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] jouncebot nowandnext [15:09:59] No deployments scheduled for the next 2 hour(s) and 50 minute(s) [15:09:59] In 2 hour(s) and 50 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [15:09:59] In 2 hour(s) and 50 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [15:11:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [15:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:19] !log Testing mediawiki image build on deploy server again [15:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:29] (03PS1) 10Btullis: Add dummy deployment user/tokens for datahub [labs/private] - 10https://gerrit.wikimedia.org/r/771363 (https://phabricator.wikimedia.org/T303049) [15:11:52] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) I have created deployment users and tokens in `profile::kubernetes::infrastructure_users:` key in the private repo, as well as corresponding dummy valu... 
[15:12:03] !log dancy@deploy1002 Started scap: (no justification provided) [15:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:58] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) >>! In T281249#7782008, @Volans wrot... [15:14:48] (03PS1) 10Jbond: puppet_compiler: bump software version [puppet] - 10https://gerrit.wikimedia.org/r/771366 [15:15:41] !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediaw [15:15:42] iki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. 
(duration: 03m 38s) [15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:33] !log dancy@deploy1002 Started scap: testing mediawiki image build [15:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:27] (03PS2) 10Jbond: nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) [15:18:35] (03CR) 10Jbond: nagios_common: change ssl warnings from 10 days to 9 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond) [15:19:02] (03CR) 10Jbond: [C: 03+1] Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: 10JMeybohm) [15:19:29] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump software version [puppet] - 10https://gerrit.wikimedia.org/r/771366 (owner: 10Jbond) [15:24:14] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:21] (03CR) 10Klausman: "directory structure fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/770973" [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [15:28:50] (03PS1) 10Urbanecm: cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 [15:28:52] jouncebot: nowandnext [15:28:52] No deployments scheduled for the next 2 hour(s) and 31 minute(s) [15:28:52] In 2 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [15:28:52] In 2 hour(s) and 31 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [15:28:58] let me push the above out ^^ [15:29:05] 
urbanecm: check with dancy [15:29:18] (03CR) 10Urbanecm: [C: 03+2] cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm) [15:29:25] sorry [15:29:31] dancy: may i? :) [15:29:36] (cancelled the +2 for now) [15:29:59] urbanecm: If it's urgent, I can cancel my operation and restart it after. I suspect it'll take about 30 more minutes to complete. [15:30:06] i can wait 30m [15:30:24] ok. If it goes longer than that I'll cancel. [15:35:15] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:23] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [15:35:23] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [15:35:32] PROBLEM - Check size of conntrack table on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:35:32] PROBLEM - Check systemd state on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye 
executed... [15:36:06] PROBLEM - puppet last run on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:36:06] (03PS2) 10Ladsgroup: idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [15:37:30] !log [WCQS] Restarted updater across fleet to get out jvm sec upgrades: `ryankemper@cumin1001:~$ sudo -E cumin 'wcqs*' 'systemctl restart wcqs-updater.service'` [15:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:01] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [15:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:08] RECOVERY - Check size of conntrack table on kubernetes2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:38:09] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [15:39:12] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:12] (03CR) 10Hashar: [C: 03+1] Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:40:14] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2005:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [15:41:26] (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:42:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22730 and previous config saved to /var/cache/conftool/dbconfig/20220316-154206-marostegui.json [15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:11] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:42:50] RECOVERY - puppet last run on kubernetes2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:43:17] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [15:43:29] !log dancy@deploy1002 scap failed: RuntimeError dictionary changed size during iteration (duration: 25m 55s) [15:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:48] urbanecm: I tested up to the point that I needed to. All yours now. [15:43:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:56] thanks! 
[15:44:03] (03CR) 10Urbanecm: [C: 03+2] cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm) [15:44:58] (03Merged) 10jenkins-bot: cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm) [15:45:20] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:45:26] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [15:46:08] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: cswiki celebration logos (duration: 00m 50s) [15:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:41] (03PS3) 10Ladsgroup: idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [15:47:53] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [15:49:02] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: cswiki celebration logo (duration: 00m 49s) [15:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:55] dancy: i'm done. if you have anything else to test, feel free to resume [15:51:02] great, I shall. 
[15:51:12] !log restarting exim/spamassassin on MXes to pick up new OpenSSL [15:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS buster [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:21] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**) -... [15:52:43] (03CR) 10Ladsgroup: "PCC looks good to me: https://puppet-compiler.wmflabs.org/pcc-worker1002/1244/dborch1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [15:52:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:52:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:56] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [15:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298557)', diff saved to https://phabricator.wikimedia.org/P22731 and previous config saved to /var/cache/conftool/dbconfig/20220316-155300-marostegui.json [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] !log sukhe@puppetmaster1001 conftool action
: set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [15:53:05] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [15:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22732 and previous config saved to /var/cache/conftool/dbconfig/20220316-155711-marostegui.json [15:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:11] !log dancy@deploy1002 Synchronized README: testing mediawiki image build (duration: 02m 11s) [15:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:42] !log analytics/refinery - scap deply "Migrate session_length/daily from Oozie to Airflow" [16:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:21] (03PS1) 10Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) [16:03:47] (03CR) 10Vgutierrez: [C: 03+1] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup) [16:05:39] (03PS1) 10MVernon: codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771375 (https://phabricator.wikimedia.org/T303507) 
[16:07:03] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10CDanis) As of yesterday, instructions have been shared with the SRE... [16:07:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS buster [16:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:23] (03CR) 10MVernon: [V: 03+2 C: 03+2] "Another routine operation, so self-reviewing." [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771375 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [16:07:28] RECOVERY - Check systemd state on kubernetes2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:30] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster [16:08:26] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:26] (03PS3) 10Ladsgroup: mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) [16:09:30] (03CR) 10Ladsgroup: [C: 03+2] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup) [16:09:31] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - 
https://phabricator.wikimedia.org/T202061 (10CDanis) >>! In T202061#7767276, @CDanis wrote: > [ ... ] > I'll put the above in a... [16:10:06] (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond) [16:10:11] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup) [16:10:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: 10JMeybohm) [16:10:41] 10SRE, 10Traffic, 10User-Ladsgroup: Rework education.wikimedia.org redirects - https://phabricator.wikimedia.org/T303397 (10Ladsgroup) 05Open→03Resolved [16:12:05] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon) [16:12:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22733 and previous config saved to /var/cache/conftool/dbconfig/20220316-161216-marostegui.json [16:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:54] (03CR) 10JHathaway: [C: 03+2] Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [16:18:47] (03PS2) 10MVernon: swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) [16:19:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:59] (03CR) 10MVernon: swift: remove ms-fe100[5-8] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon) [16:21:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon) [16:22:28] !log aqu@deploy1002 Started deploy [analytics/refinery@d039471]: Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] [16:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:40] (03CR) 10MVernon: [C: 03+2] swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon) [16:27:01] (03CR) 10Jbond: [C: 03+2] nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond) [16:27:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22734 and previous config saved to /var/cache/conftool/dbconfig/20220316-162721-marostegui.json [16:27:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:27:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:26] T297189: Schema change for dropping ft_title and ft_namesapce - 
https://phabricator.wikimedia.org/T297189 [16:27:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [16:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:45] Emperor: happy for me to merge yours [16:28:02] !log moving swiftrepl and stats reporter host from ms-fe1005 to ms-fe1009 T303733 [16:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:06] T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 [16:28:18] jbond: OK, thanks [16:29:38] (03PS6) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [16:30:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [16:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:07] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:24] Gah! [16:31:29] (03PS7) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [16:32:32] 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10Cmjohnson) @BTullis Can you plan to shut this down tomorrow 17 March at 10a EST 1400 UTC. 
[16:33:18] 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10BTullis) Yes, will do. Both nodes at the same time? [16:34:28] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.timer,swift_dispersion_stats_lowlatency.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:59] !log rolling restart of ms-fe10[09-12] so they know about removal of older proxies T303733 [16:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:02] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:37:03] T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 [16:37:11] (03PS2) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) [16:37:47] (03CR) 10jerkins-bot: [V: 04-1] grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [16:38:28] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:39:39] (03PS3) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) [16:40:45] (03PS8) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [16:44:01] (03CR) 10JHathaway: "John I believe this is ready for another review, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [16:45:18] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) @CDanis I think this is probably good to close, we can always... [16:45:29] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [16:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:37] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed... [16:47:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) Shouldn't be an issue with installing these in E4 / F4. However the configuration of the switches there won't be compl... 
[16:48:17] !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471]: Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 25m 49s) [16:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:28] PROBLEM - Number of mw swift objects in codfw greater than eqiad on alert1001 is CRITICAL: execution: found duplicate series for the match group {account=mw-media, class=deleted} on the right hand-side of the operation: [{__name__=swift_container_stats_objects_total, account=mw-media, class=deleted, cluster=swift, instance=ms-fe1009:9112, job=statsd_exporter, site=eqiad}, {__name__=swift_container_stats_objects_total, account=mw-media, cl [16:48:28] ted, cluster=swift, instance=ms-fe1005:9112, job=statsd_exporter, site=eqiad}]:many-to-many matching not allowed: matching labels must be unique on one side https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [16:51:55] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) 05Open→03Resolved [16:51:59] (03Abandoned) 10Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: 10Clare Ming) [16:52:01] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [16:53:01] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jhathaway) [16:53:45] 10Puppet, 
10Infrastructure-Foundations, 10Patch-For-Review: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) 05Open→03Resolved Community modules have now been moved to vendor_modules, thanks everyone for the discussion & feedback. [16:56:32] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: execution: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right) https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [17:00:02] Emperor: --^ (I guess it is part of maintenance but in case it is not I am pinging you :) [17:00:52] elukey: thanks, yes, this seems to happen when we move swift stats_reporter_host around [17:01:05] it should resolve in ~10m or so [17:03:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:03:44] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) I'm messing around with the perccli64 binary, but I admit its new to me and I'm not versed in it at all. Additionally, the dumpsdata1`007 host isn't setup ideally, as I couldn't get the installer... 
[17:04:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:04:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:10] !log aqu@deploy1002 Started deploy [analytics/refinery@d039471] (thin): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] [17:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:17] !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471] (thin): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 00m 07s) [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:35] !log aqu@deploy1002 Started deploy [analytics/refinery@d039471] (hadoop-test): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] [17:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:36] (03PS1) 10Majavah: P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) [17:11:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34360/console" [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [17:11:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster [17:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:08] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs 
datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**) -... [17:12:13] (03PS2) 10Majavah: P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) [17:12:37] (03PS26) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:12:39] (03PS15) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:13:59] !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471] (hadoop-test): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 07m 23s) [17:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:19] (03CR) 10jerkins-bot: [V: 04-1] varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [17:14:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34361/console" [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [17:17:19] godog: the Number of mw swift objects in codfw greater than eqiad alerts don't seem to be self-resolving this time; any ideas? AFAICT swift_dispersion_stats.service on ms-fe1009 is happy... 
[17:21:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [17:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:20] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [17:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [17:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:53] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) Echo of my testing so far: setting the drive info via show and setting it to on or offline works, but not setting to missing or sending rebuild command ` root@dumpsdata1007:/usr/local/bin# percc... [17:22:11] (03PS2) 10Milimetric: Eventlogging: Remove unused RUM Speed Index. [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [17:22:50] (03PS27) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:22:52] (03PS16) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:22:57] (03CR) 10Volans: [C: 03+1] "LGTM for the use of wmflib, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [17:23:13] (03CR) 10Milimetric: [C: 03+1] "+1 for me to remove, but I can't merge in this repo. 
I echo @ottomata's comment on removing from wgEventStreams" [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [17:24:09] (03PS1) 10Majavah: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) [17:24:22] (03PS1) 10Btullis: Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049) [17:25:21] (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [17:27:06] (03PS2) 10Majavah: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) [17:28:09] (03PS1) 10Btullis: Add a namespace for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049) [17:31:03] (03PS1) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [17:31:05] (03PS1) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [17:32:09] jouncebot nowandnext [17:32:10] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [17:32:10] In 0 hour(s) and 27 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [17:32:10] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800) [17:32:29] (03CR) 10Ahmon Dancy: [C: 03+2] mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 
(https://phabricator.wikimedia.org/T303878) (owner: 10Ahmon Dancy) [17:32:53] (03CR) 10jerkins-bot: [V: 04-1] P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [17:33:17] (03PS2) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [17:33:35] (03Merged) 10jenkins-bot: mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: 10Ahmon Dancy) [17:34:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34363/console" [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [17:36:52] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:771001|mwscript: Support --force-version flag (T303878)]] (duration: 00m 57s) [17:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:57] T303878: multiversion/MWScript.php: Allow specifying a specific version of code to run - https://phabricator.wikimedia.org/T303878 [17:37:28] (03CR) 10Ottomata: [C: 03+1] P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [17:39:06] (03CR) 10Ottomata: [C: 03+1] "I think this is a good idea. 
I expect that some people's data code might break, esp if they are hitting the MW API from within analytics " [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [17:47:35] (03PS1) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415 [17:48:39] (03PS4) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) [17:49:21] (03CR) 10Cwhite: grafana ldap users sync: enable retries (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [17:50:25] (03CR) 10Jbond: P:java: update profile::java to use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond) [17:50:35] (03PS2) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415 [17:51:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34365/console" [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond) [17:52:24] (03PS1) 10Hnowlan: WIP: build docker images using blubber and pip dependencies [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/771416 (https://phabricator.wikimedia.org/T267327) [17:52:29] (03PS1) 10Milimetric: Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389 [17:52:59] RECOVERY - Number of mw swift objects in eqiad greater than codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [17:53:07] RECOVERY - Number of mw swift objects in codfw greater than eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [17:53:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric) [17:54:19] (03CR) 10Majavah: systemd: Add new define to manage user service environments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [17:55:05] (03CR) 10Krinkle: "Note that "current" is only used for /static/current in mw-k8s which effectively receives no traffic currently, so that's essentially a no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [17:57:45] (03PS2) 10Milimetric: Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389 [17:57:58] (03CR) 10Milimetric: [C: 04-1] "hang on for a minute while we check with Olja" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric) [18:00:04] jeena and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800). [18:00:04] jeena and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800). 
[18:00:16] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time [18:00:18] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time [18:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:26] Train is blocked. Sending the email [18:01:25] I kicked the prometheus-statsd-exporter on the old frontend, that is at least coincidental with the alert clearing... [18:02:23] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics@257960f] [18:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:32] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics@257960f] (duration: 00m 08s) [18:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:41] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [18:05:23] (03PS1) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) [18:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [18:05:56] (03PS2) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) [18:06:03] (03CR) 10jerkins-bot: [V: 04-1] 
karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [18:08:14] (03PS3) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) [18:09:18] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics_test@257960f] [18:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:27] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics_test@257960f] (duration: 00m 08s) [18:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:35] (03PS4) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) [18:13:33] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34368/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [18:14:07] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2015 [puppet] - 10https://gerrit.wikimedia.org/r/771422 (https://phabricator.wikimedia.org/T300744) [18:14:09] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2016 [puppet] - 10https://gerrit.wikimedia.org/r/771423 (https://phabricator.wikimedia.org/T300744) [18:14:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:02] (03PS5) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [18:18:37] (03CR) 10Ebernhardson: elasticsearch: remove custom restart handling (031 
comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [18:20:51] (03CR) 10Razzi: "Catalog diff: https://puppet-compiler.wmflabs.org/pcc-worker1002/34368/karapace1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [18:20:59] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [18:22:43] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:27:38] (03CR) 10Ssingh: [C: 03+2] Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh) [18:30:14] (03Merged) 10jenkins-bot: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh) [18:32:52] !log running: homer "cr*-drmrs*" commit "Gerrit 771359: Set up BGP peering in drmrs for Wikidough." [18:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:26] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) Just to follow up I've the TAC case open with Juniper since this morning but they have been slow to respond, and not grasping the exact issue in the... 
[18:44:38] (03PS13) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [18:47:41] (03PS14) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [18:48:59] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:50:20] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [18:54:02] (03PS1) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) [18:54:34] (03CR) 10jerkins-bot: [V: 04-1] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata) [18:55:18] (03PS2) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) [18:56:42] (03CR) 10jerkins-bot: [V: 04-1] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata) [18:57:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Cmjohnson) @cmrooney thanks!! 
[18:58:36] (03PS3) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) [19:00:41] (03PS4) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) [19:01:58] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34372/console" [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata) [19:02:23] (03CR) 10Ottomata: [V: 03+1] "This should be a no-op. Next patch will roll this out in test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata) [19:06:37] (03PS15) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [19:08:53] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata) [19:14:22] !log otto@deploy1002 Started deploy [analytics/refinery@2d2056a] (hadoop-test): (no justification provided) [19:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:06] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:20:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:13] !log otto@deploy1002 Finished deploy [analytics/refinery@2d2056a] (hadoop-test): (no justification provided) (duration: 07m 50s) [19:22:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:41] (03PS1) 10Jbond: wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) [19:39:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34373/console" [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [19:40:12] (03PS1) 10Ssingh: definitions: add drmrs to wikimedia-private [homer/public] - 10https://gerrit.wikimedia.org/r/771438 [19:42:08] (03CR) 10Cwhite: [C: 03+2] grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite) [19:43:29] (03CR) 10Hoo man: [C: 03+1] Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 (owner: 10Lucas Werkmeister (WMDE)) [19:51:51] (03PS16) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [19:56:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:40] 10SRE, 10Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7780748, @Peachey88 wrote: > Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36... [20:00:05] RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T2000). [20:00:05] zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:15] oh no, unfortunate deployments [20:00:27] but i won't disagree with you jouncebot [20:00:30] I can deploy today :-) [20:00:37] hello zabe [20:00:44] hey [20:00:52] (03PS17) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:01:09] zabe: we write to the wmg version already, right? [20:01:49] yes, see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766229 [20:02:01] (03PS2) 10Jbond: wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) [20:02:03] (03PS1) 10Jbond: P:scap::dsh: Add scpa targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [20:02:34] (03PS18) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:02:42] in that case, it should be syncable easily (in more or less any order), right? 
[20:02:52] * urbanecm tries to make sure this patch is safely deployable [20:03:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34374/console" [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [20:03:20] yes [20:04:27] let's do it then [20:04:30] (03PS3) 10Urbanecm: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:04:35] (03CR) 10Ahmon Dancy: P:scap::dsh: Add scpa targets as a dsh group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [20:04:38] (03CR) 10Urbanecm: [C: 03+2] Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:04:52] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:00] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:29] (03Merged) 10jenkins-bot: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:06:33] (03CR) 10Ahmon Dancy: wmflib: add class_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [20:06:43] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [20:07:05] zabe: pulled to 
mwdebug1001, please have a look [20:07:21] jbond: Sorry for typo nitpicks. I'm very happy to see T303559 moving along! [20:07:22] T303559: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 [20:07:43] (03PS1) 10RobH: dumpsdata1006 setup info [puppet] - 10https://gerrit.wikimedia.org/r/771442 (https://phabricator.wikimedia.org/T302937) [20:08:08] (03CR) 10RobH: [C: 03+2] dumpsdata1006 setup info [puppet] - 10https://gerrit.wikimedia.org/r/771442 (https://phabricator.wikimedia.org/T302937) (owner: 10RobH) [20:09:14] urbanecm, lgtm, stuff doesn't seem to break and logstash looks clear [20:09:25] let's try it then [20:10:26] (03PS1) 10BryanDavis: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) [20:10:30] (03PS1) 10BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) [20:10:49] (03PS19) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:11:24] !log urbanecm@deploy1002 Synchronized wmf-config/: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 1/3) (duration: 00m 50s) [20:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:28] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:12:14] !log urbanecm@deploy1002 Synchronized multiversion/: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 2/3) (duration: 00m 50s) [20:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:37] (03CR) 10Majavah: [C: 04-1] "This does not match hosts that only have mediawiki deployed via mediawiki::scap" [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [20:13:04] !log 
urbanecm@deploy1002 Synchronized docroot/noc/db.php: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 3/3) (duration: 00m 49s) [20:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:10] zabe: should be live [20:13:18] as always ,please check logstash for a bit :) [20:13:25] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (10RLazarus) From the time sliders it looks like the issue is that all or part of the pad gets deleted and replaced by a character, at these revisions respectively: - https://etherpad.wikimedia.org/p/T... [20:13:33] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:47] thanks :) [20:14:05] urbanecm: Hi we have 2 late backports (maintenance scripts) [20:14:15] Jdlrobson: sure thing. can you update calendar please? [20:14:44] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata) [20:14:51] urbanecm: will do [20:16:03] 10SRE, 10Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Aklapper) > You do not have permission to view this object. Sorry, should work now. 
[20:16:47] (03Restored) 10Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: 10Clare Ming) [20:17:03] (03CR) 10Jforrester: "❤️" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:17:35] Jdlrobson: please ping me once it's there :) [20:17:49] (03CR) 10Jforrester: "This is a bit of a deploy-trap as written; we normally factor these out into three patches (first remove use from CS, then remove setting " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:18:36] (03PS1) 10Jdlrobson: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) [20:18:54] (03CR) 10BryanDavis: [C: 04-1] DynamicSidebar: remove unused extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:20:43] (03PS2) 10Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) [20:20:54] (03CR) 10jerkins-bot: [V: 04-1] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:21:55] (03PS2) 10BryanDavis: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) [20:21:57] (03PS1) 10BryanDavis: DynamicSidebar: Remove from InitialiseSettings [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/771447 [20:21:59] (03PS1) 10BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 [20:22:44] (03PS3) 10Jdlrobson: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) [20:23:46] (03CR) 10BryanDavis: DynamicSidebar: remove from CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:24:02] urbanecm: hi, can I get a config patch into this window? [20:24:08] kostajh: sure [20:24:13] ok, patch coming [20:24:39] (03PS1) 10Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) [20:24:48] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [20:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:59] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye [20:25:45] (03Abandoned) 10Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: 10Clare Ming) [20:27:04] kostajh: please update the calendar once you have the patch [20:27:23] urbanecm: will do [20:27:24] urbanecm: have updated [20:27:28] thanks Jdlrobson [20:27:40] https://gerrit.wikimedia.org/r/c/771449/ first [20:27:45] https://gerrit.wikimedia.org/r/c/771390/ second 
[20:27:48] both are maintenance scripts [20:27:53] so i guess no syncing needed? [20:27:55] (03CR) 10Urbanecm: [C: 03+2] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:27:57] (03CR) 10Urbanecm: [C: 03+2] Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:28:02] (03CR) 10Clare Ming: [C: 03+1] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:28:04] we're planning to run them after the window closes. [20:28:12] Jdlrobson: i need to sync them so they get to the maint script [20:28:15] *maint server [20:28:16] (03CR) 10Clare Ming: [C: 03+1] Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:28:18] urbanecm: got it. Thanks! [20:28:48] but there will be no testing needed :) [20:29:29] (03PS20) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:30:06] urbanecm: can you remind me, if I need to modify both InitialiseSettings and InitialiseSettings-labs, should that be in two patches or one? 
[20:30:15] (03Merged) 10jenkins-bot: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:30:17] (03Merged) 10jenkins-bot: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [20:30:19] kostajh: feel free to do it in a single patch [20:31:16] Jdlrobson: syncing the scripts [20:32:23] (03PS1) 10Kosta Harlan: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240) [20:32:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [20:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [20:34:00] urbanecm: added to the calendar [20:34:34] thanks, let me see [20:34:37] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/WikimediaMaintenance/: ebfc516: Add script to update vector skin preferences (T299104) (duration: 00m 51s) [20:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:41] T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104 [20:34:43] thanks urbanecm [20:35:28] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaMaintenance/: 9ba157b: Add insert option for update skin preferences script (T299104) (duration: 00m 50s) [20:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:35:39] Jdlrobson: should be live [20:36:41] (03CR) 10Krinkle: [C: 03+1] wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: 10BryanDavis) [20:36:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [20:37:19] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [20:37:38] kostajh: since it is a no-op at prod, do you want to do a mwdebug test? [20:37:55] thanks urbanecm [20:38:00] (03Merged) 10jenkins-bot: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [20:38:27] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [20:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:51] urbanecm: no, it can just be synced IMO [20:38:56] okay [20:40:20] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [no-op] 8efa537: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion (T303240) (duration: 00m 53s) [20:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:24] T303240: Welcome emails: opt-in checkbox - https://phabricator.wikimedia.org/T303240 [20:40:29] kostajh: done [20:40:32] anything else, anyone? [20:40:44] Krinkle: Can I nerd snipe you into volunteering to walk those DynamicSidebar removal patches through merge and deploy? [20:41:14] urbanecm: thank you! 
[20:41:23] happy to help [20:41:42] (03PS1) 10Jbond: puppet_compiler: fix facts processing script [puppet] - 10https://gerrit.wikimedia.org/r/771453 [20:42:06] urbanecm: I'd expect to see the checkbox field on https://es.wikipedia.beta.wmflabs.org/wiki/Especial:Encuesta_de_bienvenida, though [20:42:30] https://es.wikipedia.beta.wmflabs.org/wiki/Especial:Versi%C3%B3n says that the supporting code has synced [20:42:55] kostajh: it's not yet synced there. it will take up to 30 minutes [20:43:18] urbanecm: ah, the config patch didn't sync there. I see [20:43:37] (03PS21) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [20:45:08] yup yup [20:45:19] i can't easily change when it gets there [20:45:23] so i suggest waiting [20:46:04] (03PS2) 10Jbond: puppet_compiler: fix facts processing script [puppet] - 10https://gerrit.wikimedia.org/r/771453 [20:48:01] sounds good [20:52:51] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [20:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34375/console" [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [20:56:20] hi all - hope it's ok that we run a maintenance script in a few mins -- updating ~35 rows [20:57:20] (03PS3) 10Jbond: puppet_compiler: fix facts processing script [puppet] - 10https://gerrit.wikimedia.org/r/771453 [20:57:39] updating ~35 rows in hewiki + frwiki [20:58:29] urbanecm: will this interfere with anything you are doing? [20:58:39] cjming: Jdlrobson: go ahead [21:00:49] please long when done, its free! 
0:-D [21:01:59] (03PS2) 10BryanDavis: wikitech: Remove DynamicSidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) [21:02:01] (03PS3) 10BryanDavis: DynamicSidebar: remove from CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) [21:02:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) a:05RobH→03Cmjohnson cookbook sre.hosts.provision fails for dumpsdata1006. Please check its mgmt cable and attempt to rerun. [21:02:05] (03PS2) 10BryanDavis: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771447 [21:02:07] (03PS2) 10BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771448 [21:04:15] (03PS2) 10Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [21:05:20] (03PS3) 10Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [21:05:34] (03PS1) 10Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) [21:05:57] (03CR) 10Razzi: "Thanks for the input everybody, especially Volans for the many improvement suggestions." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [21:06:40] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:02] (03CR) 10Jbond: P:scap::dsh: Add scap targets as a dsh group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [21:07:06] (03PS1) 10Zabe: wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674) [21:07:37] (03PS3) 10Jbond: wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) [21:07:49] (03PS4) 10Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [21:08:02] (03CR) 10Jbond: wmflib: add class_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [21:09:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34376/console" [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [21:10:05] (03PS2) 10Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) [21:12:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34377/console" [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [21:14:50] (03PS5) 10Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - 
10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [21:15:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34378/console" [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [21:17:07] !log end running skin update preference maintenance script [21:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:38] (03PS6) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [21:33:41] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34379/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [21:39:34] (03CR) 10Volans: [C: 03+1] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/771453 (owner: 10Jbond) [21:42:14] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [21:50:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [21:57:58] (03PS2) 10Zabe: Migrate wmfDbconfigFromEtcd to wmgDbconfigFromEtcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) [22:01:56] (03PS1) 10Zabe: Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) [22:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [22:07:15] (03PS1) 10Jdlrobson: Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104) [22:08:04] (03CR) 10Clare Ming: [C: 03+1] Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [22:09:10] (03Abandoned) 10Jbond: varnish: rate limit http://intake-analytics.wm.o/ [puppet] - 10https://gerrit.wikimedia.org/r/768028 (owner: 10Jbond) [22:09:46] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:08] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: timed_out: False, active_shards: 290, active_shards_percent_as_number: 98.97610921501706, number_of_data_nodes: 2, number_of_nodes: 2, active_primary_shards: 163, delayed_unassigned_shards: 0, unassigned_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, initializing_shards: 
2, number_of_in_f [22:10:08] tch: 0, number_of_pending_tasks: 0, status: yellow, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:10:12] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: active_shards_percent_as_number: 98.97610921501706, initializing_shards: 2, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, status: yellow, active_primary_shards: 163, relocating_shards: 0, active_shards: 290, number_of_data_nodes: [22:10:12] ter_name: relforge-eqiad, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [22:12:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [22:24:23] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10cmooney) Worth noting that we are planning in the short term to adjust t... 
[22:33:28] (03PS1) 10Samtar: Throttle: Increase limit for English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016) [22:34:35] (03PS2) 10Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [22:35:03] (03PS1) 10Jdlrobson: Fix updateUserLinksDropdownItems not being called [skins/Vector] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002) [22:37:32] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [22:39:19] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016) (owner: 10Samtar) [22:45:59] (03PS3) 10Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [22:47:48] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:59] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10Jclark-ctr) Schedule adding drives tomorrow 3/17/2022 4pm utc [22:56:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:31] (03CR) 10Ebernhardson: elasticsearch: remove custom restart handling (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [23:16:00] 10SRE, 10Traffic: Wikimedia domains unreachable (16 
Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7783673, @Aklapper wrote: >> You do not have permission to view this object. > Sorry, should work now. Thanks, https://phabricator.wikimedia.org/P22736 [23:26:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:26:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:52:52] !log Removing two files for legal compliance [23:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log