[00:03:14] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 17.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:03:22] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:03:43] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[00:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:51] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[00:04:54] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 33.65 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:07:46] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 87.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:08:52] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
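Note on the Varnish traffic-drop alerts above: the check output compares a value against two thresholds using "le" (less than or equal), as in `17.36 le 60` (CRITICAL) and `(C)60 le (W)70 le 87.74` (OK). A minimal Python sketch of that comparison follows. The function name `classify` is hypothetical, and reading the value as a percentage of the earlier traffic level is an assumption, not something the alert text states.

```python
def classify(value: float, critical: float = 60.0, warning: float = 70.0) -> str:
    """Map a reported value to an Icinga-style state.

    Thresholds default to the (C)60 and (W)70 shown in the alert output.
    A value at or below a threshold trips that state ("le" comparison).
    """
    if value <= critical:
        return "CRITICAL"
    if value <= warning:
        return "WARNING"
    return "OK"


# Values taken from the alert lines in this log.
for site, value in [("eqsin", 17.36), ("esams", 33.65), ("esams", 87.74)]:
    print(f"{site}: {value} -> {classify(value)}")
```

Under these assumptions the first two values fall at or below the critical threshold, and 87.74 clears both thresholds, which matches the PROBLEM and RECOVERY states logged above.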
[00:08:54] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:11] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[00:10:45] <wikibugs>	 (CR) Jdlrobson: [C: +1] "To be backported tomorrow and run." [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[00:12:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS buster
[00:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:15] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster
[00:19:20] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:20:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:21:06] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:22:58] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Papaul) I was able to pxe boot with 1024 but got `Failed to load ldlinux.c32`
[00:33:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage
[00:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:46] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6011.drmrs.wmnet with reason: host reimage
[00:36:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire   - https://alerts.wikimedia.org
[00:44:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:58:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https
[00:58:41] <icinga-wm>	 ech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:28:24] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS buster
[01:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:34] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**)   -...
[01:29:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[01:29:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[01:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:44] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[01:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:08] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[01:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:16] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[01:37:51] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye
[01:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:59] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye
[01:43:20] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1026.eqiad.wmnet with OS bullseye
[01:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:28] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye executed...
[01:44:20] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[01:54:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[02:00:50] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[02:08:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22637 and previous config saved to /var/cache/conftool/dbconfig/20220316-020831-marostegui.json
[02:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:36] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[02:23:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22638 and previous config saved to /var/cache/conftool/dbconfig/20220316-022336-marostegui.json
[02:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:02] <icinga-wm>	 PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100%
[02:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22639 and previous config saved to /var/cache/conftool/dbconfig/20220316-023842-marostegui.json
[02:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22640 and previous config saved to /var/cache/conftool/dbconfig/20220316-025347-marostegui.json
[02:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:52] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[03:02:24] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:22] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:33] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire   - https://alerts.wikimedia.org
[05:01:43] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.106`. Pre-deploy tests passing on canary `wdqs1003`
[05:01:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:41] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@38de611]: 0.3.106
[05:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:13] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.106` on canary `wdqs1003`; proceeding to rest of fleet
[05:03:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:17] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@38de611]: 0.3.106 (duration: 06m 36s)
[05:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:10] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[05:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:13] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[05:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:27] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[05:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
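Note on the restart commands above: the `-b 4` and `-b 1` options are described in the log messages as "4 hosts at a time" and "one node at a time", with each LVS-managed host going through depool, wait, restart, wait, pool. A minimal Python sketch of that batching pattern follows. It is not Cumin's implementation: the helper names `batches` and `rolling_restart` and the host names are hypothetical, steps are recorded rather than executed, and hosts within a batch are walked sequentially here for simplicity.

```python
from typing import Iterator, List, Sequence, Tuple


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_restart(
    hosts: Sequence[str], batch_size: int, steps: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return the (host, step) order for a batched rolling restart.

    A batch only starts after the previous batch has finished every step,
    so at most `batch_size` hosts are out of service at any time.
    """
    order: List[Tuple[str, str]] = []
    for batch in batches(hosts, batch_size):
        for host in batch:
            for step in steps:
                order.append((host, step))
    return order


# Hypothetical host names; the step names mirror the logged command.
plan = rolling_restart(
    ["host-a", "host-b"], 1, ["depool", "sleep 45", "restart", "sleep 45", "pool"]
)
for host, step in plan:
    print(host, step)
```

With a batch size of 1, every step for the first host completes before the second host is depooled, which is the property the one-node-at-a-time restart relies on.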
[05:11:45] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@38de611] (wcqs): Deploy 0.3.106 to WCQS
[05:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:06] <ryankemper>	 !log [WCQS Deploy] Tests look good following deploy of `0.3.106` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet
[05:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:38] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@38de611] (wcqs): Deploy 0.3.106 to WCQS (duration: 01m 53s)
[05:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:53] <ryankemper>	 !log [WCQS Deploy] Test query passed on commons-query.wikimedia.org ; WCQS deploy complete
[05:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1068.eqiad.wmnet with OS stretch
[05:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:03] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch executed with errors:...
[05:36:35] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[05:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:20] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[05:57:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[05:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22641 and previous config saved to /var/cache/conftool/dbconfig/20220316-055805-marostegui.json
[05:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:09] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[05:58:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[05:58:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[05:58:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22642 and previous config saved to /var/cache/conftool/dbconfig/20220316-055903-marostegui.json
[05:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:07] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[05:59:30] <wikibugs>	 (PS1) Muehlenhoff: Fix typo in role name [puppet] - https://gerrit.wikimedia.org/r/771243
[06:00:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[06:00:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[06:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22643 and previous config saved to /var/cache/conftool/dbconfig/20220316-060008-marostegui.json
[06:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:12] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[06:03:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix typo in role name [puppet] - https://gerrit.wikimedia.org/r/771243 (owner: Muehlenhoff)
[06:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[06:08:00] <wikibugs>	 (CR) Marostegui: auto_schema: Add abaility to skip replicas (1 comment) [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup)
[06:08:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:43] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) Thanks for working on this @Ladsgro...
[06:33:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22644 and previous config saved to /var/cache/conftool/dbconfig/20220316-063344-marostegui.json
[06:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:49] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[06:44:23] <elukey>	 qchris: o/ thanks for the istio repo!
[06:48:24] <wikibugs>	 SRE, Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (RhinosF1)
[06:48:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22646 and previous config saved to /var/cache/conftool/dbconfig/20220316-064849-marostegui.json
[06:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:24] <icinga-wm>	 PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100%
[06:49:28] <icinga-wm>	 PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100%
[06:50:44] <icinga-wm>	 PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100%
[06:52:06] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:55:35] <wikibugs>	 (PS1) Marostegui: switchover-tmpl.sh: Add orchestrator tag notes [software] - https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869)
[06:59:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22647 and previous config saved to /var/cache/conftool/dbconfig/20220316-065918-marostegui.json
[06:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:23] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[07:00:05] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] <urbanecm>	 'morning
[07:00:33] <urbanecm>	 i can deploy kart_ (unless you want to self-service?)
[07:00:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3312', diff saved to https://phabricator.wikimedia.org/P22648 and previous config saved to /var/cache/conftool/dbconfig/20220316-070033-marostegui.json
[07:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:51] <kart_>	 urbanecm: Thanks. Please go ahead :)
[07:01:25] <kart_>	 urbanecm: especially the new table creation on testwiki. I don't recall whether I've done it before, or maybe it was too long ago :)
[07:01:41] <urbanecm>	 kart_: i do recall doing it for you in the past :D
[07:01:48] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:16] <urbanecm>	 the tables should be only on testwiki now?
[07:02:31] <wikibugs>	 (PS2) Urbanecm: Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry)
[07:02:35] <wikibugs>	 (CR) Urbanecm: [C: +2] Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry)
[07:02:41] <kart_>	 urbanecm: yes, as wmf.26 is yet to be deployed on Group1 and 2.
[07:02:55] <urbanecm>	 kart_: well we can create the table everywhere now if that's the goal
[07:03:17] <wikibugs>	 (Merged) jenkins-bot: Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) (owner: KartikMistry)
[07:03:17] <kart_>	 urbanecm: let's wait. We need to do some testing on testwiki too.
[07:03:21] <urbanecm>	 in my understanding, it's usually better to do that (even if it stays empty on many wikis), as it's easier to keep track of which tables exist where that way
[07:03:50] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:03:50] <kart_>	 urbanecm: oh, if that's possible - it needs to be created on the x1 cluster.
[07:03:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22649 and previous config saved to /var/cache/conftool/dbconfig/20220316-070354-marostegui.json
[07:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22650 and previous config saved to /var/cache/conftool/dbconfig/20220316-070452-marostegui.json
[07:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:56] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[07:05:01] <urbanecm>	 kart_: in the meanwhile, pulled the config patch to mwdebug1001. please test.
[07:05:40] <kart_>	 sure. Testing.
[07:06:52] <urbanecm>	 kart_: so, you want me to create the tables where exactly? in wikishared on x1? in the per-wiki DB for testwiki on x1? in testwiki's main database? a combination of those?
[07:06:53] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1158 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T303910 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:06:58] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (ops-monitoring-bot)
[07:07:26] <kart_>	 urbanecm: OK. Works. Shows expected msg.
[07:07:30] <urbanecm>	 great, syncing
[07:07:41] <urbanecm>	 https://phabricator.wikimedia.org/T302371#7756524 looks to say testwiki's main database and wikishared, but I'd like to confirm that before i do it, as table creation is hard to undo
[07:07:47] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (elukey) @cmooney hi! I tried on ms-be1068 and the arp cache looks broken, lldpi shows me that lsw1-e1-eqiad is the top of rack, maybe the same that happened...
[07:08:00] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) p:Triage→Medium The RAID is indeed degraded: ` Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name                : RAID Level          : Primary-1, Secondary-0, RAID...
[07:08:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:08:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:56] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 455895168ab266813ae499e8fc353c66e6d5b450: Disable ContentTranslation for non-extended confirmed users on viwiki (T299636) (duration: 00m 51s)
[07:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:00] <stashbot>	 T299636: Disable ContentTranslation for non-extended confirmed users on viwiki - https://phabricator.wikimedia.org/T299636
[07:09:04] <icinga-wm>	 PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:11] <urbanecm>	 kart_: config patch live. waiting for your answer re table creation before i proceed with that :)
[07:10:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:10:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:20] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (Marostegui) a:Cmjohnson Disk #2 is gone: ` root@db1158:~# megacli -PDList -aALL | grep Slot Slot Number: 0 Slot Number: 1 Slot Number: 3 Slot Number: 4 Slot Number: 5 Slot Number: 6 Slot Number: 7 Slot Numbe...
[07:10:57] <kart_>	 urbanecm: Let's do that only for testwiki now; for all wikis with wmf.26, I'll schedule it on Monday. testwiki and the other Wikipedias use different DBs for CX (s3 vs. x1).
[07:10:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:31] <kart_>	 urbanecm: Please log command also, so I'll remember that :) 
[07:11:35] <urbanecm>	 kart_: sounds good to me. last confirmation: I create the tables, using the SQL files specified in T302371's description, in testwiki's s3 DB only
[07:11:36] <stashbot>	 T302371: Create new tables: cx_significant_edits and cx_section_translation - https://phabricator.wikimedia.org/T302371
[07:11:53] <kart_>	 urbanecm: Yes. Confirmed.
[07:11:56] <urbanecm>	 doing
[07:15:22] <urbanecm>	 !log Create `testwiki.cx_significant_edits` and `testwiki.cx_section_translation` at s3 (T302371; `mwscript sql.php --wiki=testwiki /srv/mediawiki-staging/php-1.38.0-wmf.26/extensions/ContentTranslation/sql/{section-translations,significant-edits}.sql`)
[07:15:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:48] <urbanecm>	 kart_: should be done now, see https://www.irccloud.com/pastebin/yUotxgfS/
[07:16:10] <kart_>	 urbanecm: looks good!
[07:16:22] <urbanecm>	 kart_: I'm not sure how useful the command is, though. For x1, it'll look different
[07:17:00] <kart_>	 urbanecm: No problem. Let's do that on Monday :)
[07:17:21] <urbanecm>	 sounds good :)
[07:17:25] <urbanecm>	 anything else i can do for you today?
[07:17:31] <kart_>	 urbanecm: Thanks a lot :)
[07:17:37] <kart_>	 urbanecm: Done for now :)
[07:17:44] <urbanecm>	 okay! see you later then
[07:18:04] <urbanecm>	 !log UTC morning B&C window done
[07:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298557)', diff saved to https://phabricator.wikimedia.org/P22651 and previous config saved to /var/cache/conftool/dbconfig/20220316-071859-marostegui.json
[07:19:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:04] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[07:19:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:19:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[07:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[07:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[07:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22652 and previous config saved to /var/cache/conftool/dbconfig/20220316-071957-marostegui.json
[07:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:16] <Nikerabbit>	 what happened to metawiki at UTC midnight today?
[07:27:52] <urbanecm>	 Nikerabbit: can you be a bit more specific?
[07:28:27] <Nikerabbit>	 urbanecm: check Language-Team dashboard in Logstash for past 12 hours
[07:28:35] <urbanecm>	 looking
[07:29:15] <wikibugs>	 (CR) Elukey: [C: +2] Set simpler partman recipe for kubernetes200[5,6] [puppet] - https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:34:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] admin: add releng to docker group on deployment [puppet] - https://gerrit.wikimedia.org/r/770976 (https://phabricator.wikimedia.org/T303450) (owner: Giuseppe Lavagetto)
[07:35:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22653 and previous config saved to /var/cache/conftool/dbconfig/20220316-073502-marostegui.json
[07:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:22] <wikibugs>	 (CR) Ladsgroup: "Ping" [puppet] - https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: Ladsgroup)
[07:45:15] <wikibugs>	 (CR) Ladsgroup: "Ping. We have had another case of this last week. It was auto_schema otherwise I would have killed the dump." [dumps] - https://gerrit.wikimedia.org/r/767477 (https://phabricator.wikimedia.org/T138208) (owner: Ladsgroup)
[07:49:11] <Amir1>	 !log dbmaint on master of s4@eqiad (T298743)
[07:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:15] <stashbot>	 T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743
[07:50:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22654 and previous config saved to /var/cache/conftool/dbconfig/20220316-075007-marostegui.json
[07:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:12] <stashbot>	 T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[07:51:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[07:51:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:51:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:56] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:52:16] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:52:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:52:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22655 and previous config saved to /var/cache/conftool/dbconfig/20220316-075248-marostegui.json
[07:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:52] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[07:54:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22656 and previous config saved to /var/cache/conftool/dbconfig/20220316-075448-marostegui.json
[07:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:54:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22657 and previous config saved to /var/cache/conftool/dbconfig/20220316-075502-marostegui.json
[07:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:06] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[07:56:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22658 and previous config saved to /var/cache/conftool/dbconfig/20220316-075612-marostegui.json
[07:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:25] <wikibugs>	 (PS2) Ladsgroup: auto_schema: Add ability to skip replicas [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779)
[08:00:09] <wikibugs>	 (CR) Ladsgroup: [C: +1] switchover-tmpl.sh: Add orchestrator tag notes [software] - https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869) (owner: Marostegui)
[08:00:33] <Amir1>	 jouncebot: nowandnext
[08:00:33] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 59 minute(s)
[08:00:34] <jouncebot>	 In 4 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1300)
[08:00:40] <Amir1>	 noice
[08:00:57] <wikibugs>	 SRE, SRE-Access-Requests, Release-Engineering-Team, serviceops, Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (Joe) Open→Resolved
[08:02:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto)
[08:02:52] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: Add ability to skip replicas [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup)
[08:03:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Add ability to skip replicas [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup)
[08:03:35] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Add ability to skip replicas [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup)
[08:04:29] <wikibugs>	 (CR) Marostegui: [C: +2] switchover-tmpl.sh: Add orchestrator tag notes [software] - https://gerrit.wikimedia.org/r/771257 (https://phabricator.wikimedia.org/T266869) (owner: Marostegui)
[08:07:54] <wikibugs>	 SRE-OnFire, DBA, Platform Engineering, Performance-Team (Radar), and 2 others: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (Marostegui)
[08:08:07] <wikibugs>	 SRE-OnFire, Data-Persistence (Consultation), Platform Engineering, Performance-Team (Radar), and 2 others: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (M...
[08:08:47] <wikibugs>	 (PS2) Ladsgroup: Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418)
[08:09:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22659 and previous config saved to /var/cache/conftool/dbconfig/20220316-080953-marostegui.json
[08:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418) (owner: Ladsgroup)
[08:10:47] <wikibugs>	 (Merged) jenkins-bot: Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418) (owner: Ladsgroup)
[08:10:50] <icinga-wm>	 RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22660 and previous config saved to /var/cache/conftool/dbconfig/20220316-081117-marostegui.json
[08:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:57] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:770130|Change A/V player to videojs in the first batch of production wiki (T248418)]] (duration: 00m 49s)
[08:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:00] <stashbot>	 T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418
[08:12:24] <Amir1>	 marostegui: heads up, this change of a/v player in wikis will lead to ParserCache fragmentation, we tried to avoid it as much as possible but lmk if you see any issues
[08:12:32] <marostegui>	 wilco
[08:13:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[08:14:42] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4028 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:14:45] <wikibugs>	 (PS1) Elukey: install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744)
[08:16:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:48] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4028 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:17:30] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish: add ACLs even if empty [puppet] - https://gerrit.wikimedia.org/r/771320
[08:17:56] <wikibugs>	 (PS2) Elukey: install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744)
[08:18:02] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] varnish: add ACLs even if empty [puppet] - https://gerrit.wikimedia.org/r/771320 (owner: Giuseppe Lavagetto)
[08:20:08] <wikibugs>	 (CR) Elukey: [C: +2] install_server: improve the kubernetes-node-virtual-overlay recipe [puppet] - https://gerrit.wikimedia.org/r/771319 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:21:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:21:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:24:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22661 and previous config saved to /var/cache/conftool/dbconfig/20220316-082458-marostegui.json
[08:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22662 and previous config saved to /var/cache/conftool/dbconfig/20220316-082622-marostegui.json
[08:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:21] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4021 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:27:22] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4021 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:30:26] <RhinosF1>	 _joe_: ^ looks like your patch
[08:30:52] <_joe_>	 RhinosF1: it's a temporary problem with icinga yes
[08:31:01] <_joe_>	 I think
[08:31:12] <_joe_>	 let me try to run puppet on the alert host
[08:31:54] <_joe_>	 basically I removed that file and it's ok, it was not used directly
[08:32:08] <_joe_>	 but it should also not be checked anymore
[08:32:18] <RhinosF1>	 Makes sense
[08:32:40] <_joe_>	 uhhh no I think I know what the problem is
[08:32:56] <_joe_>	 some resources are not properly absented via confd::file I guess
[08:33:53] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6002 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:33:53] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6003 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:33:54] <_joe_>	 so yes, it needs icinga to run puppet
[08:34:01] <_joe_>	 so it will happen on more servers :/
[08:35:27] <hashar>	 !log Restarting CI Jenkins
[08:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:33] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6006 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:35:35] <_joe_>	 but these are not actual issues
[08:35:46] <_joe_>	 uh wait
[08:36:25] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2039 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:36:39] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp5012 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:36:39] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5006 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:36:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5015 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:37:08] <_joe_>	 ok not sure about these reload fails, stopping puppet on all cp servers
[08:37:29] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3058 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:37:33] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:37:33] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6009 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:38:11] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp1077 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:38:17] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2037 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:38:23] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5003 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:38:42] <_joe_>	 I can only run puppet on the alert server to make these errors go away
[08:38:44] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Volans) >>! In T303776#7780384, @Papaul wrote: > ` > Failed to load ldlinux.c32 > `  At first sight this might be an occurrence of this issue: htt...
[08:38:51] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:39:23] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3060 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:39:41] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp6004 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:39:55] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6004 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:40:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298294)', diff saved to https://phabricator.wikimedia.org/P22663 and previous config saved to /var/cache/conftool/dbconfig/20220316-084003-marostegui.json
[08:40:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:40:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:08] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22664 and previous config saved to /var/cache/conftool/dbconfig/20220316-084011-marostegui.json
[08:40:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22665 and previous config saved to /var/cache/conftool/dbconfig/20220316-084127-marostegui.json
[08:41:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:41:31] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[08:41:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:33] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire   - https://alerts.wikimedia.org
[08:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[08:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22666 and previous config saved to /var/cache/conftool/dbconfig/20220316-084140-marostegui.json
[08:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22667 and previous config saved to /var/cache/conftool/dbconfig/20220316-084219-marostegui.json
[08:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2037 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:44:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6002 is CRITICAL: reload-vcl failed to run since 0h, 14 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:44:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:47:00] <wikibugs>	 (PS1) Elukey: install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - https://gerrit.wikimedia.org/r/771321
[08:47:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2034 is CRITICAL: reload-vcl failed to run since 0h, 16 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:47:35] <_joe_>	 please ignore those vcl based reload alerts, I'm not even sure why they're happening
[08:47:43] <_joe_>	 I'm going to clean them up soon
[08:47:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 10 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:48:46] <wikibugs>	 SRE, Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (Peachey88) Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36/ | private paste ]]?   For more information...
[08:50:00] <wikibugs>	 (PS2) Elukey: install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - https://gerrit.wikimedia.org/r/771321
[08:50:36] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:51:28] <wikibugs>	 (CR) Elukey: [C: +2] install_server: try a simpler version of kubernetes-node-virtual-overlay [puppet] - https://gerrit.wikimedia.org/r/771321 (owner: Elukey)
[08:52:22] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[08:52:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:11] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[08:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:15] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5003 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:55:15] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5006 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:55:51] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4034 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:56:19] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6003 is OK: reload-vcl successfully ran 0h, 4 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:56:19] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 4 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:56:33] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3056 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[08:56:41] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3063 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:56:49] <_joe_>	 again sorry for the noise, please disregard
[08:57:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22668 and previous config saved to /var/cache/conftool/dbconfig/20220316-085724-marostegui.json
[08:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:11] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4028 is OK: reload-vcl successfully ran 0h, 6 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[08:58:19] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4022 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[08:58:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:58:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:58:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:25] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2034 is OK: reload-vcl successfully ran 0h, 7 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:00:20] <wikibugs>	 (PS1) Elukey: install_server: add more options to kubernetes-node-virtual-overlay [puppet] - https://gerrit.wikimedia.org/r/771323
[09:00:27] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5013 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:01:19] <wikibugs>	 (CR) Elukey: [C: +2] install_server: add more options to kubernetes-node-virtual-overlay [puppet] - https://gerrit.wikimedia.org/r/771323 (owner: Elukey)
[09:02:59] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3058 is OK: reload-vcl successfully ran 0h, 11 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:04:11] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3050 is CRITICAL: reload-vcl failed to run since 0h, 10 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:04:13] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:04:21] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4021 is OK: reload-vcl successfully ran 0h, 12 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:05:17] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5015 is OK: reload-vcl successfully ran 0h, 13 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:09:17] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2040 is CRITICAL: reload-vcl failed to run since 0h, 17 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:09:20] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (Marostegui) I have done quite a bunch of testing and so far I have not been able to reproduce the crashes when doing 10.4...
[09:09:28] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[09:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:53] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3056 is CRITICAL: reload-vcl failed to run since 0h, 15 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:10:15] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2028 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:11:30] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2028 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:12:14] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22669 and previous config saved to /var/cache/conftool/dbconfig/20220316-091229-marostegui.json
[09:12:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4034 is CRITICAL: reload-vcl failed to run since 0h, 19 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:12:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4036 is CRITICAL: reload-vcl failed to run since 0h, 19 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:12:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:34] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5014 is CRITICAL: reload-vcl failed to run since 0h, 13 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:12:52] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3065 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:13:46] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4026 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:13:48] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp3054 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:15:23] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3055 is CRITICAL: reload-vcl failed to run since 0h, 16 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:15:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 T303498', diff saved to https://phabricator.wikimedia.org/P22670 and previous config saved to /var/cache/conftool/dbconfig/20220316-091533-marostegui.json
[09:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:38] <stashbot>	 T303498: Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498
[09:15:57] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6004 is OK: reload-vcl successfully ran 0h, 24 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:16:15] <moritzm>	 !log revert mx1001/mx2001 to the Bullseye version of Exim T303738
[09:16:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:29] <wikibugs>	 (PS1) Marostegui: db1099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/771324 (https://phabricator.wikimedia.org/T303498)
[09:17:01] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp2033 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:17:09] <icinga-wm>	 PROBLEM - Confd template for /var/netmapper/abuse_networks.json on cp4033 is CRITICAL: File not found: /var/netmapper/abuse_networks.json https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:17:09] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3065 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:17:46] <wikibugs>	 (CR) Marostegui: [C: +2] db1099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/771324 (https://phabricator.wikimedia.org/T303498) (owner: Marostegui)
[09:18:21] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3059 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:18:21] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1080 is CRITICAL: reload-vcl failed to run since 0h, 22 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:18:39] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4035 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:18:41] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4033 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:19:05] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5008 is CRITICAL: reload-vcl failed to run since 0h, 21 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:19:05] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 27 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:19:29] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4025 is CRITICAL: reload-vcl failed to run since 0h, 21 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:19:33] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1086 is CRITICAL: reload-vcl failed to run since 0h, 26 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:20:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22671 and previous config saved to /var/cache/conftool/dbconfig/20220316-092004-marostegui.json
[09:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:10] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[09:20:35] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1081 is CRITICAL: reload-vcl failed to run since 0h, 8 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:20:59] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3051 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:21:01] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6002 is OK: reload-vcl successfully ran 0h, 29 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:21:23] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2038 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:21:23] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3052 is CRITICAL: reload-vcl failed to run since 0h, 25 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:21:49] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4026 is CRITICAL: reload-vcl failed to run since 0h, 11 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:21:55] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5007 is CRITICAL: reload-vcl failed to run since 0h, 9 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:23:23] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3061 is CRITICAL: reload-vcl failed to run since 0h, 30 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:24:11] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2037 is OK: reload-vcl successfully ran 0h, 32 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:13] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4035 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:13] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3065 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:15] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3051 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:17] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4033 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:38] <wikibugs>	 (PS3) DCausse: Replace Swift native API with S3 [deployment-charts] - https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: ZPapierski)
[09:25:41] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:49] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:51] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3052 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:25:53] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:19] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1080 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:21] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3059 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:21] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1081 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:23] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4025 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:31] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4026 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:33] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1086 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:43] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:26:43] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3061 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:27:11] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3055 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:27:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22672 and previous config saved to /var/cache/conftool/dbconfig/20220316-092735-marostegui.json
[09:27:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:27:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:39] <wikibugs>	 (CR) DCausse: [C: +1] "Must be deployed with care and I think a safe approach is to simply completely delete the deployment and the corresponding data on swift (" [deployment-charts] - https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: ZPapierski)
[09:27:40] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22673 and previous config saved to /var/cache/conftool/dbconfig/20220316-092742-marostegui.json
[09:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:59] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3050 is OK: reload-vcl successfully ran 0h, 2 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:29:25] <icinga-wm>	 PROBLEM - TFTP service on install1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[09:29:29] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3063 is OK: reload-vcl successfully ran 0h, 3 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:29:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22674 and previous config saved to /var/cache/conftool/dbconfig/20220316-092947-marostegui.json
[09:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:03] <icinga-wm>	 PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_varnish-frontend-hospital.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:13] <wikibugs>	 (PS1) Elukey: install_server: add missing 'echo' for kubernetes vms in netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/771325
[09:33:19] <icinga-wm>	 RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[09:34:06] <wikibugs>	 (CR) Ladsgroup: "don't merge it, I need to review it 😄" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[09:34:08] <wikibugs>	 (PS1) Vgutierrez: aptrepo:update-keys: Refresh gitlab key [puppet] - https://gerrit.wikimedia.org/r/771326
[09:35:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22675 and previous config saved to /var/cache/conftool/dbconfig/20220316-093509-marostegui.json
[09:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:29] <wikibugs>	 (CR) Elukey: [C: +2] install_server: add missing 'echo' for kubernetes vms in netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/771325 (owner: Elukey)
[09:36:05] <dcausse>	 !log T293862: manually restarted blazegraph on wdqs1010 with "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake"
[09:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:09] <stashbot>	 T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862
[09:36:13] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2028 is OK: reload-vcl successfully ran 0h, 10 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:36:36] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo) > @jcrespo Did you test the POC I ment...
[09:38:37] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3056 is OK: reload-vcl successfully ran 0h, 12 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:39:35] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 13 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:39:49] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/771326 (owner: Vgutierrez)
[09:40:53] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4034 is OK: reload-vcl successfully ran 0h, 15 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:41:39] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4022 is OK: reload-vcl successfully ran 0h, 15 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:42:18] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/771326 (owner: Vgutierrez)
[09:42:23] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2040 is OK: reload-vcl successfully ran 0h, 16 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:42:59] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4036 is OK: reload-vcl successfully ran 0h, 17 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:43:01] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5014 is OK: reload-vcl successfully ran 0h, 17 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:44:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22676 and previous config saved to /var/cache/conftool/dbconfig/20220316-094452-marostegui.json
[09:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:10] <wikibugs>	 (CR) Vgutierrez: [C: +2] aptrepo:update-keys: Refresh gitlab key [puppet] - https://gerrit.wikimedia.org/r/771326 (owner: Vgutierrez)
[09:46:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[09:46:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[09:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:29] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) >>! In T281249#7780918, @jcrespo wr...
[09:50:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22677 and previous config saved to /var/cache/conftool/dbconfig/20220316-095014-marostegui.json
[09:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:31] <wikibugs>	 (PS1) Marostegui: db1149: Remove candidate master [puppet] - https://gerrit.wikimedia.org/r/771329 (https://phabricator.wikimedia.org/T266869)
[09:55:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS buster
[09:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:15] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS buster executed with errors: -...
[09:55:53] <wikibugs>	 (CR) Marostegui: [C: +2] db1149: Remove candidate master [puppet] - https://gerrit.wikimedia.org/r/771329 (https://phabricator.wikimedia.org/T266869) (owner: Marostegui)
[09:55:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1070.eqiad.wmnet with OS stretch
[09:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:58] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch executed with errors:...
[09:56:28] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:56:29] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS stretch
[09:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:33] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch executed with errors:...
[09:59:30] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo) > When we migrated to dbctl, we lost t...
[09:59:54] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) Fixed dbctl notes for s4. Checked a...
[09:59:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22678 and previous config saved to /var/cache/conftool/dbconfig/20220316-095957-marostegui.json
[10:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:55] <moritzm>	 !log installing openssl security updates
[10:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:25] <vgutierrez>	 !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia
[10:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T297189)', diff saved to https://phabricator.wikimedia.org/P22679 and previous config saved to /var/cache/conftool/dbconfig/20220316-100519-marostegui.json
[10:05:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:05:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:23] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[10:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22680 and previous config saved to /var/cache/conftool/dbconfig/20220316-100527-marostegui.json
[10:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:37] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (cmooney) @elukey thanks for the heads up.  Yes this is very worrying, we have the same thing on, for instance ms-be1069, which is connected to lsw1-e2-eqiad....
[10:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[10:06:46] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:13:14] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:15:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22681 and previous config saved to /var/cache/conftool/dbconfig/20220316-101502-marostegui.json
[10:15:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:15:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:08] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[10:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:17] <vgutierrez>	 !log rolling restart of ats-tls and ats-backend to catch up on OpenSSL updates
[10:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[10:16:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[10:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 10 hosts with reason: Maintenance
[10:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 10 hosts with reason: Maintenance
[10:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:17:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22682 and previous config saved to /var/cache/conftool/dbconfig/20220316-101729-marostegui.json
[10:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22683 and previous config saved to /var/cache/conftool/dbconfig/20220316-101848-marostegui.json
[10:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:58] <wikibugs>	 (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (owner: Awight)
[10:28:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[10:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:22] <icinga-wm>	 RECOVERY - traffic_server backend process restarted on cp3051 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3051&var-layer=backend
[10:29:59] <wikibugs>	 (PS1) Muehlenhoff: Remove access for ppechelko [puppet] - https://gerrit.wikimedia.org/r/771332
[10:30:20] <wikibugs>	 (PS8) Jbond: C:java: Refactor java code to work with cloud [puppet] - https://gerrit.wikimedia.org/r/770930
[10:31:30] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34340/console" [puppet] - https://gerrit.wikimedia.org/r/770930 (owner: Jbond)
[10:33:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for ppechelko [puppet] - https://gerrit.wikimedia.org/r/771332 (owner: Muehlenhoff)
[10:33:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22684 and previous config saved to /var/cache/conftool/dbconfig/20220316-103353-marostegui.json
[10:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[10:40:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[10:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:26] <wikibugs>	 (CR) WMDE-Fisch: [C: +1] Deploy template features to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (owner: Awight)
[10:42:59] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[10:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:34] <wikibugs>	 (PS1) Muehlenhoff: Remove access for accraze [puppet] - https://gerrit.wikimedia.org/r/771333
[10:44:09] <wikibugs>	 (PS1) Marostegui: switchover-tmpl.sh: Remove communication related steps [software] - https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605)
[10:46:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for accraze [puppet] - https://gerrit.wikimedia.org/r/771333 (owner: Muehlenhoff)
[10:46:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "Alert itself LGTM, though the alerting file will need to be deployed as a global rule (i.e. thanos)" [alerts] - https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[10:46:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22685 and previous config saved to /var/cache/conftool/dbconfig/20220316-104637-marostegui.json
[10:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:42] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[10:48:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22686 and previous config saved to /var/cache/conftool/dbconfig/20220316-104858-marostegui.json
[10:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/770456 (owner: Volans)
[10:50:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [software/swift-ring] - https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: MVernon)
[10:51:09] <wikibugs>	 (CR) Krinkle: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: Ahmon Dancy)
[10:51:53] <wikibugs>	 (CR) Filippo Giunchedi: grafana ldap users sync: enable retries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: Cwhite)
[10:52:11] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ayounsi) `name=initial PXE boot sequence CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032 CLIENT IP: 10.64.20.43  MA...
[10:55:14] <vgutierrez>	 !log rolling upgrade to HAProxy 2.4.15 on cache nodes
[10:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:45] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ayounsi) Is it possible to upgrade PXE? The current version seems quite old: 20150819
[10:58:14] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:58:22] <wikibugs>	 (PS1) Muehlenhoff: Remove twentyafterfour from various access groups [puppet] - https://gerrit.wikimedia.org/r/771337
[10:59:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove twentyafterfour from various access groups [puppet] - https://gerrit.wikimedia.org/r/771337 (owner: Muehlenhoff)
[11:01:21] <wikibugs>	 (CR) Jbond: [C: +1] Adopt the new alerting API on all cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/770456 (owner: Volans)
[11:01:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22687 and previous config saved to /var/cache/conftool/dbconfig/20220316-110142-marostegui.json
[11:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:06] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] clinic-duty: add coverage for work.gcalendarLink() [software] - https://gerrit.wikimedia.org/r/768142 (owner: Krinkle)
[11:03:13] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] clinic-duty: Use Date.parse() and assert.propContains() [software] - https://gerrit.wikimedia.org/r/768141 (owner: Krinkle)
[11:04:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298294)', diff saved to https://phabricator.wikimedia.org/P22688 and previous config saved to /var/cache/conftool/dbconfig/20220316-110403-marostegui.json
[11:04:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:04:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:08] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[11:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22689 and previous config saved to /var/cache/conftool/dbconfig/20220316-110411-marostegui.json
[11:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:07] <wikibugs>	 (CR) Jcrespo: [C: +1] switchover-tmpl.sh: Remove communication related steps [software] - https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[11:08:18] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:35] <wikibugs>	 (CR) Marostegui: [C: +2] switchover-tmpl.sh: Remove communication related steps [software] - https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[11:09:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS buster
[11:09:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:19] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster
[11:09:52] <wikibugs>	 (Merged) jenkins-bot: switchover-tmpl.sh: Remove communication related steps [software] - https://gerrit.wikimedia.org/r/771334 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[11:09:58] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34342/console" [puppet] - https://gerrit.wikimedia.org/r/770930 (owner: Jbond)
[11:13:44] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] C:java: Refactor java code to work with cloud [puppet] - https://gerrit.wikimedia.org/r/770930 (owner: Jbond)
[11:15:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22690 and previous config saved to /var/cache/conftool/dbconfig/20220316-111532-marostegui.json
[11:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:37] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[11:16:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22691 and previous config saved to /var/cache/conftool/dbconfig/20220316-111647-marostegui.json
[11:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:46] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (jbond)
[11:24:54] <wikibugs>	 (PS20) Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397)
[11:25:14] <wikibugs>	 (CR) Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[11:26:56] <wikibugs>	 (CR) Jbond: [C: +2] reposync: dont catch RepoSyncNoChangeError (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/770003 (owner: Jbond)
[11:27:39] <wikibugs>	 (PS1) Ayounsi: DNS: add drmrs dcmap ressources [dns] - https://gerrit.wikimedia.org/r/771342
[11:28:16] <icinga-wm>	 RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:56] <icinga-wm>	 PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:17] <wikibugs>	 (PS2) Awight: Deploy template features to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857)
[11:29:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[11:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:04] <icinga-wm>	 RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 32.63 ms
[11:30:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22692 and previous config saved to /var/cache/conftool/dbconfig/20220316-113037-marostegui.json
[11:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[11:30:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[11:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22693 and previous config saved to /var/cache/conftool/dbconfig/20220316-113057-marostegui.json
[11:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:06] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[11:31:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22694 and previous config saved to /var/cache/conftool/dbconfig/20220316-113152-marostegui.json
[11:31:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[11:31:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[11:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:57] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[11:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22695 and previous config saved to /var/cache/conftool/dbconfig/20220316-113200-marostegui.json
[11:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6012.drmrs.wmnet with reason: host reimage
[11:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:58] <wikibugs>	 (Merged) jenkins-bot: reposync: dont catch RepoSyncNoChangeError [software/spicerack] - https://gerrit.wikimedia.org/r/770003 (owner: Jbond)
[11:34:56] <wikibugs>	 (CR) Emil Chetty: [C: +1] "Im Happy 😊" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx)
[11:40:40] <wikibugs>	 (PS8) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[11:42:53] <wikibugs>	 (PS9) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[11:43:00] <wikibugs>	 (CR) Jbond: "update thanks" [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[11:43:33] <wikibugs>	 (CR) MVernon: [V: +1 C: +2] codfw-prod: rebalance the rings [software/swift-ring] - https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: MVernon)
[11:43:37] <wikibugs>	 (CR) MVernon: [V: +2 C: +2] codfw-prod: rebalance the rings [software/swift-ring] - https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: MVernon)
[11:45:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22697 and previous config saved to /var/cache/conftool/dbconfig/20220316-114542-marostegui.json
[11:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:53] <wikibugs>	 (CR) Awight: Deploy template features to enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: Awight)
[11:51:41] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[12:00:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298294)', diff saved to https://phabricator.wikimedia.org/P22698 and previous config saved to /var/cache/conftool/dbconfig/20220316-120047-marostegui.json
[12:00:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[12:00:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[12:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:00:52] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[12:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:00:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22699 and previous config saved to /var/cache/conftool/dbconfig/20220316-120100-marostegui.json
[12:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:44] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:02:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22700 and previous config saved to /var/cache/conftool/dbconfig/20220316-120219-marostegui.json
[12:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:12:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22701 and previous config saved to /var/cache/conftool/dbconfig/20220316-121240-marostegui.json
[12:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:44] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[12:14:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS buster
[12:14:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:14] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**)   -...
[12:17:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22702 and previous config saved to /var/cache/conftool/dbconfig/20220316-121724-marostegui.json
[12:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:00] <wikibugs>	 (CR) TsepoThoabala: [C: +1] Enable IPInfo on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: Tchanders)
[12:22:14] <wikibugs>	 (PS25) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[12:22:16] <wikibugs>	 (PS14) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[12:22:34] <wikibugs>	 (PS1) MVernon: codfw-prod: rebalance the rings [software/swift-ring] - https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507)
[12:23:40] <wikibugs>	 (CR) MVernon: "Hi," [software/swift-ring] - https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: MVernon)
[12:25:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[12:25:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[12:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:52] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[12:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS buster
[12:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22703 and previous config saved to /var/cache/conftool/dbconfig/20220316-122745-marostegui.json
[12:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:54] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster
[12:29:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22704 and previous config saved to /var/cache/conftool/dbconfig/20220316-122906-marostegui.json
[12:29:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:11] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[12:32:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22705 and previous config saved to /var/cache/conftool/dbconfig/20220316-123229-marostegui.json
[12:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:33] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for inference:30443 is about to expire   - https://alerts.wikimedia.org
[12:42:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22707 and previous config saved to /var/cache/conftool/dbconfig/20220316-124250-marostegui.json
[12:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22708 and previous config saved to /var/cache/conftool/dbconfig/20220316-124411-marostegui.json
[12:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298294)', diff saved to https://phabricator.wikimedia.org/P22709 and previous config saved to /var/cache/conftool/dbconfig/20220316-124734-marostegui.json
[12:47:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[12:47:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[12:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:39] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[12:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22710 and previous config saved to /var/cache/conftool/dbconfig/20220316-124742-marostegui.json
[12:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[12:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22711 and previous config saved to /var/cache/conftool/dbconfig/20220316-124943-marostegui.json
[12:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:15] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[12:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:46] <wikibugs>	 (CR) Clare Ming: [C: +1] Add script to update vector skin preferences (1 comment) [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[12:57:26] <wikibugs>	 (PS5) Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465)
[12:57:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T297189)', diff saved to https://phabricator.wikimedia.org/P22712 and previous config saved to /var/cache/conftool/dbconfig/20220316-125755-marostegui.json
[12:57:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[12:57:58] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Just a note that the task for cloudvirt1024 is T303773, this task is for 1025/1026. They are failing for different reasons AFAICT.
[12:57:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[12:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:00] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[12:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22713 and previous config saved to /var/cache/conftool/dbconfig/20220316-125803-marostegui.json
[12:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22714 and previous config saved to /var/cache/conftool/dbconfig/20220316-125916-marostegui.json
[12:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1300).
[13:00:05] <jouncebot>	 awight: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:19] <awight>	 I can deploy.
[13:01:15] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: Awight)
[13:01:26] <wikibugs>	 (CR) Krinkle: [C: +1] Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[13:01:47] <wikibugs>	 (PS2) Jbond: puppet: add vendored module support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771008
[13:01:58] <wikibugs>	 (Merged) jenkins-bot: Deploy template features to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771331 (https://phabricator.wikimedia.org/T302857) (owner: Awight)
[13:02:22] <Krinkle>	 awight: let me know when the backport(s) are done
[13:02:51] <awight>	 Krinkle: ack
[13:03:05] <awight>	 WMDE-Fisch: new config is on mwdebug1001
[13:03:17] <WMDE-Fisch>	 I'll have a look
[13:04:05] <awight>	 I see the new features on enwiki
[13:04:15] <awight>	 (but not too many :-)
[13:04:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22715 and previous config saved to /var/cache/conftool/dbconfig/20220316-130448-marostegui.json
[13:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:31] <wikibugs>	 (CR) Clare Ming: [C: +1] "waiting for Amir's review -- hopefully this can still be deployed here soon" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[13:06:12] <WMDE-Fisch>	 awight: Looks good. I could check off all things we wanted live.
[13:06:21] <WMDE-Fisch>	 Also the thing we want not live ;-)
[13:06:25] <WMDE-Fisch>	 Seems to work!
[13:06:26] <awight>	 Thanks, syncing.
[13:06:38] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Fix invalid ref to last_backup_with_snapshot.valid [puppet] - https://gerrit.wikimedia.org/r/770999 (https://phabricator.wikimedia.org/T303870) (owner: Andrew Bogott)
[13:07:30] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771331|Deploy template features to enwiki (T302857)]] (duration: 00m 50s)
[13:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:34] <stashbot>	 T302857: Deploy first template focus-area improvements to enwiki - https://phabricator.wikimedia.org/T302857
[13:08:38] <awight>	 Krinkle: I'm all done, good luck!
[13:08:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:09:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:06] <Krinkle>	 awight: thanks
[13:10:09] <Krinkle>	 Thanks! :)
[13:10:48] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:10:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:20] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle)
[13:14:09] <wikibugs>	 (03Merged) 10jenkins-bot: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle)
[13:14:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22716 and previous config saved to /var/cache/conftool/dbconfig/20220316-131421-marostegui.json
[13:14:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:14:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:26] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[13:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22717 and previous config saved to /var/cache/conftool/dbconfig/20220316-131429-marostegui.json
[13:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22718 and previous config saved to /var/cache/conftool/dbconfig/20220316-131953-marostegui.json
[13:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:22:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:59] <wikibugs>	 (03PS4) 10Volans: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456
[13:24:11] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS buster
[13:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:22] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS buster
[13:25:07] <logmsgbot>	 !log krinkle@deploy1002 Synchronized w/static.php: 159dfd21d (duration: 00m 50s)
[13:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS buster
[13:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**)   -...
[13:26:10] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[13:27:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[13:28:37] <wikibugs>	 (03Merged) 10jenkins-bot: Adopt the new alerting API on all cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans)
[13:31:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22720 and previous config saved to /var/cache/conftool/dbconfig/20220316-133153-marostegui.json
[13:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:58] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[13:34:48] <wikibugs>	 (03PS2) 10BBlack: DNS: add drmrs dcmap ressources [dns] - 10https://gerrit.wikimedia.org/r/771342 (owner: 10Ayounsi)
[13:34:50] <wikibugs>	 (03PS1) 10BBlack: geo-res: align whitespace (no-op) [dns] - 10https://gerrit.wikimedia.org/r/771353
[13:35:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298294)', diff saved to https://phabricator.wikimedia.org/P22721 and previous config saved to /var/cache/conftool/dbconfig/20220316-133458-marostegui.json
[13:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:05] <stashbot>	 T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[13:36:17] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] geo-res: align whitespace (no-op) [dns] - 10https://gerrit.wikimedia.org/r/771353 (owner: 10BBlack)
[13:42:55] <wikibugs>	 (03PS3) 10Jbond: puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008
[13:44:02] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[13:44:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[13:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:11] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[13:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22722 and previous config saved to /var/cache/conftool/dbconfig/20220316-134658-marostegui.json
[13:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:45] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM (1 typo inline)" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[13:53:46] <wikibugs>	 (03CR) 10Jbond: "pcc[1] shows no op" [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway)
[13:57:02] <wikibugs>	 (03PS10) 10Jbond: P:base::production: Add profile::netbox::host [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397)
[13:57:04] <wikibugs>	 (03CR) 10Jbond: "done thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond)
[13:57:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] codfw-prod: rebalance the rings (031 comment) [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon)
[13:57:32] <wikibugs>	 (03CR) 10Ladsgroup: "I couldn't check it in depth as I'm not 100% familiar with how user preferences work. That being said, here are the suggestions:" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson)
[13:57:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS buster
[13:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:53] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster
[13:58:57] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] DNS: add drmrs dcmap ressources [dns] - 10https://gerrit.wikimedia.org/r/771342 (owner: 10Ayounsi)
[14:02:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22723 and previous config saved to /var/cache/conftool/dbconfig/20220316-140203-marostegui.json
[14:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:24] <wikibugs>	 (03PS1) 10Ayounsi: GeoDNS Cyprus to drmrs [dns] - 10https://gerrit.wikimedia.org/r/771354
[14:02:29] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771348 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon)
[14:03:35] <wikibugs>	 (03PS1) 10Elukey: install_server: add the flat-noswap.cfg recipe/override [puppet] - 10https://gerrit.wikimedia.org/r/771355
[14:03:37] <wikibugs>	 (03PS1) 10Elukey: install_server: move kubernetes200[5,6] to the new flat-noswap recipe [puppet] - 10https://gerrit.wikimedia.org/r/771356 (https://phabricator.wikimedia.org/T300744)
[14:04:40] <taavi>	 jouncebot: nowandnext
[14:04:40] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 55 minute(s)
[14:04:40] <jouncebot>	 In 3 hour(s) and 55 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[14:04:40] <jouncebot>	 In 3 hour(s) and 55 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[14:05:01] <wikibugs>	 (03PS1) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465)
[14:05:32] <wikibugs>	 (03PS1) 10Majavah: Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861)
[14:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[14:06:02] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861) (owner: 10Majavah)
[14:07:22] <wikibugs>	 (03PS1) 10Jbond: nagios_common: change ssle warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932)
[14:08:54] <wikibugs>	 (03Merged) 10jenkins-bot: Replace use of deprecated RecentChange::getEngine [extensions/CentralAuth] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770942 (https://phabricator.wikimedia.org/T303861) (owner: 10Majavah)
[14:09:02] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34347/console" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[14:09:22] <icinga-wm>	 RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34346/console" [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond)
[14:10:09] <herron>	 !log grafana1002:~# systemctl restart grafana-ldap-users-sync.service T303064
[14:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:14] <stashbot>	 T303064: grafana-ldap-users-sync fails to finish intermittently - https://phabricator.wikimedia.org/T303064
[14:12:36] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:12:51] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/CentralAuth/includes/User/CentralAuthUser.php: Backport: [[gerrit:770942|Replace use of deprecated RecentChange::getEngine (T303861)]] (duration: 00m 51s)
[14:12:53] <wikibugs>	 (03CR) 10Elukey: "Alex, I know that you had questions about the priority and max partition size, but for this code review I tried to change as few items as " [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey)
[14:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:55] <stashbot>	 T303861: PHP Deprecated: Use of RecentChange::getEngine was deprecated in MediaWiki 1.29. [Called from MediaWiki\Extension\CentralAuth\User\CentralAuthUser::attach] - https://phabricator.wikimedia.org/T303861
[14:13:08] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) I updated that script to completely...
[14:13:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Agreed re: tweaking sizes in a followup change" [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey)
[14:15:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:15:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:15:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:02] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Volans) >>! In T281249#7781991, @Ladsgroup wrot...
[14:16:06] <wikibugs>	 (03Abandoned) 10Ssingh: Add Wikidough's /24 to bgp_out in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/757635 (owner: 10Ssingh)
[14:16:38] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[14:17:02] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) >>! In T281249#7781991, @Ladsgroup...
[14:17:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T297189)', diff saved to https://phabricator.wikimedia.org/P22724 and previous config saved to /var/cache/conftool/dbconfig/20220316-141708-marostegui.json
[14:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[14:17:12] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[14:17:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[14:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance
[14:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance
[14:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[14:17:38] <wikibugs>	 (03PS1) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359
[14:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22725 and previous config saved to /var/cache/conftool/dbconfig/20220316-141918-marostegui.json
[14:19:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:24] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[14:20:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[14:20:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: add the flat-noswap.cfg recipe/override [puppet] - 10https://gerrit.wikimedia.org/r/771355 (owner: 10Elukey)
[14:25:13] <Emperor>	 !log depooling ms-fe100[5-8] prior to decommissioning T303733
[14:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:17] <stashbot>	 T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733
[14:26:38] <wikibugs>	 (03CR) 10Ayounsi: "As data point I ran 2 RIPE measurements from Cyprus to esams and drmrs:" [dns] - 10https://gerrit.wikimedia.org/r/771354 (owner: 10Ayounsi)
[14:30:16] <wikibugs>	 (03PS1) 10MVernon: swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733)
[14:30:51] <wikibugs>	 (03PS2) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359
[14:33:46] <Amir1>	 jouncebot: nowandnext
[14:33:47] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 26 minute(s)
[14:33:47] <jouncebot>	 In 3 hour(s) and 26 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[14:33:47] <jouncebot>	 In 3 hour(s) and 26 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[14:34:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22726 and previous config saved to /var/cache/conftool/dbconfig/20220316-143423-marostegui.json
[14:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:53] <cjming>	 Amir1: thanks
[14:35:02] <wikibugs>	 (03PS3) 10Ssingh: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359
[14:35:12] <Amir1>	 cjming: there is no deployment happening it seems, the floor is yours
[14:35:18] <wikibugs>	 (03CR) 10MVernon: "I think I caught all the necessary changes in one CR this time :)" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon)
[14:35:52] <XioNoX>	 !log add anycast6 peers in drmrs
[14:35:54] <cjming>	 Amir1: cool
[14:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:52] <cjming>	 fyi for all, I'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/770937
[14:37:01] <cjming>	 ^ in the next few
[14:39:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: move kubernetes200[5,6] to the new flat-noswap recipe [puppet] - 10https://gerrit.wikimedia.org/r/771356 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[14:39:52] <wikibugs>	 (03PS2) 10Herron: watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147)
[14:40:50] <wikibugs>	 (03CR) 10Herron: watchrat: require 3+ sites to agree on error status before alerting (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron)
[14:42:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron)
[14:43:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS buster
[14:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**)   -...
[14:44:26] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+2] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson)
[14:45:07] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Anycast neighbors manually configured on the switches." [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh)
[14:45:38] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[14:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[14:45:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[14:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:15] <wikibugs>	 (03CR) 10CDanis: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/770944 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis)
[14:46:32] <wikibugs>	 (03Merged) 10jenkins-bot: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson)
[14:46:38] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Cross-ref Grafana dashboard in statograph hiera [puppet] - 10https://gerrit.wikimedia.org/r/770944 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis)
[14:46:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS buster
[14:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster
[14:47:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[14:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:06] <wikibugs>	 (03PS1) 10JMeybohm: Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932)
[14:48:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] nagios_common: change ssle warnings from 10 days to 9 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond)
[14:49:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22727 and previous config saved to /var/cache/conftool/dbconfig/20220316-144928-marostegui.json
[14:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:25] <wikibugs>	 (03PS3) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[14:50:42] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:51:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:53] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:52:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:52:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:52:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[14:52:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:53:04] <sukhe>	 !log rolling restart of pdns-recursor.service and dnsdist.service on doh* hosts for OpenSSL updates
[14:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:24] <moritzm>	 !log restarting nginx/dhcpd on install/apt servers
[14:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:23] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaMaintenance/T299104.php: Backport: [[gerrit:770937|Add script to update vector skin preferences (T299104)]] (duration: 00m 51s)
[14:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:27] <stashbot>	 T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104
[14:55:45] <sukhe>	 !log rolling restart of nginx.service on durum* hosts for OpenSSL updates
[14:55:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10cmooney) FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing rows.
[14:56:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 (owner: 10Jbond)
[14:56:39] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:57:41] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:58:42] <wikibugs>	 (03Merged) 10jenkins-bot: puppet: add vendor_module support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771008 (owner: 10Jbond)
[14:59:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[14:59:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[14:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22728 and previous config saved to /var/cache/conftool/dbconfig/20220316-145946-marostegui.json
[14:59:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[14:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:50] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[15:00:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[15:02:28] <wikibugs>	 (03PS5) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[15:03:48] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[15:04:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298557)', diff saved to https://phabricator.wikimedia.org/P22729 and previous config saved to /var/cache/conftool/dbconfig/20220316-150433-marostegui.json
[15:04:35] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[15:04:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:04:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:38] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[15:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:51] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[15:05:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:05:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:05:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:59] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10dancy) I verified that I can run docker commands now.  Thanks @Joe!
[15:07:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for ssh-gitlab [puppet] - 10https://gerrit.wikimedia.org/r/771362 (https://phabricator.wikimedia.org/T135991)
[15:08:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[15:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:59] <dancy>	 jouncebot nowandnext
[15:09:59] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 50 minute(s)
[15:09:59] <jouncebot>	 In 2 hour(s) and 50 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[15:09:59] <jouncebot>	 In 2 hour(s) and 50 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[15:11:03] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[15:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:19] <dancy>	 !log Testing mediawiki image build on deploy server again
[15:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:29] <wikibugs>	 (03PS1) 10Btullis: Add dummy deployment user/tokens for datahub [labs/private] - 10https://gerrit.wikimedia.org/r/771363 (https://phabricator.wikimedia.org/T303049)
[15:11:52] <wikibugs>	 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) I have created deployment users and tokens in `profile::kubernetes::infrastructure_users:` key in the private repo, as well as corresponding dummy valu...
[15:12:03] <logmsgbot>	 !log dancy@deploy1002 Started scap: (no justification provided)
[15:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:58] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup), 10User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) >>! In T281249#7782008, @Volans wrot...
[15:14:48] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: bump software version [puppet] - 10https://gerrit.wikimedia.org/r/771366
[15:15:41] <logmsgbot>	 !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restricted/mediaw
[15:15:42] <logmsgbot>	 iki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. (duration: 03m 38s)
[15:15:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:33] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing mediawiki image build
[15:17:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:27] <wikibugs>	 (03PS2) 10Jbond: nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932)
[15:18:35] <wikibugs>	 (03CR) 10Jbond: nagios_common: change ssl warnings from 10 days to 9 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond)
[15:19:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: 10JMeybohm)
[15:19:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump software version [puppet] - 10https://gerrit.wikimedia.org/r/771366 (owner: 10Jbond)
[15:24:14] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:26:21] <wikibugs>	 (03CR) 10Klausman: "directory structure fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/770973" [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman)
[15:28:50] <wikibugs>	 (03PS1) 10Urbanecm: cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369
[15:28:52] <urbanecm>	 jouncebot: nowandnext
[15:28:52] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 31 minute(s)
[15:28:52] <jouncebot>	 In 2 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[15:28:52] <jouncebot>	 In 2 hour(s) and 31 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[15:28:58] <urbanecm>	 let me push the above out ^^
[15:29:05] <jeena>	 urbanecm: check with dancy
[15:29:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm)
[15:29:25] <urbanecm>	 sorry
[15:29:31] <urbanecm>	 dancy: may i? :)
[15:29:36] <urbanecm>	 (cancelled the +2 for now)
[15:29:59] <dancy>	 urbanecm: If it's urgent, I can cancel my operation and restart it after. I suspect it'll take about 30 more minutes to complete.
[15:30:06] <urbanecm>	 i can wait 30m
[15:30:24] <dancy>	 ok. If it goes longer than that I'll cancel.
[15:35:15] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[15:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:23] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[15:35:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[15:35:32] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:35:32] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:33] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[15:36:06] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:36:06] <wikibugs>	 (03PS2) 10Ladsgroup: idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond)
[15:37:30] <ryankemper>	 !log [WCQS] Restarted updater across fleet to get out jvm sec upgrades: `ryankemper@cumin1001:~$ sudo -E cumin 'wcqs*' 'systemctl restart wcqs-updater.service'`
[15:37:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:01] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[15:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:08] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:38:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[15:39:12] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:40:12] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:40:14] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:40:26] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[15:41:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:42:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22730 and previous config saved to /var/cache/conftool/dbconfig/20220316-154206-marostegui.json
[15:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:11] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[15:42:50] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:43:17] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond)
[15:43:29] <logmsgbot>	 !log dancy@deploy1002 scap failed: RuntimeError dictionary changed size during iteration (duration: 25m 55s)
[15:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:48] <dancy>	 urbanecm: I tested up to the point that I needed to.  All yours now.
[15:43:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:43:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:56] <urbanecm>	 thanks!
[15:44:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm)
[15:44:58] <wikibugs>	 (03Merged) 10jenkins-bot: cswiki: Add celebration logo for 500k [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771369 (owner: 10Urbanecm)
[15:45:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache/CI [puppet] - 10https://gerrit.wikimedia.org/r/770467 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:45:26] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org
[15:46:08] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: cswiki celebration logos (duration: 00m 50s)
[15:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:41] <wikibugs>	 (03PS3) 10Ladsgroup: idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond)
[15:47:53] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond)
[15:49:02] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: cswiki celebration logo (duration: 00m 49s)
[15:49:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:55] <urbanecm>	 dancy: i'm done. if you have anything else to test, feel free to resume
[15:51:02] <dancy>	 great, I shall.
[15:51:12] <moritzm>	 !log restarting exim/spamassassin on MXes to pick up new OpenSSL
[15:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:11] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS buster
[15:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:21] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**)   -...
[15:52:43] <wikibugs>	 (03CR) 10Ladsgroup: "PCC looks good to me: https://puppet-compiler.wmflabs.org/pcc-worker1002/1244/dborch1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond)
[15:52:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[15:52:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[15:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:56] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[15:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298557)', diff saved to https://phabricator.wikimedia.org/P22731 and previous config saved to /var/cache/conftool/dbconfig/20220316-155300-marostegui.json
[15:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[15:53:05] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[15:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:10] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[15:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22732 and previous config saved to /var/cache/conftool/dbconfig/20220316-155711-marostegui.json
[15:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:11] <logmsgbot>	 !log dancy@deploy1002 Synchronized README: testing mediawiki image build (duration: 02m 11s)
[15:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:42] <aqu>	 !log analytics/refinery - scap deploy "Migrate session_length/daily from Oozie to Airflow"
[16:02:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:21] <wikibugs>	 (03PS1) 10Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104)
[16:03:47] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup)
[16:05:39] <wikibugs>	 (03PS1) 10MVernon: codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771375 (https://phabricator.wikimedia.org/T303507)
[16:07:03] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10CDanis) As of yesterday, instructions have been shared with the SRE...
[16:07:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS buster
[16:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:23] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] "Another routine operation, so self-reviewing." [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771375 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon)
[16:07:28] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster
[16:08:26] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:26] <wikibugs>	 (03PS3) 10Ladsgroup: mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397)
[16:09:30] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup)
[16:09:31] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) >>! In T202061#7767276, @CDanis wrote: > [ ... ] > I'll put the above in a...
[16:10:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond)
[16:10:11] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup)
[16:10:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - 10https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: 10JMeybohm)
[16:10:41] <wikibugs>	 10SRE, 10Traffic, 10User-Ladsgroup: Rework education.wikimedia.org redirects - https://phabricator.wikimedia.org/T303397 (10Ladsgroup) 05Open→03Resolved
[16:12:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon)
[16:12:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22733 and previous config saved to /var/cache/conftool/dbconfig/20220316-161216-marostegui.json
[16:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway)
[16:18:47] <wikibugs>	 (03PS2) 10MVernon: swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733)
[16:19:00] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:19:30] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:19:59] <wikibugs>	 (03CR) 10MVernon: swift: remove ms-fe100[5-8] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon)
[16:21:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon)
[16:22:28] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@d039471]: Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471]
[16:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:40] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: remove ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/771360 (https://phabricator.wikimedia.org/T303733) (owner: 10MVernon)
[16:27:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nagios_common: change ssl warnings from 10 days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/771358 (https://phabricator.wikimedia.org/T303932) (owner: 10Jbond)
[16:27:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T297189)', diff saved to https://phabricator.wikimedia.org/P22734 and previous config saved to /var/cache/conftool/dbconfig/20220316-162721-marostegui.json
[16:27:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[16:27:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[16:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:26] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[16:27:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[16:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:45] <jbond>	 Emperor: happy for me to merge yours
[16:28:02] <Emperor>	 !log moving swiftrepl and stats reporter host from ms-fe1005 to ms-fe1009 T303733
[16:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:06] <stashbot>	 T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733
[16:28:18] <Emperor>	 jbond: OK, thanks
[16:29:38] <wikibugs>	 (03PS6) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138)
[16:30:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[16:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:07] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:24] <dancy>	 Gah!
[16:31:29] <wikibugs>	 (03PS7) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138)
[16:32:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10Cmjohnson) @BTullis Can you plan to shut this down tomorrow 17 March at 10a EST 1400 UTC.
[16:33:18] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10BTullis) Yes, will do. Both nodes at the same time?
[16:34:28] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.timer,swift_dispersion_stats_lowlatency.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:59] <Emperor>	 !log rolling restart of ms-fe10[09-12] so they know about removal of older proxies T303733
[16:37:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:02] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:37:03] <stashbot>	 T303733: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733
[16:37:11] <wikibugs>	 (03PS2) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064)
[16:37:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite)
[16:38:28] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:39:39] <wikibugs>	 (03PS3) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064)
[16:40:45] <wikibugs>	 (03PS8) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138)
[16:44:01] <wikibugs>	 (03CR) 10JHathaway: "John I believe this is ready for another review, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway)
[16:45:18] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) @CDanis I think this is probably good to close, we can always...
[16:45:29] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[16:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[16:47:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) Shouldn't be an issue with installing these in E4 / F4.  However the configuration of the switches there won't be compl...
[16:48:17] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471]: Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 25m 49s)
[16:48:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:28] <icinga-wm>	 PROBLEM - Number of mw swift objects in codfw greater than eqiad on alert1001 is CRITICAL: execution: found duplicate series for the match group {account=mw-media, class=deleted} on the right hand-side of the operation: [{__name__=swift_container_stats_objects_total, account=mw-media, class=deleted, cluster=swift, instance=ms-fe1009:9112, job=statsd_exporter, site=eqiad}, {__name__=swift_container_stats_objects_total, account=mw-media, cl
[16:48:28] <icinga-wm>	 ted, cluster=swift, instance=ms-fe1005:9112, job=statsd_exporter, site=eqiad}]:many-to-many matching not allowed: matching labels must be unique on one side https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw
[16:51:55] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) 05Open→03Resolved
[16:51:59] <wikibugs>	 (03Abandoned) 10Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: 10Clare Ming)
[16:52:01] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata)
[16:53:01] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jhathaway)
[16:53:45] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) 05Open→03Resolved Community modules have now been moved to vendor_modules, thanks everyone for the discussion & feedback.
[16:56:32] <icinga-wm>	 PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: execution: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right) https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad
[17:00:02] <elukey>	 Emperor: --^ (I guess it is part of maintenance but in case it is not I am pinging you :)
[17:00:52] <Emperor>	 elukey: thanks, yes, this seems to happen when we move swift stats_reporter_host around
[17:01:05] <Emperor>	 it should resolve in ~10m or so
[17:03:08] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:03:44] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) I'm messing around with the perccli64 binary, but I admit it's new to me and I'm not versed in it at all.  Additionally, the dumpsdata1007 host isn't set up ideally, as I couldn't get the installer...
[17:04:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:04:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:10] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@d039471] (thin): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471]
[17:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:17] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471] (thin): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 00m 07s)
[17:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:35] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@d039471] (hadoop-test): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471]
[17:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:36] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493)
[17:11:47] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34360/console" [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah)
[17:11:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster
[17:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:08] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**)   -...
[17:12:13] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493)
[17:12:37] <wikibugs>	 (03PS26) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:12:39] <wikibugs>	 (03PS15) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:13:59] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@d039471] (hadoop-test): Migrate session_length/daily from Oozie to Airflow [analytics/refinery@d039471] (duration: 07m 23s)
[17:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[17:14:37] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34361/console" [puppet] - 10https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah)
[17:17:19] <Emperor>	 godog: the Number of mw swift objects in codfw greater than eqiad alerts don't seem to be self-resolving this time; any ideas? AFAICT swift_dispersion_stats.service on ms-fe1009 is happy...
[17:21:10] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[17:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:20] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[17:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[17:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:53] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) Echo of my testing so far:  setting the drive info via show and setting it to on or offline works, but not setting to missing or sending rebuild command   ` root@dumpsdata1007:/usr/local/bin# percc...
[17:22:11] <wikibugs>	 (03PS2) 10Milimetric: Eventlogging: Remove unused RUM Speed Index. [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog)
[17:22:50] <wikibugs>	 (03PS27) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:22:52] <wikibugs>	 (03PS16) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:22:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for the use of wmflib, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite)
[17:23:13] <wikibugs>	 (03CR) 10Milimetric: [C: 03+1] "+1 for me to remove, but I can't merge in this repo.  I echo @ottomata's comment on removing from wgEventStreams" [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog)
[17:24:09] <wikibugs>	 (03PS1) 10Majavah: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246)
[17:24:22] <wikibugs>	 (03PS1) 10Btullis: Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049)
[17:25:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah)
[17:27:06] <wikibugs>	 (03PS2) 10Majavah: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246)
[17:28:09] <wikibugs>	 (03PS1) 10Btullis: Add a namespace for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049)
[17:31:03] <wikibugs>	 (03PS1) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410
[17:31:05] <wikibugs>	 (03PS1) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315)
[17:32:09] <dancy>	 jouncebot nowandnext
[17:32:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 27 minute(s)
[17:32:10] <jouncebot>	 In 0 hour(s) and 27 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[17:32:10] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800)
[17:32:29] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: 10Ahmon Dancy)
[17:32:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[17:33:17] <wikibugs>	 (03PS2) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315)
[17:33:35] <wikibugs>	 (03Merged) 10jenkins-bot: mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: 10Ahmon Dancy)
[17:34:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34363/console" [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[17:36:52] <logmsgbot>	 !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:771001|mwscript: Support --force-version flag (T303878)]] (duration: 00m 57s)
[17:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:57] <stashbot>	 T303878: multiversion/MWScript.php: Allow specifying a specific version of code to run - https://phabricator.wikimedia.org/T303878
[17:37:28] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[17:39:06] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "I think this is a good idea.  I expect that some people's data code might break, esp if they are hitting the MW API from within analytics " [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond)
[17:47:35] <wikibugs>	 (03PS1) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415
[17:48:39] <wikibugs>	 (03PS4) 10Cwhite: grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064)
[17:49:21] <wikibugs>	 (03CR) 10Cwhite: grafana ldap users sync: enable retries (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite)
[17:50:25] <wikibugs>	 (03CR) 10Jbond: P:java: update profile::java to use systemd::environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond)
[17:50:35] <wikibugs>	 (03PS2) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415
[17:51:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34365/console" [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond)
[17:52:24] <wikibugs>	 (03PS1) 10Hnowlan: WIP: build docker images using blubber and pip dependencies [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/771416 (https://phabricator.wikimedia.org/T267327)
[17:52:29] <wikibugs>	 (03PS1) 10Milimetric: Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389
[17:52:59] <icinga-wm>	 RECOVERY - Number of mw swift objects in eqiad greater than codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad
[17:53:07] <icinga-wm>	 RECOVERY - Number of mw swift objects in codfw greater than eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw
[17:53:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric)
[17:54:19] <wikibugs>	 (03CR) 10Majavah: systemd: Add new define to manage user service environments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond)
[17:55:05] <wikibugs>	 (03CR) 10Krinkle: "Note that "current" is only used for /static/current in mw-k8s which effectively receives no traffic currently, so that's essentially a no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle)
[17:57:45] <wikibugs>	 (03PS2) 10Milimetric: Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389
[17:57:58] <wikibugs>	 (03CR) 10Milimetric: [C: 04-1] "hang on for a minute while we check with Olja" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric)
[18:00:04] <jouncebot>	 jeena and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage with CPT . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800).
[18:00:04] <jouncebot>	 jeena and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T1800).
[18:00:16] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time
[18:00:18] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time
[18:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:26] <jeena>	 Train is blocked. Sending the email
[18:01:25] <Emperor>	 I kicked the prometheus-statsd-exporter on the old frontend, that is at least coincidental with the alert clearing...
[18:02:23] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics@257960f]
[18:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:32] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics@257960f] (duration: 00m 08s)
[18:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite)
[18:05:23] <wikibugs>	 (03PS1) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565)
[18:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[18:05:56] <wikibugs>	 (03PS2) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565)
[18:06:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[18:08:14] <wikibugs>	 (03PS3) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565)
[18:09:18] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics_test@257960f]
[18:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:27] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@257960f]: Migrate session_length/daily from Oozie to Airflow [airflow-dags/analytics_test@257960f] (duration: 00m 08s)
[18:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:35] <wikibugs>	 (03PS4) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565)
[18:13:33] <wikibugs>	 (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34368/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[18:14:07] <wikibugs>	 (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2015 [puppet] - 10https://gerrit.wikimedia.org/r/771422 (https://phabricator.wikimedia.org/T300744)
[18:14:09] <wikibugs>	 (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2016 [puppet] - 10https://gerrit.wikimedia.org/r/771423 (https://phabricator.wikimedia.org/T300744)
[18:14:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:02] <wikibugs>	 (03PS5) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562)
[18:18:37] <wikibugs>	 (03CR) 10Ebernhardson: elasticsearch: remove custom restart handling (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking)
[18:20:51] <wikibugs>	 (03CR) 10Razzi: "Catalog diff: https://puppet-compiler.wmflabs.org/pcc-worker1002/34368/karapace1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[18:20:59] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis)
[18:22:43] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:27:38] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh)
[18:30:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add Wikidough's /24 (bgp_out) and /48 (bgp6_out) in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/771359 (owner: 10Ssingh)
[18:32:52] <sukhe>	 !log running: homer "cr*-drmrs*" commit "Gerrit 771359: Set up BGP peering in drmrs for Wikidough."
[18:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) Just to follow up I've the TAC case open with Juniper since this morning but they have been slow to respond, and not grasping the exact issue in the...
[18:44:38] <wikibugs>	 (03PS13) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[18:47:41] <wikibugs>	 (03PS14) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[18:48:59] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:50:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi)
[18:54:02] <wikibugs>	 (03PS1) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396)
[18:54:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata)
[18:55:18] <wikibugs>	 (03PS2) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396)
[18:56:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata)
[18:57:15] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Cmjohnson) @cmrooney thanks!!
[18:58:36] <wikibugs>	 (03PS3) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396)
[19:00:41] <wikibugs>	 (03PS4) 10Ottomata: Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396)
[19:01:58] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34372/console" [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata)
[19:02:23] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "This should be a no-op.  Next patch will roll this out in test cluster" [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata)
[19:06:37] <wikibugs>	 (03PS15) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[19:08:53] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1 C: 03+2] Added gobblin_shaded_jar param to gobblin_job [puppet] - 10https://gerrit.wikimedia.org/r/771430 (https://phabricator.wikimedia.org/T292396) (owner: 10Ottomata)
[19:14:22] <logmsgbot>	 !log otto@deploy1002 Started deploy [analytics/refinery@2d2056a] (hadoop-test): (no justification provided)
[19:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:06] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:20:56] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:13] <logmsgbot>	 !log otto@deploy1002 Finished deploy [analytics/refinery@2d2056a] (hadoop-test): (no justification provided) (duration: 07m 50s)
[19:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:54] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:41] <wikibugs>	 (03PS1) 10Jbond: wmflib: add class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559)
[19:39:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34373/console" [puppet] - 10https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond)
[19:40:12] <wikibugs>	 (03PS1) 10Ssingh: definitions: add drmrs to wikimedia-private [homer/public] - 10https://gerrit.wikimedia.org/r/771438
[19:42:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] grafana ldap users sync: enable retries [puppet] - 10https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: 10Cwhite)
[19:43:29] <wikibugs>	 (03CR) 10Hoo man: [C: 03+1] Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 (owner: 10Lucas Werkmeister (WMDE))
[19:51:51] <wikibugs>	 (03PS16) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[19:56:20] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:40] <wikibugs>	 10SRE, 10Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7780748, @Peachey88 wrote: > Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36...
[20:00:05] <jouncebot>	 RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220316T2000).
[20:00:05] <jouncebot>	 zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <zabe>	 o/
[20:00:15] <urbanecm>	 oh no, unfortunate deployments
[20:00:27] <urbanecm>	 but i won't disagree with you jouncebot
[20:00:30] <urbanecm>	 I can deploy today :-)
[20:00:37] <urbanecm>	 hello zabe 
[20:00:44] <zabe>	 hey
[20:00:52] <wikibugs>	 (03PS17) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:01:09] <urbanecm>	 zabe: we write to the wmg version already, right?
[20:01:49] <zabe>	 yes, see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766229
[20:02:01] <wikibugs>	 (PS2) Jbond: wmflib: add class_hosts [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559)
[20:02:03] <wikibugs>	 (PS1) Jbond: P:scap::dsh: Add scpa targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559)
[20:02:34] <wikibugs>	 (PS18) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:02:42] <urbanecm>	 in that case, it should be syncable easily (in more or less any order), right?
[20:02:52] * urbanecm tries to make sure this patch is safely deployable
[20:03:01] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34374/console" [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[20:03:20] <zabe>	 yes
[20:04:27] <urbanecm>	 let's do it then
[20:04:30] <wikibugs>	 (PS3) Urbanecm: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:04:35] <wikibugs>	 (CR) Ahmon Dancy: P:scap::dsh: Add scpa targets as a dsh group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[20:04:38] <wikibugs>	 (CR) Urbanecm: [C: +2] Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:04:52] <icinga-wm>	 PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:05:00] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:05:29] <wikibugs>	 (Merged) jenkins-bot: Migrate wmfDatacenter(s) to wmgDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/768254 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:06:33] <wikibugs>	 (CR) Ahmon Dancy: wmflib: add class_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[20:06:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[20:07:05] <urbanecm>	 zabe: pulled to mwdebug1001, please have a look
[20:07:21] <dancy>	 jbond: Sorry for typo nitpicks.  I'm very happy to see T303559 moving along!
[20:07:22] <stashbot>	 T303559: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559
[20:07:43] <wikibugs>	 (PS1) RobH: dumpsdata1006 setup info [puppet] - https://gerrit.wikimedia.org/r/771442 (https://phabricator.wikimedia.org/T302937)
[20:08:08] <wikibugs>	 (CR) RobH: [C: +2] dumpsdata1006 setup info [puppet] - https://gerrit.wikimedia.org/r/771442 (https://phabricator.wikimedia.org/T302937) (owner: RobH)
[20:09:14] <zabe>	 urbanecm, lgtm, stuff doesn't seem to break and logstash looks clear
[20:09:25] <urbanecm>	 let's try it then
[20:10:26] <wikibugs>	 (PS1) BryanDavis: wikitech: Remove DynamicSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006)
[20:10:30] <wikibugs>	 (PS1) BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006)
[20:10:49] <wikibugs>	 (PS19) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:11:24] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 1/3) (duration: 00m 50s)
[20:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:28] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:12:14] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized multiversion/: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 2/3) (duration: 00m 50s)
[20:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:37] <wikibugs>	 (CR) Majavah: [C: -1] "This does not match hosts that only have mediawiki deployed via mediawiki::scap" [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[20:13:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized docroot/noc/db.php: f649199: Migrate wmfDatacenter(s) to wmgDatacenter(s) (T45956; 3/3) (duration: 00m 49s)
[20:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:10] <urbanecm>	 zabe: should be live
[20:13:18] <urbanecm>	 as always, please check logstash for a bit :)
[20:13:25] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops: Etherpads corrupted - https://phabricator.wikimedia.org/T304005 (RLazarus) From the time sliders it looks like the issue is that all or part of the pad gets deleted and replaced by a character, at these revisions respectively:  - https://etherpad.wikimedia.org/p/T...
[20:13:33] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
[20:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:47] <zabe>	 thanks :)
[20:14:05] <Jdlrobson>	 urbanecm: Hi we have 2 late backports (maintenance scripts)
[20:14:15] <urbanecm>	 Jdlrobson: sure thing. can you update calendar please?
[20:14:44] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (Ottomata)
[20:14:51] <Jdlrobson>	 urbanecm: will do
[20:16:03] <wikibugs>	 SRE, Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (Aklapper) > You do not have permission to view this object. Sorry, should work now.
[20:16:47] <wikibugs>	 (Restored) Clare Ming: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: Clare Ming)
[20:17:03] <wikibugs>	 (CR) Jforrester: "❤️" [mediawiki-config] - https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis)
[20:17:35] <urbanecm>	 Jdlrobson: please ping me once it's there :)
[20:17:49] <wikibugs>	 (CR) Jforrester: "This is a bit of a deploy-trap as written; we normally factor these out into three patches (first remove use from CS, then remove setting " [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis)
[20:18:36] <wikibugs>	 (PS1) Jdlrobson: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104)
[20:18:54] <wikibugs>	 (CR) BryanDavis: [C: -1] DynamicSidebar: remove unused extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis)
[20:20:43] <wikibugs>	 (PS2) Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104)
[20:20:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:21:55] <wikibugs>	 (PS2) BryanDavis: DynamicSidebar: remove from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006)
[20:21:57] <wikibugs>	 (PS1) BryanDavis: DynamicSidebar: Remove from InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/771447
[20:21:59] <wikibugs>	 (PS1) BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448
[20:22:44] <wikibugs>	 (PS3) Jdlrobson: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104)
[20:23:46] <wikibugs>	 (CR) BryanDavis: DynamicSidebar: remove from CommonSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis)
[20:24:02] <kostajh>	 urbanecm: hi, can I get a config patch into this window?
[20:24:08] <urbanecm>	 kostajh: sure
[20:24:13] <kostajh>	 ok, patch coming
[20:24:39] <wikibugs>	 (PS1) Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104)
[20:24:48] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[20:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:59] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye
[20:25:45] <wikibugs>	 (Abandoned) Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771387 (https://phabricator.wikimedia.org/T299104) (owner: Clare Ming)
[20:27:04] <urbanecm>	 kostajh: please update the calendar once you have the patch
[20:27:23] <kostajh>	 urbanecm: will do
[20:27:24] <Jdlrobson>	 urbanecm: have updated
[20:27:28] <urbanecm>	 thanks Jdlrobson 
[20:27:40] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/771449/ first
[20:27:45] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/771390/ second
[20:27:48] <Jdlrobson>	 both are maintenance scripts
[20:27:53] <Jdlrobson>	 so i guess no syncing needed?
[20:27:55] <wikibugs>	 (CR) Urbanecm: [C: +2] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:27:57] <wikibugs>	 (CR) Urbanecm: [C: +2] Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:28:02] <wikibugs>	 (CR) Clare Ming: [C: +1] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:28:04] <Jdlrobson>	 we're planning to run them after the window closes.
[20:28:12] <urbanecm>	 Jdlrobson: i need to sync them so they get to the maint script
[20:28:15] <urbanecm>	 *maint server
[20:28:16] <wikibugs>	 (CR) Clare Ming: [C: +1] Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:28:18] <Jdlrobson>	 urbanecm: got it. Thanks!
[20:28:48] <urbanecm>	 but there will be no testing needed :)
[20:29:29] <wikibugs>	 (PS20) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:30:06] <kostajh>	 urbanecm: can you remind me, if I need to modify both InitialiseSettings and InitialiseSettings-labs, should that be in two patches or one?
[20:30:15] <wikibugs>	 (Merged) jenkins-bot: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/771449 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:30:17] <wikibugs>	 (Merged) jenkins-bot: Add insert option for update skin preferences script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771390 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[20:30:19] <urbanecm>	 kostajh: feel free to do it in a single patch
[20:31:16] <urbanecm>	 Jdlrobson: syncing the scripts
[20:32:23] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240)
[20:32:32] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
[20:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH)
[20:34:00] <kostajh>	 urbanecm: added to the calendar
[20:34:34] <urbanecm>	 thanks, let me see
[20:34:37] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/WikimediaMaintenance/: ebfc516: Add script to update vector skin preferences (T299104) (duration: 00m 51s)
[20:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:41] <stashbot>	 T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104
[20:34:43] <Jdlrobson>	 thanks urbanecm 
[20:35:28] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaMaintenance/: 9ba157b: Add insert option for update skin preferences script (T299104) (duration: 00m 50s)
[20:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:39] <urbanecm>	 Jdlrobson: should be live
[20:36:41] <wikibugs>	 (CR) Krinkle: [C: +1] wikitech: Remove DynamicSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006) (owner: BryanDavis)
[20:36:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH)
[20:37:19] <wikibugs>	 (CR) Urbanecm: [C: +2] GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan)
[20:37:38] <urbanecm>	 kostajh: since it is a no-op at prod, do you want to do a mwdebug test?
[20:37:55] <Jdlrobson>	 thanks urbanecm 
[20:38:00] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion [mediawiki-config] - https://gerrit.wikimedia.org/r/771451 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan)
[20:38:27] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
[20:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:51] <kostajh>	 urbanecm: no, it can just be synced IMO
[20:38:56] <urbanecm>	 okay
[20:40:20] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [no-op] 8efa537: GrowthExperiments: Set GEWelcomeSurveyShowMailingListQuestion (T303240) (duration: 00m 53s)
[20:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:24] <stashbot>	 T303240: Welcome emails: opt-in checkbox - https://phabricator.wikimedia.org/T303240
[20:40:29] <urbanecm>	 kostajh: done
[20:40:32] <urbanecm>	 anything else, anyone?
[20:40:44] <bd808>	 Krinkle: Can I nerd snipe you into volunteering to walk those DynamicSidebar removal patches through merge and deploy?
[20:41:14] <kostajh>	 urbanecm: thank you!
[20:41:23] <urbanecm>	 happy to help
[20:41:42] <wikibugs>	 (PS1) Jbond: puppet_compiler: fix facts processing script [puppet] - https://gerrit.wikimedia.org/r/771453
[20:42:06] <kostajh>	 urbanecm: I'd expect to see the checkbox field on https://es.wikipedia.beta.wmflabs.org/wiki/Especial:Encuesta_de_bienvenida, though
[20:42:30] <kostajh>	 https://es.wikipedia.beta.wmflabs.org/wiki/Especial:Versi%C3%B3n says that the supporting code has synced 
[20:42:55] <urbanecm>	 kostajh: it's not yet synced there. it will take up to 30 minutes
[20:43:18] <kostajh>	 urbanecm: ah, the config patch didn't sync there. I see
[20:43:37] <wikibugs>	 (PS21) Razzi: Add cookbooks for running maintain-views [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026)
[20:45:08] <urbanecm>	 yup yup
[20:45:19] <urbanecm>	 i can't easily change when it gets there
[20:45:23] <urbanecm>	 so i suggest waiting
[20:46:04] <wikibugs>	 (PS2) Jbond: puppet_compiler: fix facts processing script [puppet] - https://gerrit.wikimedia.org/r/771453
[20:48:01] <kostajh>	 sounds good
[20:52:51] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
[20:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:55] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34375/console" [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[20:56:20] <cjming>	 hi all - hope it's ok that we run a maintenance script in a few mins -- updating ~35 rows
[20:57:20] <wikibugs>	 (PS3) Jbond: puppet_compiler: fix facts processing script [puppet] - https://gerrit.wikimedia.org/r/771453
[20:57:39] <cjming>	 updating ~35 rows in hewiki + frwiki
[20:58:29] <Jdlrobson>	 urbanecm: will this interfere with anything you are doing?
[20:58:39] <urbanecm>	 cjming: Jdlrobson: go ahead
[21:00:49] <jynus>	 please log when done, it's free! 0:-D
[21:01:59] <wikibugs>	 (PS2) BryanDavis: wikitech: Remove DynamicSidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/771443 (https://phabricator.wikimedia.org/T304006)
[21:02:01] <wikibugs>	 (PS3) BryanDavis: DynamicSidebar: remove from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/771444 (https://phabricator.wikimedia.org/T304006)
[21:02:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH) a:RobH→Cmjohnson cookbook sre.hosts.provision fails for dumpsdata1006.  Please check its mgmt cable and attempt to rerun.
[21:02:05] <wikibugs>	 (PS2) BryanDavis: DynamicSidebar: remove from InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/771447
[21:02:07] <wikibugs>	 (PS2) BryanDavis: DynamicSidebar: remove unused extension [mediawiki-config] - https://gerrit.wikimedia.org/r/771448
[21:04:15] <wikibugs>	 (PS2) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559)
[21:05:20] <wikibugs>	 (PS3) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559)
[21:05:34] <wikibugs>	 (PS1) Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad [homer/public] - https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758)
[21:05:57] <wikibugs>	 (CR) Razzi: "Thanks for the input everybody, especially Volans for the many improvement suggestions." [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi)
[21:06:40] <icinga-wm>	 RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:07:02] <wikibugs>	 (CR) Jbond: P:scap::dsh: Add scap targets as a dsh group (2 comments) [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[21:07:06] <wikibugs>	 (PS1) Zabe: wmcs: stop accessing gu_enabled and gu_enabled_method in maintain-views [puppet] - https://gerrit.wikimedia.org/r/771462 (https://phabricator.wikimedia.org/T301674)
[21:07:37] <wikibugs>	 (PS3) Jbond: wmflib: add class_hosts [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559)
[21:07:49] <wikibugs>	 (PS4) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559)
[21:08:02] <wikibugs>	 (CR) Jbond: wmflib: add class_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[21:09:21] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34376/console" [puppet] - https://gerrit.wikimedia.org/r/771437 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[21:10:05] <wikibugs>	 (PS2) Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad [homer/public] - https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758)
[21:12:39] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34377/console" [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[21:14:50] <wikibugs>	 (PS5) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559)
[21:15:41] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34378/console" [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[21:17:07] <cjming>	 !log end running skin update preference maintenance script
[21:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:38] <wikibugs>	 (PS6) Razzi: karapace: add karapace role [puppet] - https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562)
[21:33:41] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34379/console" [puppet] - https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: Razzi)
[21:39:34] <wikibugs>	 (CR) Volans: [C: +1] "makes sense" [puppet] - https://gerrit.wikimedia.org/r/771453 (owner: Jbond)
[21:42:14] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[21:50:50] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[21:57:58] <wikibugs>	 (PS2) Zabe: Migrate wmfDbconfigFromEtcd to wmgDbconfigFromEtcd [mediawiki-config] - https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956)
[22:01:56] <wikibugs>	 (PS1) Zabe: Stop writing to $wmfDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956)
[22:05:55] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[22:07:15] <wikibugs>	 (PS1) Jdlrobson: Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104)
[22:08:04] <wikibugs>	 (CR) Clare Ming: [C: +1] Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson)
[22:09:10] <wikibugs>	 (Abandoned) Jbond: varnish: rate limit http://intake-analytics.wm.o/ [puppet] - https://gerrit.wikimedia.org/r/768028 (owner: Jbond)
[22:09:46] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:10:08] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: timed_out: False, active_shards: 290, active_shards_percent_as_number: 98.97610921501706, number_of_data_nodes: 2, number_of_nodes: 2, active_primary_shards: 163, delayed_unassigned_shards: 0, unassigned_shards: 1, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, initializing_shards: 2, number_of_in_f
[22:10:08] <icinga-wm>	 tch: 0, number_of_pending_tasks: 0, status: yellow, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:10:12] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: active_shards_percent_as_number: 98.97610921501706, initializing_shards: 2, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, status: yellow, active_primary_shards: 163, relocating_shards: 0, active_shards: 290, number_of_data_nodes:
[22:10:12] <icinga-wm>	 ter_name: relforge-eqiad, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:12:49] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[22:24:23] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (cmooney) Worth noting that we are planning in the short term to adjust t...
[22:33:28] <wikibugs>	 (PS1) Samtar: Throttle: Increase limit for English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016)
[22:34:35] <wikibugs>	 (PS2) Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:35:03] <wikibugs>	 (PS1) Jdlrobson: Fix updateUserLinksDropdownItems not being called [skins/Vector] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002)
[22:37:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:39:19] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016) (owner: Samtar)
[22:45:59] <wikibugs>	 (PS3) Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:47:48] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (Jclark-ctr) Schedule adding drives tomorrow 3/17/2022  4pm utc
[22:56:20] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:31] <wikibugs>	 (CR) Ebernhardson: elasticsearch: remove custom restart handling (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[23:16:00] <wikibugs>	 SRE, Traffic: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (AlexisJazz) >>! In T303903#7783673, @Aklapper wrote: >> You do not have permission to view this object. > Sorry, should work now.  Thanks, https://phabricator.wikimedia.org/P22736
[23:26:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:52:52] <tzatziki>	 !log Removing two files for legal compliance
[23:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log