[00:13:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) @Jclark-ctr mw1475 B6 19 1886 15 and mw1477 B6 21 1888 15 are both listed in port 15. Can you verify the ports again please. [00:15:45] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:26:27] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:33:18] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:40:56] (03Abandoned) 10Tim Starling: mcrouter mw-stats: make other write commands also async [puppet] - 10https://gerrit.wikimedia.org/r/807665 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [00:42:25] (03PS1) 10Tim Starling: Revert "mcrouter: Add stats route for fast increment" [puppet] - 10https://gerrit.wikimedia.org/r/808959 (https://phabricator.wikimedia.org/T310662) [00:42:57] (03PS2) 10Tim Starling: Revert "mcrouter: Add stats route for fast increment" [puppet] - 10https://gerrit.wikimedia.org/r/808959 (https://phabricator.wikimedia.org/T310662) [00:59:10] (03PS1) 10Cmjohnson: Adding new mw servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809041 (https://phabricator.wikimedia.org/T306121) [00:59:58] (03CR) 10Cmjohnson: [C: 03+2] Adding new mw servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809041 (https://phabricator.wikimedia.org/T306121) (owner: 10Cmjohnson) [01:00:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T0100) [01:05:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) [01:06:50] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:07:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) All network, mgmt, bios and operations setup has been completed. except for mw1477 and mw1496 which need sorting on-site. [01:08:06] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:16:01] (03CR) 10Krinkle: [C: 03+1] Revert "mcrouter: Add stats route for fast increment" [puppet] - 10https://gerrit.wikimedia.org/r/808959 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [01:36:06] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:46:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [02:07:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.18 [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809044 [02:07:31] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.18 [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809044 (owner: 10TrainBranchBot) [02:09:40] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:15:38] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::pdns: remove unused sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/799839 (owner: 10Majavah) [02:21:55] (03CR) 10Tim Starling: [C: 03+2] Revert "mcrouter: Add stats route for fast increment" [puppet] - 10https://gerrit.wikimedia.org/r/808959 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [02:22:27] (03CR) 
10Andrew Bogott: [C: 03+2] openstack::designate: enable TLS encryption for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/795356 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [02:22:32] (03PS2) 10Andrew Bogott: openstack::designate: enable TLS encryption for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/795356 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [02:23:10] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.18 [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809044 (owner: 10TrainBranchBot) [02:37:16] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:49:54] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:20:26] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:56] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:33:56] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:04:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:06:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:36:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:55] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:00:49] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:03:03] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:06:57] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:03] (03PS2) 10Marostegui: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/808801 (https://phabricator.wikimedia.org/T311033) [05:09:17] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:50] metawiki is read-only? 
[05:17:17] now seems fine [05:21:09] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:27] legoktm: I am moving slaves around, so there might be some seconds of lag from time to time [05:21:35] I am preparing for s7 switchover [05:22:05] gotcha :) /me keeps mashing the save button [05:22:11] XD [05:25:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/808801 (https://phabricator.wikimedia.org/T311033) (owner: 10Marostegui) [05:26:11] good luck with the master switch, I am off! [05:26:57] (03PS1) 10Marostegui: db1136: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809049 (https://phabricator.wikimedia.org/T311033) [05:29:19] !log dbmaint s6@codfw T298557 [05:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:25] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:48:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:50:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T0600). 
[06:00:11] o/ [06:00:13] o/ [06:00:15] Let's go [06:00:18] let's go [06:00:25] !log Starting s7 eqiad failover from db1136 to db1181 - T311033 [06:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:32] T311033: Switchover s7 master db1136 -> db1181 - https://phabricator.wikimedia.org/T311033 [06:01:00] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10MMandere) [06:01:17] can't edit now [06:01:39] all done [06:01:43] can edit [06:02:09] can you try something with centralauth? [06:02:37] sure [06:03:04] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/808802 (https://phabricator.wikimedia.org/T311033) (owner: 10Marostegui) [06:04:03] marostegui: works fine [06:04:25] Amir1: sweet thanks!!! [06:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:10:17] I just realised that dbctl isn't logging things [06:16:44] (03CR) 10Marostegui: [C: 03+2] db1136: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809049 (https://phabricator.wikimedia.org/T311033) (owner: 10Marostegui) [06:20:36] (03CR) 10Ayounsi: [C: 03+1] "Overall lgtm, some nits inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [06:41:49] (03PS1) 10Majavah: Allow DNS requests from GitLab runner containers [puppet] - 10https://gerrit.wikimedia.org/r/809085 (https://phabricator.wikimedia.org/T311241) [06:43:02] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36058/console" [puppet] - 10https://gerrit.wikimedia.org/r/809085 (https://phabricator.wikimedia.org/T311241) (owner: 10Majavah) [06:57:04] (03CR) 10Slyngshede: [C: 03+2] C:snapshot::dumps::timechecker convert cron to timer. [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:57:06] (03PS1) 10MMandere: geodns: map selected EU countries to drmrs [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T31142) [07:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:02] * urbanecm waves [07:01:13] can a nearby SRE restart logmsgbot please? looks it's awol. 
docs: [07:01:15] * https://wikitech.wikimedia.org/wiki/Logmsgbot#Restart [07:01:59] urbanecm: done [07:02:01] thanks [07:02:09] and welcome back logmsgbot :) [07:02:14] XD [07:02:37] !log dbmaint s7@eqiad T310011 [07:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:44] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:02:45] !log dbmaint s7@eqiad T302659 [07:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:50] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [07:05:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T309311)', diff saved to https://phabricator.wikimedia.org/P30508 and previous config saved to /var/cache/conftool/dbconfig/20220628-070525-ladsgroup.json [07:05:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:05:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:31] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [07:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30509 and previous config saved to /var/cache/conftool/dbconfig/20220628-071001-root.json [07:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:58] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T298557)', diff saved to https://phabricator.wikimedia.org/P30510 and previous config saved to /var/cache/conftool/dbconfig/20220628-071157-marostegui.json [07:12:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:12:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:12:03] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:12:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298557)', diff saved to https://phabricator.wikimedia.org/P30511 and previous config saved to /var/cache/conftool/dbconfig/20220628-071210-marostegui.json [07:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [07:14:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1142.eqiad.wmnet with reason: Maintenance [07:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T309311)', diff saved to https://phabricator.wikimedia.org/P30512 and previous config saved to /var/cache/conftool/dbconfig/20220628-071433-ladsgroup.json [07:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:40] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [07:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298557)', diff saved to https://phabricator.wikimedia.org/P30513 and previous config saved to /var/cache/conftool/dbconfig/20220628-072024-marostegui.json [07:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:31] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:21:39] (03PS1) 10Marostegui: install_server: Allow reimage db2153 - db2182 [puppet] - 10https://gerrit.wikimedia.org/r/809091 (https://phabricator.wikimedia.org/T306849) [07:23:21] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:25:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30514 and previous config saved to /var/cache/conftool/dbconfig/20220628-072505-root.json [07:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T309311)', diff saved to https://phabricator.wikimedia.org/P30515 and previous config saved to 
/var/cache/conftool/dbconfig/20220628-072534-ladsgroup.json [07:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:39] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [07:26:13] (03PS1) 10Marostegui: Revert "db1136: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/808962 [07:28:06] (03PS1) 10Slyngshede: C:apt enable thirdparty/hwraid for Bullseye hosts, via private repo. [puppet] - 10https://gerrit.wikimedia.org/r/809092 [07:29:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [07:35:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P30516 and previous config saved to /var/cache/conftool/dbconfig/20220628-073529-marostegui.json [07:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30517 and previous config saved to /var/cache/conftool/dbconfig/20220628-074009-root.json [07:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P30518 and previous config saved to /var/cache/conftool/dbconfig/20220628-074039-ladsgroup.json [07:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] (03PS2) 10Marostegui: install_server: Allow reimage db2153 - db2182 [puppet] - 10https://gerrit.wikimedia.org/r/809091 (https://phabricator.wikimedia.org/T306849) [07:46:02] (03CR) 10Marostegui: [C: 03+2] Revert "db1136: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/808962 (owner: 10Marostegui) [07:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30519 and previous config saved to /var/cache/conftool/dbconfig/20220628-074623-root.json [07:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:42] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow reimage db2153 - db2182 [puppet] - 10https://gerrit.wikimedia.org/r/809091 (https://phabricator.wikimedia.org/T306849) (owner: 10Marostegui) [07:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P30520 and previous config saved to /var/cache/conftool/dbconfig/20220628-075034-marostegui.json [07:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:49] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db2153 - db2182 [puppet] - 10https://gerrit.wikimedia.org/r/809091 (https://phabricator.wikimedia.org/T306849) (owner: 10Marostegui) [07:51:23] (03PS2) 10Filippo Giunchedi: smokeping: remove pfw some asw, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/808914 (https://phabricator.wikimedia.org/T169860) [07:51:52] (03CR) 10Filippo Giunchedi: smokeping: remove pfw some asw, moved to Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/808914 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30521 and previous config saved to /var/cache/conftool/dbconfig/20220628-075513-root.json [07:55:16] (03CR) 10Ayounsi: geodns: map selected EU countries to drmrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T31142) (owner: 10MMandere) 
[07:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P30522 and previous config saved to /var/cache/conftool/dbconfig/20220628-075544-ladsgroup.json [07:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:22] (03CR) 10Slyngshede: [C: 03+2] Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [07:57:44] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [08:01:07] (03CR) 10MMandere: geodns: map selected EU countries to drmrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T31142) (owner: 10MMandere) [08:01:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30523 and previous config saved to /var/cache/conftool/dbconfig/20220628-080128-root.json [08:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:23] (03CR) 10Ayounsi: geodns: map selected EU countries to drmrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T31142) (owner: 10MMandere) [08:04:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove absented diamond collector for puppet [puppet] - 10https://gerrit.wikimedia.org/r/808206 (owner: 10Muehlenhoff) [08:05:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298557)', diff saved to https://phabricator.wikimedia.org/P30524 and previous config saved to 
/var/cache/conftool/dbconfig/20220628-080539-marostegui.json [08:05:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:05:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:46] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30525 and previous config saved to /var/cache/conftool/dbconfig/20220628-080547-marostegui.json [08:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:17] (03CR) 10Ayounsi: [C: 03+1] "Great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/808914 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:08:24] (03CR) 10DCausse: Increase weights on the language selector statement boosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [08:09:23] (03PS1) 10Jbond: realm.pp: Add defaults for file [puppet] - 10https://gerrit.wikimedia.org/r/809095 [08:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30527 and previous config saved to /var/cache/conftool/dbconfig/20220628-081017-root.json [08:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T309311)', diff saved to https://phabricator.wikimedia.org/P30528 and previous config saved to /var/cache/conftool/dbconfig/20220628-081049-ladsgroup.json [08:10:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [08:10:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [08:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:55] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [08:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T309311)', diff saved to https://phabricator.wikimedia.org/P30529 and previous config saved to /var/cache/conftool/dbconfig/20220628-081057-ladsgroup.json [08:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:47] (03PS1) 10Slyngshede: Remove build dependencies which are currently not required. [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/809096 [08:16:21] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30530 and previous config saved to /var/cache/conftool/dbconfig/20220628-081632-root.json [08:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:15] (03CR) 10Muehlenhoff: C:apt enable thirdparty/hwraid for Bullseye hosts, via private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809092 (owner: 10Slyngshede) [08:27:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30531 and previous config saved to /var/cache/conftool/dbconfig/20220628-082755-marostegui.json [08:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:02] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:28:09] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:28:20] (03CR) 10Elukey: [C: 03+2] Add configuration for the ml-cache codfw Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/808907 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [08:28:26] (03PS5) 10Elukey: Add configuration for the ml-cache codfw Cassandra cluster [puppet] - 
10https://gerrit.wikimedia.org/r/808907 (https://phabricator.wikimedia.org/T302232) [08:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30532 and previous config saved to /var/cache/conftool/dbconfig/20220628-083136-root.json [08:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:02] (03PS2) 10Slyngshede: C:apt enable thirdparty/hwraid for Bullseye hosts, via private repo. [puppet] - 10https://gerrit.wikimedia.org/r/809092 [08:35:06] (03CR) 10Slyngshede: C:apt enable thirdparty/hwraid for Bullseye hosts, via private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809092 (owner: 10Slyngshede) [08:35:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/809092 (owner: 10Slyngshede) [08:36:12] (03CR) 10Slyngshede: [C: 03+2] C:apt enable thirdparty/hwraid for Bullseye hosts, via private repo. [puppet] - 10https://gerrit.wikimedia.org/r/809092 (owner: 10Slyngshede) [08:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P30533 and previous config saved to /var/cache/conftool/dbconfig/20220628-084300-marostegui.json [08:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:03] !log installing openssl security updates [08:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:26] (03CR) 10Muehlenhoff: [C: 03+1] "Great idea!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809095 (owner: 10Jbond) [08:46:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30534 and previous config saved to /var/cache/conftool/dbconfig/20220628-084640-root.json [08:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:35] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:32] (03CR) 10Volans: "addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [08:53:37] (03PS2) 10Volans: ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 [08:54:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, dh-python in fact depends on python3 itself already." [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/809096 (owner: 10Slyngshede) [08:55:04] (03CR) 10Slyngshede: [C: 03+2] Remove build dependencies which are currently not required. [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/809096 (owner: 10Slyngshede) [08:55:06] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Remove build dependencies which are currently not required. 
[debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/809096 (owner: 10Slyngshede) [08:57:45] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P30536 and previous config saved to /var/cache/conftool/dbconfig/20220628-085805-marostegui.json [08:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:20] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: remove pfw some asw, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/808914 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:01:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30537 and previous config saved to /var/cache/conftool/dbconfig/20220628-090144-root.json [09:01:45] PROBLEM - Check systemd state on ml-cache2002 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:33] (03CR) 10CI reject: [V: 04-1] ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [09:03:46] (03CR) 10David Caro: [C: 03+1] "Just a question to verify, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807143 (owner: 10Majavah) [09:04:03] PROBLEM - cassandra-a service on ml-cache2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:04:33] (03PS3) 10David Caro: toolforge:toolviews: Allow disabling toolviews in hiera [puppet] - 10https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [09:04:49] PROBLEM - Check systemd state on ml-cache2003 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:46] (03PS1) 10David Caro: openstack.galera: add nodecheck logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/809100 [09:11:21] (03CR) 10Vgutierrez: trafficserver: 9.x upgrade: remove wmf-tls log format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:11:49] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:55] RECOVERY - cassandra-a service on ml-cache2001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:12:00] (03CR) 10Majavah: sonofgridengine: grid_configurator: ignore non-ACTIVE instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807143 (owner: 10Majavah) [09:12:05] PROBLEM - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is CRITICAL: connect to address 10.192.16.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30538 and previous config saved to /var/cache/conftool/dbconfig/20220628-091229-root.json [09:12:32] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30539 and previous config saved to /var/cache/conftool/dbconfig/20220628-091310-marostegui.json [09:13:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:13:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30540 and previous config saved to /var/cache/conftool/dbconfig/20220628-091313-root.json [09:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:16] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30541 and previous config saved to /var/cache/conftool/dbconfig/20220628-091318-marostegui.json [09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T309311)', diff saved to https://phabricator.wikimedia.org/P30542 and previous config saved to /var/cache/conftool/dbconfig/20220628-091403-ladsgroup.json [09:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:09] 
T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [09:14:45] PROBLEM - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:14:47] (03CR) 10David Caro: [C: 03+2] toolforge:toolviews: Allow disabling toolviews in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [09:15:25] the cassandra alerts on ml-cache are related to bootstrapping work, nothing live [09:16:08] elukey: ack [09:16:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30543 and previous config saved to /var/cache/conftool/dbconfig/20220628-091649-root.json [09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:19] PROBLEM - cassandra-a service on ml-cache2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:17:21] (03PS2) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [09:18:00] (03CR) 10CI reject: [V: 04-1] c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [09:18:34] (03CR) 10David Caro: [C: 03+2] sonofgridengine: grid_configurator: ignore non-ACTIVE instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807143 (owner: 10Majavah) [09:19:10] (03CR) 10David Caro: [C: 03+2] toolforge:toolviews: Allow disabling toolviews in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [09:20:19] (03PS3) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment 
[puppet] - 10https://gerrit.wikimedia.org/r/807043 [09:20:21] PROBLEM - cassandra-a service on ml-cache2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:20:41] (03PS4) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [09:20:59] RECOVERY - cassandra-a service on ml-cache2002 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:25:09] PROBLEM - cassandra-a CQL 10.192.32.72:9042 on ml-cache2003 is CRITICAL: connect to address 10.192.32.72 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:27:05] PROBLEM - cassandra-a service on ml-cache2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:27:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30544 and previous config saved to /var/cache/conftool/dbconfig/20220628-092733-root.json [09:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:49] PROBLEM - cassandra-a SSL 10.192.32.72:7001 on ml-cache2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:27:53] RECOVERY - Check systemd state on ml-cache2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:15] RECOVERY - cassandra-a SSL 10.192.16.190:7001 on ml-cache2002 is OK: SSL OK - Certificate ml-cache2002-a valid until 2024-06-15 08:50:24 +0000 (expires in 717 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:28:17] RECOVERY - Check systemd state on ml-cache2002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30545 and previous config saved to /var/cache/conftool/dbconfig/20220628-092817-root.json [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:53] RECOVERY - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is OK: TCP OK - 0.032 second response time on 10.192.16.190 port 9042 https://phabricator.wikimedia.org/T93886 [09:29:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P30546 and previous config saved to /var/cache/conftool/dbconfig/20220628-092908-ladsgroup.json [09:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:18] (03CR) 10David Caro: "Got a question (probably ok), LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/763584 (owner: 10Majavah) [09:29:56] (03PS3) 10David Caro: Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:30:53] (03CR) 10CI reject: [V: 04-1] Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:32:10] RECOVERY - cassandra-a service on ml-cache2001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:56] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) I was able to bootstrap the cassandra ML cluster in codfw on Bullseye. The only odd thing is that the `cassandra` package, for some reason, ended up in `rc` state after some puppet runs a... 
[09:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30547 and previous config saved to /var/cache/conftool/dbconfig/20220628-093807-marostegui.json [09:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:13] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:39:36] RECOVERY - cassandra-a CQL 10.192.32.72:9042 on ml-cache2003 is OK: TCP OK - 0.033 second response time on 10.192.32.72 port 9042 https://phabricator.wikimedia.org/T93886 [09:42:02] RECOVERY - cassandra-a SSL 10.192.32.72:7001 on ml-cache2003 is OK: SSL OK - Certificate ml-cache2003-a valid until 2024-06-15 08:50:27 +0000 (expires in 717 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30548 and previous config saved to /var/cache/conftool/dbconfig/20220628-094237-root.json [09:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30549 and previous config saved to /var/cache/conftool/dbconfig/20220628-094321-root.json [09:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P30550 and previous config saved to /var/cache/conftool/dbconfig/20220628-094414-ladsgroup.json [09:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:30] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:45:37] (03CR) 10Majavah: [V: 03+1] toolsdb: enable pt-heartbeat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763584 (owner: 10Majavah) [09:51:50] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809127 [09:51:52] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809128 [09:53:10] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P30551 and previous config saved to /var/cache/conftool/dbconfig/20220628-095312-marostegui.json [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:18] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:57:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30552 and previous config saved to /var/cache/conftool/dbconfig/20220628-095741-root.json [09:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30553 and previous config saved to /var/cache/conftool/dbconfig/20220628-095825-root.json [09:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T309311)', diff saved to https://phabricator.wikimedia.org/P30554 and previous config saved to /var/cache/conftool/dbconfig/20220628-095919-ladsgroup.json [09:59:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [09:59:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [09:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:24] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [09:59:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T309311)', diff saved to https://phabricator.wikimedia.org/P30555 and previous config saved to /var/cache/conftool/dbconfig/20220628-095927-ladsgroup.json [09:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:09] (03CR) 10Volans: "I've refactored the patch a bit adding a couple of methods and objects that I found useful while making the patch for the cookbooks side o" [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [10:00:30] RECOVERY - cassandra-a service on ml-cache2002 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:00] (03PS3) 10Volans: ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 [10:04:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10dcaro) I'm not 100% sure, but I think it should be 
raid10-4dev (looking at ./modules/install_server/files/autoinstall/partman/). Maybe @nskaggs... [10:06:04] (03PS2) 10MMandere: geodns: map selected EU countries to drmrs [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T311472) [10:06:12] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10fgiunchedi) As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/808870 and thus the cc). I don't know if it is helpful at thi... [10:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P30556 and previous config saved to /var/cache/conftool/dbconfig/20220628-100817-marostegui.json [10:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:10:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T309311)', diff saved to https://phabricator.wikimedia.org/P30557 and previous config saved to /var/cache/conftool/dbconfig/20220628-101035-ladsgroup.json [10:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:42] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:11:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye [10:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: 
rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye [10:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30558 and previous config saved to /var/cache/conftool/dbconfig/20220628-101244-root.json [10:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30559 and previous config saved to /var/cache/conftool/dbconfig/20220628-101329-root.json [10:13:30] !log upgrading Ganeti test cluster to 3.0.2 [10:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:06] (03PS1) 10David Caro: p:toolforge:grid:{exec_environ,toloviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 [10:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30560 and previous config saved to /var/cache/conftool/dbconfig/20220628-102322-marostegui.json [10:23:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:23:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:29] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:23:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 
(T298557)', diff saved to https://phabricator.wikimedia.org/P30561 and previous config saved to /var/cache/conftool/dbconfig/20220628-102331-marostegui.json [10:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:36] (03CR) 10CI reject: [V: 04-1] p:toolforge:grid:{exec_environ,toloviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [10:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P30562 and previous config saved to /var/cache/conftool/dbconfig/20220628-102540-ladsgroup.json [10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30563 and previous config saved to /var/cache/conftool/dbconfig/20220628-102748-root.json [10:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30564 and previous config saved to /var/cache/conftool/dbconfig/20220628-102833-root.json [10:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:27] (03PS1) 10Muehlenhoff: Drop jackson-module-kotlin (experimental) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) [10:32:13] (03PS4) 10Volans: ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 [10:33:10] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:37] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/808908 (https://phabricator.wikimedia.org/T311386) [10:36:39] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/808909 (https://phabricator.wikimedia.org/T311386) [10:36:41] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php7.4 on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/808910 (https://phabricator.wikimedia.org/T311386) [10:36:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the maintenance server [puppet] - 10https://gerrit.wikimedia.org/r/808911 (https://phabricator.wikimedia.org/T311386) [10:36:45] (03PS2) 10Giuseppe Lavagetto: mediawiki: install php7.4 on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386) [10:39:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36061/console" [puppet] - 10https://gerrit.wikimedia.org/r/808908 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [10:40:03] (03PS1) 10Muehlenhoff: mariadb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809134 [10:40:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P30565 and previous config saved to /var/cache/conftool/dbconfig/20220628-104046-ladsgroup.json [10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:53] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1134 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30566 and previous config saved to /var/cache/conftool/dbconfig/20220628-104252-root.json [10:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P30567 and previous config saved to /var/cache/conftool/dbconfig/20220628-104337-root.json [10:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:42] (03PS5) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [10:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30568 and previous config saved to /var/cache/conftool/dbconfig/20220628-104631-marostegui.json [10:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:35] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:47:17] (03CR) 10CI reject: [V: 04-1] c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [10:49:53] (03PS6) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [10:51:27] (03PS1) 10Urbanecm: ProtectionFilter: Only make a query if we have valid tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809110 (https://phabricator.wikimedia.org/T311482) [10:52:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki: install php7.4 on the mwdebug servers [puppet] - 10https://gerrit.wikimedia.org/r/808908 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [10:54:26] (03CR) 10Ayounsi: [C: 
03+1] "That looks good to me, better to have someone else from Traffic to have a look as well." [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T311472) (owner: 10MMandere) [10:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T309311)', diff saved to https://phabricator.wikimedia.org/P30569 and previous config saved to /var/cache/conftool/dbconfig/20220628-105551-ladsgroup.json [10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:57] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:57:29] (03PS1) 10Volans: sre.ganeti.*: adapt to latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 [10:58:36] (03CR) 10Ayounsi: [C: 03+1] "Ship it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [11:01:00] (03CR) 10Muehlenhoff: c:Ganeti Ganeti Prometheus exporter deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P30570 and previous config saved to /var/cache/conftool/dbconfig/20220628-110136-marostegui.json [11:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:37] (03PS7) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:05:18] (03PS8) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:05:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: force apt-update to happen before installing php [puppet] - 10https://gerrit.wikimedia.org/r/809137 [11:08:02] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1010.eqiad.wmnet with OS bullseye [11:08:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors: - stat1010 (**FAIL**)... [11:08:29] (03CR) 10CI reject: [V: 04-1] mediawiki::php: force apt-update to happen before installing php [puppet] - 10https://gerrit.wikimedia.org/r/809137 (owner: 10Giuseppe Lavagetto) [11:09:27] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: force apt-update to happen before installing php [puppet] - 10https://gerrit.wikimedia.org/r/809137 [11:09:48] (03PS9) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:10:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36065/console" [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:10:50] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36064/console" [puppet] - 10https://gerrit.wikimedia.org/r/809137 (owner: 10Giuseppe Lavagetto) [11:11:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::php: force apt-update to happen before installing php [puppet] - 10https://gerrit.wikimedia.org/r/809137 (owner: 10Giuseppe Lavagetto) [11:13:38] (03PS2) 10Volans: sre.ganeti.*: adapt to latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 [11:16:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P30571 and previous config saved to /var/cache/conftool/dbconfig/20220628-111641-marostegui.json [11:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:55] 
(03PS10) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:18:11] (03CR) 10Muehlenhoff: c:Ganeti Ganeti Prometheus exporter deployment (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:18:48] (03CR) 10CI reject: [V: 04-1] c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:19:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36066/console" [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:19:16] !log installing squid security updates [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:20] (03PS1) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 [11:19:22] (03CR) 10JMeybohm: [C: 03+1] "Question inline, feel free to ignore" [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [11:19:52] (03PS11) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:20:01] (03PS1) 10Reedy: Add missing file from guzzlehttp/psr7 update [vendor] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809113 [11:20:16] (03CR) 10JMeybohm: [C: 03+1] sre: pretty-format mediawiki.yaml [alerts] - 10https://gerrit.wikimedia.org/r/797314 (owner: 10Giuseppe Lavagetto) [11:20:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36067/console" [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:21:23] (03CR) 10Reedy: [C: 03+2] Add missing file from guzzlehttp/psr7 update [vendor] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809113 (owner: 10Reedy) [11:23:24] (03PS2) 10Jcrespo: delete-media-file: Add failsafe 
to file deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) [11:23:58] (03CR) 10CI reject: [V: 04-1] delete-media-file: Add failsafe to file deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [11:24:30] (03CR) 10CI reject: [V: 04-1] global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [11:26:15] (03CR) 10JMeybohm: sre: add alerting for mediawiki on k8s (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [11:26:54] (03CR) 10EllenR: [C: 03+1] "looks removed to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808936 (https://phabricator.wikimedia.org/T311429) (owner: 10Eigyan) [11:28:18] (03CR) 10EllenR: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) (owner: 10Eigyan) [11:31:32] (03PS3) 10Jcrespo: delete-media-file: Add failsafe to file deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) [11:31:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) It looks like this might be an instance of the bug identified in {T304483} I'm downgrading the NIC firmware to the previous version and then I will run the cookbook agai... 
[11:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298557)', diff saved to https://phabricator.wikimedia.org/P30572 and previous config saved to /var/cache/conftool/dbconfig/20220628-113146-marostegui.json [11:31:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:31:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:51] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:31:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298557)', diff saved to https://phabricator.wikimedia.org/P30573 and previous config saved to /var/cache/conftool/dbconfig/20220628-113154-marostegui.json [11:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:36:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10fgiunchedi) >>! In T307399#8027872, @Cmjohnson wrote: > @btullis @robh was working on this last wee. /dev/sda and /dev/sdb are swapped by the controller regardless of how they we... 
[11:38:00] (03PS2) 10Jbond: Add dotfiles for brennen [puppet] - 10https://gerrit.wikimedia.org/r/809037 (owner: 10Brennen Bearnes) [11:38:02] (03PS1) 10Jbond: rake spdx: skip user home files [puppet] - 10https://gerrit.wikimedia.org/r/809143 [11:38:17] (03Merged) 10jenkins-bot: Add missing file from guzzlehttp/psr7 update [vendor] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809113 (owner: 10Reedy) [11:39:21] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:36] (03CR) 10Jcrespo: [C: 03+2] Add new script delete-media-file to delete backed up files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808013 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [11:39:46] (03CR) 10Jcrespo: [C: 03+2] delete-media-file: Add failsafe to file deletion [software/mediabackups] - 10https://gerrit.wikimedia.org/r/808314 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [11:40:08] (03PS12) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:40:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298557)', diff saved to https://phabricator.wikimedia.org/P30574 and previous config saved to /var/cache/conftool/dbconfig/20220628-114020-marostegui.json [11:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:26] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:40:39] (03CR) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:41:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36068/console" [puppet] - 
10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:43:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:44:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye [11:45:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye [11:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:02] (03PS3) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) [11:47:55] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:47:55] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:49:02] (03PS1) 10ArielGlenn: give dumps co-maintainer root on labstore and other dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/809144 (https://phabricator.wikimedia.org/T302145) [11:49:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:50:06] (03PS1) 10Jcrespo: Prepare for 0.1.2 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809145 (https://phabricator.wikimedia.org/T311215) [11:50:28] (03CR) 10Muehlenhoff: "One final nit, looks good." [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:51:17] (03PS13) 10Slyngshede: c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 [11:51:27] (03CR) 10Jbond: [C: 03+2] rake spdx: skip user home files [puppet] - 10https://gerrit.wikimedia.org/r/809143 (owner: 10Jbond) [11:51:30] (03CR) 10Jbond: [C: 03+2] Add dotfiles for brennen [puppet] - 10https://gerrit.wikimedia.org/r/809037 (owner: 10Brennen Bearnes) [11:54:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [11:55:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P30575 and previous config saved to /var/cache/conftool/dbconfig/20220628-115525-marostegui.json [11:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:46] (03CR) 10Jbond: [C: 03+1] "LGTM, just missing the spdx headers. 
may want to consider installing utils/hooks/pre-commit" [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [11:57:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [11:59:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809144 (https://phabricator.wikimedia.org/T302145) (owner: 10ArielGlenn) [12:00:47] (03CR) 10Slyngshede: [C: 03+2] c:Ganeti Ganeti Prometheus exporter deployment [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [12:00:51] (03PS2) 10ArielGlenn: make sure various dump related scripts send email when they run [puppet] - 10https://gerrit.wikimedia.org/r/808890 (https://phabricator.wikimedia.org/T273673) [12:01:08] (03CR) 10Slyngshede: [C: 03+2] c:Ganeti Ganeti Prometheus exporter deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807043 (owner: 10Slyngshede) [12:01:40] (03CR) 10ArielGlenn: [C: 03+2] make sure various dump related scripts send email when they run [puppet] - 10https://gerrit.wikimedia.org/r/808890 (https://phabricator.wikimedia.org/T273673) (owner: 10ArielGlenn) [12:07:28] (03PS2) 10Jcrespo: Prepare for 0.1.2 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809145 (https://phabricator.wikimedia.org/T311215) [12:09:08] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8032293, @elukey wrote: > I was able to bootstrap the cassandra ML cluster in codfw on Bullseye. The only odd thing is that the `cassandra` package, for some rea... 
[12:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P30576 and previous config saved to /var/cache/conftool/dbconfig/20220628-121030-marostegui.json [12:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] (03PS1) 10Klausman: ml-staging: Add inference services for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) [12:12:06] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10LSobanski) [12:21:15] (03PS2) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 [12:25:34] (03CR) 10CI reject: [V: 04-1] global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [12:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298557)', diff saved to https://phabricator.wikimedia.org/P30577 and previous config saved to /var/cache/conftool/dbconfig/20220628-122535-marostegui.json [12:25:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:25:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:42] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298557)', diff saved to https://phabricator.wikimedia.org/P30578 and previous config saved to /var/cache/conftool/dbconfig/20220628-122543-marostegui.json [12:25:45] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:35] (03PS1) 10Elukey: Add ml-staging-codfw among the helmfile envs to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/809149 (https://phabricator.wikimedia.org/T302195) [12:32:23] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) I wrote a script to put some load (tbh, it wasn't much load when I ran it but I can turn the knob as high as you want) a... [12:34:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298557)', diff saved to https://phabricator.wikimedia.org/P30579 and previous config saved to /var/cache/conftool/dbconfig/20220628-123424-marostegui.json [12:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:30] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:34:57] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:35:51] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10MoritzMuehlenhoff) We discussed this in yesterday's SRE IF meeting: Let's start by adding sudo permissions for the three cookbooks listed, homer be im... 
[12:37:52] (03PS3) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 [12:38:33] (03CR) 10Vgutierrez: [C: 03+1] "patch looks good and properly reflects the data gathered on the spreadsheet" [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T311472) (owner: 10MMandere) [12:40:02] (03PS1) 10Lucas Werkmeister (WMDE): Use LanguageSelectorStatementBoost instead of its plurar form [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809118 (https://phabricator.wikimedia.org/T307869) [12:40:14] I’m sneaking in an early backport ^ [12:40:17] (03CR) 10Elukey: "A couple of notes:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [12:40:28] (won’t be available during the proper backport window in 20 minutes, unfortunately) [12:40:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use LanguageSelectorStatementBoost instead of its plurar form [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809118 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [12:41:34] (03CR) 10CI reject: [V: 04-1] global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [12:43:15] (03PS1) 10Jcrespo: delete-media-backups: Default to dry-mode for deletions [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809152 (https://phabricator.wikimedia.org/T311215) [12:48:26] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@68e7c64]: Deploying and enabling datahub ingestion jobs [12:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:35] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@68e7c64]: Deploying and enabling datahub ingestion jobs (duration: 00m 09s) [12:48:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:54] (03PS4) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 [12:49:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) Hi @fgiunchedi - Thanks, I agree that it would be a pain to have to deviate from using `/dev/sda` for the primary OS drive. At the moment I have copied the only previous... [12:49:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P30580 and previous config saved to /var/cache/conftool/dbconfig/20220628-124929-marostegui.json [12:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:49] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808936 (https://phabricator.wikimedia.org/T311429) (owner: 10Eigyan) [12:51:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) [12:54:09] (03CR) 10Filippo Giunchedi: [C: 03+2] am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) (owner: 10Filippo Giunchedi) [12:54:14] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) (owner: 10Filippo Giunchedi) [12:54:16] (03CR) 10CI reject: [V: 04-1] global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [12:54:45] (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/809155 [12:55:30] !log sukhe@puppetmaster1001 conftool 
action : set/pooled=yes; selector: name=cp5012.eqsin.wmnet,service=ats-be [12:55:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5012.eqsin.wmnet,service=varnish-fe [12:55:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5012.eqsin.wmnet,service=ats-tls [12:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:48] (03CR) 10Volans: [C: 03+2] ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [12:56:01] (03PS2) 10Eigyan: [beta]: Remove GDI quick survey from EN,ES wikis - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808936 (https://phabricator.wikimedia.org/T311429) [12:56:15] (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Safety Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) [12:56:23] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10ssingh) Thanks for all the help @RobH! Marking this as resolved as the host is now pooled. 
[12:56:33] (03CR) 10Klausman: [C: 03+1] Add ml-staging-codfw among the helmfile envs to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/809149 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [12:56:35] 10SRE, 10ops-eqsin, 10Traffic: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (10ssingh) 05Open→03Resolved [12:57:29] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1010.eqiad.wmnet with OS bullseye [12:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors: - stat1010 (**FAIL**)... [12:57:40] (03CR) 10Volans: redfish: add a fqdn getter property and __str__ method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [12:57:44] (03PS2) 10Klausman: ml-staging: Add inference services for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) [12:57:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye [12:58:28] (03Merged) 10jenkins-bot: Use LanguageSelectorStatementBoost instead of its plurar form [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809118 (https://phabricator.wikimedia.org/T307869) 
(owner: 10Lucas Werkmeister (WMDE)) [12:58:41] (03CR) 10Klausman: ml-staging: Add inference services for testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [12:59:00] (03PS1) 10Ssingh: admin: add TBurmeister to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/809156 (https://phabricator.wikimedia.org/T311453) [12:59:19] (03CR) 10Volans: [C: 03+1] "nit inline, LGTM otherwise" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [12:59:33] (03CR) 10Ssingh: [C: 03+1] admin: add gitlab-roots group to gitlab_runner role [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [12:59:38] Greetings Everyone! o/ [12:59:44] (03PS1) 10Muehlenhoff: Inline ganeti::kvm [puppet] - 10https://gerrit.wikimedia.org/r/809157 [12:59:50] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/809155 (owner: 10Jgiannelos) [13:00:04] FYI, I’m doing a `git fetch` in wmf.17 on the deployment host and it’s fetching *all* the REL branches (because of the security patches that got pushed) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T1300). [13:00:05] eigyan, eigyan, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T1300) [13:00:15] just in case anyone is wondering later why *they’re* not seeing all that fetch output :D [13:00:17] hello! 
[13:00:53] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [13:01:16] I can’t deploy today, sadly [13:01:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:22] just waiting for a last-minute scap to finish [13:01:26] I can deploy today [13:01:30] thanks urbanecm! [13:01:33] Lucas_WMDE: are you scap'ing now? [13:01:41] urbanecm: cool. ty! [13:01:44] I’ll ping you when it’s done [13:01:46] ack [13:01:50] let me review patches now [13:02:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:02:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:16] (03CR) 10Urbanecm: [C: 03+2] ProtectionFilter: Only make a query if we have valid tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809110 (https://phabricator.wikimedia.org/T311482) (owner: 10Urbanecm) [13:02:25] (03CR) 10Ssingh: [C: 03+2] admin: add TBurmeister to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/809156 (https://phabricator.wikimedia.org/T311453) (owner: 10Ssingh) [13:02:31] (03CR) 10Urbanecm: [C: 03+2] [beta]: Remove GDI quick survey from EN,ES wikis - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808936 (https://phabricator.wikimedia.org/T311429) (owner: 10Eigyan) [13:02:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:59] (03Merged) 10jenkins-bot: mobileapps: Bump to latest image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/809155 (owner: 10Jgiannelos) [13:03:28] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/WikibaseCirrusSearch/src/Hooks.php: Backport: [[gerrit:809118|Use LanguageSelectorStatementBoost instead of its plurar form (T307869)]] (duration: 03m 35s) [13:03:33] urbanecm: all done, you’re good to go – sorry it took a bit longer than expected [13:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:34] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [13:03:48] (03Merged) 10jenkins-bot: [beta]: Remove GDI quick survey from EN,ES wikis - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808936 (https://phabricator.wikimedia.org/T311429) (owner: 10Eigyan) [13:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P30581 and previous config saved to /var/cache/conftool/dbconfig/20220628-130434-marostegui.json [13:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1458.eqiad.wmnet with OS buster [13:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1458.eqiad.wmnet with OS buster [13:06:12] (03PS5) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 [13:06:27] (03Merged) 10jenkins-bot: ganeti: refactor Ganeti to support the new model [software/spicerack] - 10https://gerrit.wikimedia.org/r/808897 (owner: 10Volans) [13:06:58] (03CR) 10Volans: "Couple of 
questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [13:07:44] Lucas_WMDE: thanks [13:07:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Triciaburmeister - https://phabricator.wikimedia.org/T311453 (10ssingh) 05Open→03Resolved a:03ssingh Added uid `tburmeister` to both `wmf` and `wmf-nda`. You should have access now, please feel free to reopen if there are any is... [13:09:01] !log deploy prometheus-icinga-exporter 0.20 - T310331 [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:06] T310331: Implement retries for prometheus-icinga-am on empty CGI body - https://phabricator.wikimedia.org/T310331 [13:09:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:14] (03CR) 10CI reject: [V: 04-1] global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [13:11:19] (03PS1) 10Slyngshede: C:profile::ganeti unbreak Puppet on Ganeti nodes. [puppet] - 10https://gerrit.wikimedia.org/r/809158 [13:12:45] I'm not sure whether there are any pre-requirements for quick survey deployment on enwiki. IIRC there is some performance penalty, but not sure [13:12:50] eigyan: do you happen to know? 
[13:13:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) [13:13:55] hi urbanecm for beta i usually do not test but for prod I test on one of the mwdebug machines [13:14:45] (03CR) 10Slyngshede: [C: 03+2] C:profile::ganeti unbreak Puppet on Ganeti nodes. [puppet] - 10https://gerrit.wikimedia.org/r/809158 (owner: 10Slyngshede) [13:14:54] eigyan: sorry, i probably wrote my message in an unclear way. i meant that I'm wondering whether there are any approvals needed for enabling QuickSurveys extension at additional wikis. [13:15:09] I can't find anything about it, but...i do feel like i read it somewhere [13:15:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1458.eqiad.wmnet with reason: host reimage [13:15:48] urbanecm I do not recall the additional approvals [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1457.eqiad.wmnet with OS buster [13:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1457.eqiad.wmnet with OS buster [13:16:10] okay. considering deployment of this was attempted before as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808045/, let's just go ahead [13:16:13] i probably misremember [13:16:25] (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Deploy GDI Safety Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) (owner: 10Eigyan) [13:16:34] urbanecm sounds good! [13:16:41] and thank you! 
[13:17:01] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:17:02] (03PS3) 10Urbanecm: [wmf-config]: Deploy GDI Safety Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) (owner: 10Eigyan) [13:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:09] (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Deploy GDI Safety Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) (owner: 10Eigyan) [13:17:30] the beta patch will soon be deployed to beta automatically, ftr [13:17:31] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:43] urandom Excellent [13:18:04] (03PS3) 10Urbanecm: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [13:18:09] urbanecm Thanks! [13:18:11] np [13:18:29] (03Merged) 10jenkins-bot: [wmf-config]: Deploy GDI Safety Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808975 (https://phabricator.wikimedia.org/T311434) (owner: 10Eigyan) [13:18:37] (03CR) 10Vgutierrez: "I'd keep the explicit owner/group parameters for those files being populated with secret() and also for directories that contains those fi" [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [13:18:42] (03CR) 10MMandere: [C: 03+2] geodns: map selected EU countries to drmrs (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/809088 (https://phabricator.wikimedia.org/T311472) (owner: 10MMandere) [13:18:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1458.eqiad.wmnet with reason: host reimage [13:18:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:19] eigyan: pulled to mwdebug1001, please test [13:19:34] urbanecm will do [13:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298557)', diff saved to https://phabricator.wikimedia.org/P30582 and previous config saved to /var/cache/conftool/dbconfig/20220628-131939-marostegui.json [13:19:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:44] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:19:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:19:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [13:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:51] !log update primary dcs for AD,AL,BY,CH,GI,IT,LI,MT,SK to drmrs - T311472 [13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [13:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:19:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:20:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] T311472: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) a:05Cmjohnson→03BTullis I'm just claiming this ticket and putting it on our team's workboard to reflect the fact that I'm working on it... [13:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:12] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:23] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:01] (03CR) 10Jbond: redfish: add wait for reboot function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [13:21:10] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:03] (03Merged) 10jenkins-bot: ProtectionFilter: Only make a query if we have valid tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809110 (https://phabricator.wikimedia.org/T311482) (owner: 10Urbanecm) [13:23:53] eigyan: how is the testing going? 
[13:24:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/809157 (owner: 10Muehlenhoff) [13:24:19] so far I have validated eswiki looking at other wikis [13:24:20] sergi0: wmf.18 is not yet at deployment server, so there is nothing to do :) [13:24:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:25:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:25] eigyan: ack, thanks. [13:25:41] (03PS1) 10Jgiannelos: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/809159 [13:25:44] urbanecm: right, thanks for taking care of updating wmf.18! [13:25:51] no problem [13:26:02] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:26:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] (03PS1) 10Slyngshede: C:ganeti enable ganeti prometheus exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/809160 (https://phabricator.wikimedia.org/T311288) [13:26:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1457.eqiad.wmnet with reason: host reimage [13:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/809160 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [13:29:09] (03CR) 10Slyngshede: [C: 03+2] C:ganeti enable ganeti prometheus exporter. [puppet] - 10https://gerrit.wikimedia.org/r/809160 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [13:29:46] (03CR) 10Tchanders: similar-users: make max queries per account configurable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) (owner: 10Hnowlan) [13:29:50] (03CR) 10JMeybohm: [C: 03+1] Add ml-staging-codfw among the helmfile envs to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/809149 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [13:31:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1457.eqiad.wmnet with reason: host reimage [13:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:33:04] (03PS2) 10Jcrespo: delete-media-backups: Default to dry-mode for deletions [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809152 
(https://phabricator.wikimedia.org/T311215) [13:33:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:33:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1458.eqiad.wmnet with OS buster [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1458.eqiad.wmnet with OS buster completed: - mw1458 (**PASS**) -... [13:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:34:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:05] PROBLEM - Check systemd state on ganeti4002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:10] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/809159 (owner: 10Jgiannelos) [13:35:34] eigyan: how is it going? [13:36:27] (03PS1) 10Zabe: mailmap: add a few entries [puppet] - 10https://gerrit.wikimedia.org/r/809163 [13:36:36] urbanecm I have validated es,fr and pt wikis, however en and fa wiki are still not showing up... 
[13:37:08] urbanecm I should say the quicksurvey is not loading for fa and en wiki for some reason [13:37:18] (03PS1) 10Slyngshede: C:ganeti::prometheus fix config path [puppet] - 10https://gerrit.wikimedia.org/r/809164 [13:37:42] (03PS2) 10ArielGlenn: give dumps co-maintainer root on labstore and other dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/809144 (https://phabricator.wikimedia.org/T302145) [13:38:50] (03Merged) 10jenkins-bot: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/809159 (owner: 10Jgiannelos) [13:39:04] (03CR) 10Slyngshede: [C: 03+2] C:ganeti::prometheus fix config path [puppet] - 10https://gerrit.wikimedia.org/r/809164 (owner: 10Slyngshede) [13:39:07] (03CR) 10Muehlenhoff: "Changes to access groups are discussed in the weekly SRE IF meeting (the last one was yesterday, can this wait until next Monday?)" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [13:39:50] eigyan: i see. i think we have to make a decision now -- AFAICS, it doesn't break anything, so i think we can just deploy it and debug later [13:40:00] alternatively, we can revert [13:40:03] let me know what you prefer [13:40:29] urbanecm let's let it remain please so I can investigate this :) many thanks! 
[13:40:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:11] PROBLEM - Check systemd state on ganeti5003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:17] eigyan: does that mean "let's sync and investigate later"? B&C windows themselves aren't meant for investigation, that's why I'm asking. [13:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1462.eqiad.wmnet with OS buster [13:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1461.eqiad.wmnet with OS buster [13:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1466.eqiad.wmnet with OS buster [13:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1465.eqiad.wmnet with OS buster [13:41:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1459.eqiad.wmnet with OS buster [13:41:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1460.eqiad.wmnet with OS buster [13:41:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1464.eqiad.wmnet with OS buster [13:41:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1463.eqiad.wmnet with OS buster [13:41:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1467.eqiad.wmnet with OS buster [13:41:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1468.eqiad.wmnet with OS buster [13:41:26] 10SRE, 
10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1462.eqiad.wmnet with OS buster [13:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1461.eqiad.wmnet with OS buster [13:41:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1466.eqiad.wmnet with OS buster [13:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:38] urbanecm lets sync and investigate please [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1465.eqiad.wmnet with OS buster [13:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1459.eqiad.wmnet with OS buster [13:41:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1460.eqiad.wmnet with OS buster [13:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1463.eqiad.wmnet with OS buster [13:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1467.eqiad.wmnet with OS buster [13:42:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1469.eqiad.wmnet with OS buster [13:42:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1464.eqiad.wmnet with OS buster [13:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1468.eqiad.wmnet with OS buster [13:42:21] 10SRE, 10ops-eqiad, 
10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1469.eqiad.wmnet with OS buster [13:42:29] (03PS1) 10Sbisson: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) [13:42:32] eigyan: sounds good. Deploying [13:42:38] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [13:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] urbanecm Excellent [13:42:44] (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/809166 [13:43:37] PROBLEM - Check systemd state on ganeti6004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:52] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:25] !log upload anycast-healthchecker 0.8.2-1wm1 to apt.wm.o (buster) - T310574 [13:44:27] (03PS1) 10ArielGlenn: fix name of the send mail parameter for dump script converted to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/809167 [13:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:30] T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [13:45:05] (03CR) 10Jcrespo: [C: 03+2] delete-media-backups: Default to dry-mode for deletions [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809152 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [13:45:15] (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.1.2 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809145 
(https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [13:45:53] !log urbanecm@deploy1002 scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:01] upps [13:46:14] reverting [13:46:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REVERT: 3ef8aaf5b1f77ce1f4d3e3ae71ed633b6f930f61: Deploy GDI Safety Survey Wave 2 (T311434) (duration: 00m 32s) [13:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:51] T311434: [2nd Attempt] - Deploy GDI Safety Survey Wave 2 on EN, ES, FA, FR, and PT wikis - PROD - https://phabricator.wikimedia.org/T311434 [13:47:03] a lot of PHP Warning: array_key_exists() expects parameter 2 to be array, boolean given [13:47:21] (03PS1) 10Urbanecm: Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809169 [13:47:23] (03CR) 10Urbanecm: [C: 03+2] Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809169 (owner: 10Urbanecm) [13:47:28] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809169 (owner: 10Urbanecm) [13:47:51] urbanecm oh wow...this is a bummer...thanks I will have to see what is happening thank you [13:47:55] eigyan: I've reverted the patch, as it causes a lot of errors. please feel free to reschedule with a new version [13:48:09] yes indeed..thanks! 
[13:48:15] urbanecm [13:48:19] no problem eigyan :) [13:48:23] T311271 [13:48:24] T311271: PHP Warning: array_key_exists() expects parameter 2 to be array, float given - https://phabricator.wikimedia.org/T311271 [13:48:27] uh [13:48:40] i reverted the wrong patch, didn't i [13:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:49:01] yes [13:49:13] looks like it in the git log [13:49:19] weird [13:49:21] trying again [13:49:33] (03CR) 10Marostegui: [C: 03+1] mariadb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809134 (owner: 10Muehlenhoff) [13:49:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REVERT: 3ef8aaf5b1f77ce1f4d3e3ae71ed633b6f930f61: Deploy GDI Safety Survey Wave 2 (T311434) (duration: 00m 31s) [13:49:42] (03CR) 10Hokwelum: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809167 (owner: 10ArielGlenn) [13:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:44] but the spike in logstash also went away again [13:49:56] reverting anyway [13:50:05] (03PS1) 10Urbanecm: Revert "[wmf-config]: Deploy GDI Safety Survey Wave 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809170 [13:50:06] urbanecm the beta patch should be fine but no worries [13:50:10] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [13:50:13] (03CR) 10ArielGlenn: [C: 03+2] fix name of the send mail parameter for dump script converted to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/809167 (owner: 10ArielGlenn) [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:20] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "[wmf-config]: Deploy GDI Safety Survey Wave 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809170 (owner: 10Urbanecm) [13:50:42] (03PS1) 10Urbanecm: Revert "Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809121 [13:50:48] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809121 (owner: 10Urbanecm) [13:50:56] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Revert "[beta]: Remove GDI quick survey from EN,ES wikis - BETA"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809121 (owner: 10Urbanecm) [13:51:41] running a wmf-config sync-file, just in case [13:51:43] (w/o --force) [13:51:47] @urba [13:51:47] PROBLEM - Check systemd state on ganeti3002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:04] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply 
[13:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:11] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:15] sounds like a good idea [13:52:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1466.eqiad.wmnet with reason: host reimage [13:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage [13:52:27] eigyan: so, i unreverted the beta patch, reverted the prod patch and hopefully all'll be good [13:52:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage [13:52:28] urbanecm is it possible that  Remove GDI quick survey from EN,ES wikis - BETA can remain [13:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage [13:52:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1467.eqiad.wmnet with reason: host reimage [13:52:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1459.eqiad.wmnet with reason: host reimage [13:52:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1461.eqiad.wmnet with reason: host reimage [13:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] eigyan: yes, i unreverted that one [13:52:38] !log cmjohnson@cumin1001 START - 
Cookbook sre.hosts.downtime for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage [13:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1468.eqiad.wmnet with reason: host reimage [13:52:44] it was an accident on my part, i didn't mean to revert it [13:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] urbanecm awesome thanks! [13:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1469.eqiad.wmnet with reason: host reimage [13:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:27] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:53:44] urbanecm np [13:54:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:27] PROBLEM - Check systemd state on ganeti5002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:42] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:54] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage [13:54:54] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1469.eqiad.wmnet with reason: host reimage [13:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] !log urbanecm@deploy1002 Synchronized wmf-config/: ensuring wmf-config is up2date at appservers (duration: 03m 30s) [13:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:55:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] eigyan: should be all done. talk to you later! 
[13:55:31] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v3.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/809166 (owner: 10Volans) [13:55:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1466.eqiad.wmnet with reason: host reimage [13:55:38] urbanecm thank you for all your help :) [13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:53] (03CR) 10ArielGlenn: [C: 03+2] give dumps co-maintainer root on labstore and other dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/809144 (https://phabricator.wikimedia.org/T302145) (owner: 10ArielGlenn) [13:55:58] no problem [13:57:11] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [13:57:14] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1468.eqiad.wmnet with reason: host reimage [13:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:15] (03PS4) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [13:58:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:59:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:32] (03PS3) 10Jbond: redfish: add a generation property 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 [13:59:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1457.eqiad.wmnet with OS buster [13:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:43] (03CR) 10Jbond: redfish: add a generation property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [13:59:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1457.eqiad.wmnet with OS buster completed: - mw1457 (**PASS**) -... [13:59:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage [13:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1460.eqiad.wmnet with reason: host reimage [13:59:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1459.eqiad.wmnet with reason: host reimage [13:59:59] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1469.eqiad.wmnet with OS buster [14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1469.eqiad.wmnet with OS buster executed with errors: - mw1469 (**F... 
[14:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:32] (03PS5) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [14:01:21] (03PS1) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) [14:01:25] (03PS1) 10Zabe: dbbackups: remove absented dumps-sections cron [puppet] - 10https://gerrit.wikimedia.org/r/809172 (https://phabricator.wikimedia.org/T273673) [14:01:51] (03CR) 10CI reject: [V: 04-1] dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:02:11] (03PS2) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) [14:02:13] (03CR) 10CI reject: [V: 04-1] dbbackups: remove absented dumps-sections cron [puppet] - 10https://gerrit.wikimedia.org/r/809172 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:02:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1467.eqiad.wmnet with reason: host reimage [14:02:28] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage [14:02:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage [14:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1461.eqiad.wmnet with reason: host reimage [14:02:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:46] (03CR) 10CI reject: [V: 04-1] dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:03:40] (03PS3) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) [14:03:49] (03CR) 10Elukey: [C: 03+2] Add ml-staging-codfw among the helmfile envs to test [deployment-charts] - 10https://gerrit.wikimedia.org/r/809149 (https://phabricator.wikimedia.org/T302195) (owner: 10Elukey) [14:03:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:04:42] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/809166 (owner: 10Volans) [14:04:53] (03PS4) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 [14:04:55] (03PS3) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 [14:04:56] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1459.eqiad.wmnet with OS buster [14:04:57] (03PS3) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [14:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1459.eqiad.wmnet with OS buster executed with errors: - mw1459 (**F... [14:05:24] (03PS3) 10Elukey: ml-staging: Add inference services for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:06:09] (03PS6) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [14:06:15] (03PS7) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [14:06:39] RECOVERY - Check systemd state on ganeti5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:23] RECOVERY - Check systemd state on ganeti5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:39] (03PS1) 10Volans: Upstream release v3.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/809173 [14:07:56] (03CR) 10Volans: [C: 03+2] Upstream release v3.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/809173 (owner: 10Volans) [14:08:05] (03PS5) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 [14:08:15] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! In T310980#8032744, @MoritzMuehlenhoff wrote: >>>! In T310980#8032293, @elukey wrote: [ ... ] >> Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 2.7) | python2... 
[14:08:25] (03PS4) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 [14:08:29] (03PS4) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [14:09:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew) Yep, everything together in one raid sounds good to me. [14:09:26] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1465.eqiad.wmnet with OS buster [14:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1465.eqiad.wmnet with OS buster executed with errors: - mw1465 (**F... 
[14:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:14:11] RECOVERY - Check systemd state on ganeti4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1463.eqiad.wmnet with OS buster [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1463.eqiad.wmnet with OS buster completed: - mw1463 (**PASS**) -... 
[14:18:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:18:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1467.eqiad.wmnet with OS buster [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:05] (03Merged) 10jenkins-bot: Upstream release v3.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/809173 (owner: 10Volans) [14:19:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1467.eqiad.wmnet with OS buster completed: - mw1467 (**PASS**) -... [14:20:21] (03PS1) 10Papaul: Add new PDU model to ps1-d1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/809177 (https://phabricator.wikimedia.org/T310146) [14:20:56] (03PS1) 10Zabe: dumps: remove absented dumps-fetches-wikitech cron [puppet] - 10https://gerrit.wikimedia.org/r/809178 (https://phabricator.wikimedia.org/T273673) [14:22:41] (03PS1) 10Zabe: snapshot: remove absented dumps-timechecker cron [puppet] - 10https://gerrit.wikimedia.org/r/809179 (https://phabricator.wikimedia.org/T273673) [14:24:07] RECOVERY - Check systemd state on ganeti3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:27] (03PS1) 10Zabe: logster: remove absented logster- cron [puppet] - 10https://gerrit.wikimedia.org/r/809181 (https://phabricator.wikimedia.org/T273673) [14:24:52] !log uploaded spicerack_3.0.0 to apt.wikimedia.org bullseye-wikimedia [14:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:26:35] RECOVERY - Check systemd state on ganeti6004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1466.eqiad.wmnet with OS buster [14:26:41] (03CR) 10Ahmon Dancy: [C: 03+1] "Looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/809085 (https://phabricator.wikimedia.org/T311241) (owner: 10Majavah) [14:26:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1466.eqiad.wmnet with OS buster completed: - mw1466 (**PASS**) -... [14:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1460.eqiad.wmnet with OS buster [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1460.eqiad.wmnet with OS buster completed: - mw1460 (**WARN**) -... 
[14:27:31] (03PS5) 10David Caro: kubeadm: label nodes with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: 10Majavah) [14:28:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:29:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1461.eqiad.wmnet with OS buster [14:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1464.eqiad.wmnet with OS buster [14:29:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1461.eqiad.wmnet with OS buster completed: - mw1461 (**WARN**) -... [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1464.eqiad.wmnet with OS buster completed: - mw1464 (**WARN**) -... 
[14:30:07] (03CR) 10CI reject: [V: 04-1] kubeadm: label nodes with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: 10Majavah) [14:30:28] (03PS2) 10David Caro: p:toolforge:grid:{exec_environ,toloviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 [14:30:30] (03CR) 10David Caro: p:toolforge:grid:{exec_environ,toloviews}: fix/add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [14:31:17] (03PS1) 10Giuseppe Lavagetto: mediawiki/php: fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/809187 [14:31:43] (03PS6) 10Majavah: kubeadm: label nodes with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) [14:32:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1468.eqiad.wmnet with OS buster [14:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1468.eqiad.wmnet with OS buster completed: - mw1468 (**PASS**) -... 
[14:32:17] (03Abandoned) 10Volans: sre.hosts.decommission: unblock decoms [cookbooks] - 10https://gerrit.wikimedia.org/r/808209 (owner: 10Volans) [14:32:21] !log on going PDU maintenance in RACk D1 codfw [14:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] D1: Initial commit - https://phabricator.wikimedia.org/D1 [14:32:57] (03PS3) 10David Caro: p:toolforge:grid:{exec_environ,toloviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 [14:33:12] (03CR) 10Majavah: p:toolforge:grid:{exec_environ,toloviews}: fix/add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [14:33:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36072/console" [puppet] - 10https://gerrit.wikimedia.org/r/809187 (owner: 10Giuseppe Lavagetto) [14:33:45] Welcome back _joe_ :-) [14:34:06] <_joe_> dancy: <3 [14:35:33] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36074/console" [puppet] - 10https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: 10Majavah) [14:36:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Increase weights on the language selector statement boosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [14:36:38] PROBLEM - Host cloudcontrol2004-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:36:56] ^ I’ll deploy this config change if nobody shouts at me [14:37:02] PROBLEM - Host mc2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:08] PROBLEM - Host ores2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:16] PROBLEM - Host pc2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:20] (03PS2) 10Lucas Werkmeister (WMDE): Increase weights on the language 
selector statement boosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [14:37:24] PROBLEM - Host restbase2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:26] PROBLEM - Host restbase2017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:30] hm, something going on in codfw? [14:37:40] PROBLEM - Host wcqs2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:43] ah, related to the recent SAL I guess [14:37:55] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: 10Majavah) [14:38:01] then I’ll wait a bit with the config change [14:38:40] PROBLEM - Host db2078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:40] PROBLEM - Host db2088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:40] PROBLEM - Host elastic2034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:44] PROBLEM - Host db2117.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:54] PROBLEM - Host db2128.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:54] PROBLEM - Host db2139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:08] PROBLEM - Host db2151.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:24] PROBLEM - Host es2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1462.eqiad.wmnet with OS buster [14:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1462.eqiad.wmnet with OS buster completed: - mw1462 (**WARN**) -... 
[14:39:56] PROBLEM - Host ganeti2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:04] PROBLEM - Host ganeti2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:27] (03PS2) 10Giuseppe Lavagetto: mediawiki/php: fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/809187 [14:41:29] (03CR) 10Lucas Werkmeister (WMDE): "Hm, I briefly tried this out on mwdebug and it still didn’t seem to make a difference, even when I bumped the number to 500. (Though I can" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [14:41:31] (03PS4) 10David Caro: p:toolforge:grid:{exec_environ,toolviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 [14:41:42] (03CR) 10David Caro: p:toolforge:grid:{exec_environ,toolviews}: fix/add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [14:42:28] (03PS2) 10Jbond: realm.pp: Add defaults for file [puppet] - 10https://gerrit.wikimedia.org/r/809095 [14:42:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36075/console" [puppet] - 10https://gerrit.wikimedia.org/r/809187 (owner: 10Giuseppe Lavagetto) [14:42:55] (03CR) 10David Caro: [C: 03+2] p:toolforge:grid:{exec_environ,toolviews}: fix/add tests [puppet] - 10https://gerrit.wikimedia.org/r/809130 (owner: 10David Caro) [14:44:16] PROBLEM - Host db2100.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:01] (03CR) 10Jcrespo: [C: 04-1] "I'd prefer to solve this by tuning the application behavior- I want to be notified if, for example, the config file is missing or corrupte" [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:45:52] heads-up: Arzhel and I will be deploying the bird2 upgrade that will affect the following services: Internal recursors, syslog, Wikidough, durum. 
For more information, see https://phabricator.wikimedia.org/T310574. [14:48:30] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki/php: fix ordering issues [puppet] - 10https://gerrit.wikimedia.org/r/809187 (owner: 10Giuseppe Lavagetto) [14:48:44] There are still a ton of PHP Warning: array_key_exists() expects parameter 2 to be array, float given errors in logstash. [14:49:09] !log disable puppet on P{R:Class = bird} [14:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:13] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809194 [14:49:20] (03CR) 10Ayounsi: "Didn't review addnode.py yet." [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 (owner: 10Volans) [14:49:55] alright! [14:49:59] could someone double check that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809170 has been completely synced? [14:50:01] urbanecm, ^ [14:50:06] sukhe: which server are you starting with? [14:50:10] XioNoX: durum1001 :) [14:50:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) Well the partitioning recipe didn't do what we wanted to anyway. * `/dev/sda` is the big RAID10 drive (as we suspected) * `/dev/sdb` is the... [14:50:54] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) >>! In T310980#8032744, @MoritzMuehlenhoff wrote: >>>! In T310980#8032293, @elukey wrote: >> I was able to bootstrap the cassandra ML cluster in codfw on Bullseye. The only odd thing is t... [14:51:05] XioNoX: https://debmonitor.wikimedia.org/hosts/durum1001.eqiad.wmnet [14:51:12] anycast-healthchecker 0.8.2-1wm1 upgrade [14:51:17] python3-anycast-healthchecker 0.8.2-1wm1 upgrade [14:51:18] looks good [14:51:22] I am going to merge the patch [14:51:23] ok? [14:51:28] yep! 
[14:51:39] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/808043 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [14:52:28] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10BCornwall) I spoke with @Vgutierrez and they would like to keep these instances around. I'll do a little more digging into fixing this, particularly sin... [14:53:34] XioNoX: the durum false alarms were fixed so we shouldn't have any of those [14:53:40] running puppet agent now on durum1001 [14:53:48] false alarms? [14:53:53] the Icinga checks [14:53:59] that were obviously false :) [14:54:05] (03CR) 10Jcrespo: [C: 04-1] "Also this is a duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/792113 which means there is a lack of previous coordinati" [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:54:24] ah right [14:54:59] (03CR) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:55:02] (03Abandoned) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:55:08] (03Abandoned) 10Zabe: dbbackups: remove absented dumps-sections cron [puppet] - 10https://gerrit.wikimedia.org/r/809172 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:55:48] sukhe: not receiving any prefixes on the router side [14:56:23] XioNoX: just merged [14:56:31] jouncebot: now [14:56:31] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [14:57:29] healthy anycast-hc and ACAST_PS_ADVERTISE [14:57:39] Jun 28 14:57:35 durum1001 
bird[27426]: bfd1: Bad packet from 2620:0:861:ffff::1 - unknown session id (2124633584) [14:57:56] (03PS4) 10Klausman: ml-staging: Add inference services for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) [14:58:14] (03PS5) 10Klausman: ml-staging: Add inference services for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) [14:58:18] PROBLEM - IPMI Sensor Status on mc2032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:58:42] PROBLEM - IPMI Sensor Status on ores2007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:01] sukhe: so v4 is established (inc. BFD) but it's not receiving any prefixes [14:59:22] PROBLEM - IPMI Sensor Status on db2139 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:24] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:59:29] hmm [14:59:43] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! In T310980#8033497, @elukey wrote: >>>! In T310980#8032744, @MoritzMuehlenhoff wrote: >>>>! In T310980#8032293, @elukey wrote: >>> I was able to bootstrap the cassandra ML cluster in... 
[14:59:47] sukhe: v6's BFD is *not* established but BGP is working fine [15:00:14] (03CR) 10Dzahn: "releng correct me if I'm wrong but I think it can wait until Monday, I was debugging the current situation there with buildkit and meanwhi" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [15:00:16] (03PS3) 10Volans: sre.ganeti.*: adapt to latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 [15:00:27] (03CR) 10Volans: "addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 (owner: 10Volans) [15:00:36] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:00:49] XioNoX: there is one more package :) [15:00:49] prometheus-bird-exporter [15:00:52] prometheus-bird-exporter : Depends: bird but it is not going to be installed [15:01:01] but that's not related to the current issue [15:01:10] going to use mwdebug1001 to test a patch if no-one objects [15:01:14] indeed [15:01:26] what does `birdc show bfd sessions` look like? [15:01:49] sukhe@durum1001:~$ sudo birdc show bfd sessions [15:01:49] BIRD 2.0.7 ready. 
[15:01:49] bfd1: [15:01:49] IP address Interface State Since Interval Timeout [15:01:52] 208.80.154.197 --- Up 14:54:51.930 0.300 0.900 [15:01:55] 208.80.154.196 --- Up 14:54:51.848 0.300 0.900 [15:02:05] also we still have a bird6 service when we should not and didn't last time [15:02:11] /var/log/anycast-healthchecker/anycast-healthchecker.log is from January [15:02:44] this is expected, since we set [15:02:44] profile::bird::anycasthc_logging: level: 'critical' num_backups: 1 [15:03:02] RECOVERY - IPMI Sensor Status on ores2007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:03:12] RECOVERY - Host ganeti2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [15:03:47] bird_reconfigure_cmd = /usr/sbin/birdc configure [15:03:47] bird6_reconfigure_cmd = /usr/sbin/birdc configure [15:03:48] this is fine [15:03:54] and so is bird_reconfigure_cmd = /usr/sbin/birdc configure [15:03:54] bird6_reconfigure_cmd = /usr/sbin/birdc configure [15:03:59] bird6_conf = /etc/bird/anycast6-prefixes.conf [15:04:14] yeah why is there a bird6 still running? [15:04:19] https://www.irccloud.com/pastebin/saR5Y4DG/ [15:04:30] last time it was not running and we didn't change anything related to that in the new commit [15:04:47] https://puppet-compiler.wmflabs.org/pcc-worker1002/36018/doh1001.wikimedia.org/index.html [15:04:50] Resources only in the old catalog [15:04:53] sukhe: restarting bird [15:04:53] Systemd::Unit[bird6] [15:04:54] Systemd::Service[bird6] [15:05:14] sukhe: you need to explicitly set ensure => absent for a custom service unit to be removed [15:05:39] if that's the case, I wonder how was it removed the last time? 
because I checked the logs and it was [15:06:12] RECOVERY - Host ores2007.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 519.43 ms [15:06:12] RECOVERY - Host pc2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 472.73 ms [15:06:26] RECOVERY - Host restbase2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [15:06:26] RECOVERY - Host restbase2017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.23 ms [15:06:30] RECOVERY - Host wcqs2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.60 ms [15:07:00] I am updating the prometheus-bird-exporter package in the meantime [15:07:00] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:07:10] RECOVERY - Host db2100.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.31 ms [15:07:16] essentially, [15:07:16] Depends: bird | bird2, ${misc:Depends}, ${shlibs:Depends} [15:07:24] RECOVERY - Host db2078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.32 ms [15:07:24] RECOVERY - Host db2088.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms [15:07:24] RECOVERY - Host elastic2034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [15:07:24] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:07:26] RECOVERY - Host db2117.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.49 ms [15:07:30] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>>>! In T310980#8033319, @Eevans wrote: >>>>>! In T310980#8032744, @MoritzMuehlenhoff wrote: >>>>>>! In T310980#8032293, @elukey wrote: [ ... ] >>> >>> We definitely use/need cqlsh.... 
[15:07:30] sukhe: so the BFD issue was because of that old process [15:07:34] RECOVERY - Host db2128.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.01 ms [15:07:34] RECOVERY - Host db2139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.02 ms [15:07:35] interesting! [15:07:45] BIRD 2.0.7 ready. [15:07:45] bfd1: [15:07:46] IP address Interface State Since Interval Timeout [15:07:46] RECOVERY - Host db2151.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.56 ms [15:07:48] RECOVERY - Host es2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.64 ms [15:07:49] 208.80.154.196 --- Up 15:06:45.938 0.300 0.900 [15:07:51] 2620:0:861:ffff::1 --- Up 15:06:45.097 0.300 0.900 [15:07:54] 2620:0:861:ffff::2 --- Up 15:06:45.203 0.300 0.900 [15:07:57] 208.80.154.197 --- Up 15:06:46.263 0.300 0.900 [15:08:02] sukhe: but bird is still not advertising the prefixes [15:08:32] RECOVERY - Host ganeti2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms [15:09:46] PROBLEM - IPMI Sensor Status on restbase2017 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:10:00] XioNoX: try making the anycast-hc logs more verbose to see if it says something? [15:10:06] oh seems like you are doing that [15:10:23] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) @MoritzMuehlenhoff given what Eric said, would it be ok to stay on Buster until Cassandra 4.0 is added to our repos? [15:10:27] the prefixes are in the define [15:10:33] in /etc/bird/anycast6-prefixes.conf etc [15:10:35] yeah... 
[15:10:47] so I think anycast healthchecker is working as expected [15:10:51] so anycast-hc is fine then [15:10:58] RECOVERY - Host cloudcontrol2004-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [15:11:04] PROBLEM - IPMI Sensor Status on ml-staging2002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:07] increasing logging just in case [15:11:08] PROBLEM - IPMI Sensor Status on es2034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:11:10] (03CR) 10Zabe: dbbackups: convert dumps-sections cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809171 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:11:30] RECOVERY - Host mc2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [15:11:46] yep, that's going well [15:11:54] so it's something with the bird config [15:11:57] yep [15:12:01] checking [15:12:15] thanks, I am uploading the new build of prometheus-bird to at least have a clean puppet run [15:13:49] !log upload prometheus-bird-exporter (1.2.2-1wm1) buster-wikimedia - T310574 [15:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:55] T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [15:13:57] (03PS1) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) [15:14:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:15:03] sukhe: durum1001:~$ ps aux | grep bird6 [15:15:04] bird 14342 
0.0 0.0 11400 1912 ? Ssl 15:11 0:00 /usr/sbin/bird6 -f -u bird -g bird [15:15:08] it's back :) [15:15:18] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:15:22] puppet run [15:15:23] fixing [15:15:29] yeah, no big deal [15:15:52] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:16:12] (03PS2) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) [15:16:54] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:17:11] Routes: 0 imported, 0 exported, 0 preferred [15:17:13] cool, prometheus-bird-exporter accepted the optional bird2 [15:17:48] clean puppet run [15:17:53] XioNoX: that leaves us with bird config? [15:17:59] zabe: sorry, i was afk for a while. i spot checked a few appservers, and $wmgUseQuickSurveys was false on them. then i checked logstash, and mw1417/mw1418 don't have the change for some weird reason. 
[15:18:02] let me...fix that [15:18:04] jouncebot: nowandnext [15:18:04] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [15:18:04] In 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T1600) [15:18:09] looking there now [15:19:14] (03CR) 10Hnowlan: "Some ratelimit service output proving that this change works" [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [15:19:22] (03CR) 10DCausse: Increase weights on the language selector statement boosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:20:10] PROBLEM - IPMI Sensor Status on elastic2034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:22:06] !log urbanecm@deploy1002 Synchronized wmf-config/: ensuring wmf-config is up2date at appservers (at least mw1417/mw1418 have old config) (duration: 03m 39s) [15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:52] mw1417 has new config now, mw1418 too [15:23:01] checked mw1414 (also in logstash), OK [15:23:07] so i guess it's now actually fixed [15:23:17] sukhe: yeah still digging [15:23:24] ok, thanks. A bit odd that not all appservers got the new config. [15:23:26] XioNoX: is it just IPv6? [15:23:37] both v4 and v6 [15:23:42] I see [15:23:50] I think I know [15:23:52] one sec [15:23:57] please say so :D [15:25:22] zabe: i agree. i see some reimages in SAL, but not for the hosts that were wrong. [15:27:48] (03PS1) 10Sbisson: Localisation updates from https://translatewiki.net. 
[extensions/Wikistories] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809122 [15:28:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:13] (03CR) 10Urbanecm: [C: 03+2] "per Sbisson's request, wmf.18 is not at deployment host yet, so this will just make the change ride the train" [extensions/Wikistories] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809122 (owner: 10Sbisson) [15:30:51] sukhe: so basically bird doesn't import the loopback IPs [15:30:55] PROBLEM - Host mc2032 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:11] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/Wikistories] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809122 (owner: 10Sbisson) [15:31:53] PROBLEM - Host es2033 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:43] XioNoX: ok, I am trying to see what changed then in bird2 [15:32:51] RECOVERY - Host mc2032 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:33:29] (03CR) 10Bking: elastic: configure keystore values for restore (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [15:33:31] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: clear ExecStartPost for php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/809204 [15:33:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10fgiunchedi) Thank you for the investigation @BTullis ! In case you haven't come across it yet: the `partman/custom/kafka-jumbo.cfg` configuration wou... [15:34:31] RECOVERY - Host es2033 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [15:35:05] XioNoX: would it be setting the interfaces in protocol device? [15:35:07] is that the part that handles it? 
[15:35:27] FIXED [15:35:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:32] wohoo! [15:35:45] and confirmed on the router side [15:35:46] export all? [15:36:12] so I did a few changes, I think only the export all under protocol direct is needed [15:36:18] ok! [15:36:21] rolling the others back [15:36:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:36:24] I am diffing and preparing a patch [15:36:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:35] sukhe: are you editing the config file? [15:36:41] just read-only [15:36:46] let me cat it instead [15:36:50] Found a swap file by the name "/etc/bird/.bird.conf.swp" [15:36:54] try now? 
[15:37:01] I was in edit-and-read mode before [15:37:03] now I am just viewing it [15:37:09] bird 2 switched to a reject by default [15:37:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36076/console" [puppet] - 10https://gerrit.wikimedia.org/r/809204 (owner: 10Giuseppe Lavagetto) [15:37:51] PROBLEM - mysqld processes on es2033 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:38:12] em [15:38:24] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::php: clear ExecStartPost for php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/809204 (owner: 10Giuseppe Lavagetto) [15:38:30] sukhe: alright, current config works [15:38:35] pushing patch for review [15:38:45] is that a downtime expiration, a crash, a network issue? [15:39:25] (03PS1) 10Ssingh: bird: update bird.conf for bird2 changes [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) [15:39:26] jynus: es2033:~$ uptime [15:39:26] 15:39:15 up 5 min [15:39:32] I see [15:39:37] due to power work I guess? [15:39:40] PDU work I mean [15:40:07] RECOVERY - IPMI Sensor Status on restbase2017 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:40:08] who from dba team is around? 
[15:40:08] sukhe: not all those changes are needed [15:40:12] mc2032 also just rebooted [15:40:13] sukhe: see current file [15:40:21] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36077/console" [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [15:40:55] ouch [15:41:00] XioNoX: did the puppet run override it? [15:41:24] or was it not saved? I see the swp [15:41:29] RECOVERY - IPMI Sensor Status on ml-staging2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:41:31] sukhe: saved now [15:41:34] sukhe: basically this [15:41:36] thanks, checking [15:41:37] https://www.irccloud.com/pastebin/bnrvIxR9/ [15:41:37] RECOVERY - IPMI Sensor Status on es2034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:41:49] ah! [15:42:08] we could restrict it even further, to interface "lo*"; but that's for another day :) [15:42:15] yeah :D [15:42:23] all I need is an optional for do_ipv6 [15:42:25] and I am pushing this [15:42:26] for review [15:43:36] yep [15:44:31] (03PS2) 10Ssingh: bird: update bird.conf for bird2 changes [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) [15:45:19] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36078/console" [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [15:45:27] sukhe: we should do the final rollout another day, I have to step away now, and we still have to solve the bird6 issue [15:45:46] XioNoX: we will need to keep Puppet disabled then on the other hosts... 
or, to revert this change [15:46:05] of course that's fine with me! [15:46:07] sukhe: I'd say revert [15:46:16] got it [15:46:35] so we don't rush the bird6 aspect of it [15:46:36] also, I think the bird6 thing happened because of the failed puppet run but I will do the ensure => absent thing taavi suggested above, just to be extra sure [15:46:38] sure [15:46:50] so then let's leave this unmerged [15:46:52] but reviewed [15:46:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/809205 [15:47:06] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10jcrespo) [15:47:08] and I will rollback and revert durum1001 [15:47:23] I mean, at least it works now :) [15:47:44] PROBLEM - MariaDB read only es2 on es2033 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:48:06] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10jcrespo) [15:48:17] XioNoX: ok then, I am reverting [15:48:42] (03CR) 10Ayounsi: [C: 03+1] bird: update bird.conf for bird2 changes [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [15:48:59] yeah we have it fully working, that's what matters [15:49:08] the other packages: anycast-healthchecker, python3-anycast-healthchecker, prometheus-bird-exporter should not auto update anyway but even if they do, that's OK since the only thing that changed was the package Depends [15:49:40] (03PS1) 10Ssingh: Revert "bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations)" [puppet] - 10https://gerrit.wikimedia.org/r/809123 [15:49:52] and depends is bird | bird2, so we are fine. sharing for posterity mostly [15:49:59] noted [15:50:07] thanks! nice catch! 
[15:50:13] feel free to leave, I got the revert part [15:50:14] <3 [15:50:23] !log volans@cumin2002 START - Cookbook sre.dns.netbox [15:50:24] alright! thanks for your time! [15:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:29] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:50:35] PROBLEM - ps1-d1-codfw-infeed-load-tower-A-phase-Y on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:50:47] PROBLEM - ps1-d1-codfw-infeed-load-tower-B-phase-X on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:50:57] RECOVERY - IPMI Sensor Status on elastic2034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:51:01] PROBLEM - ps1-d1-codfw-infeed-load-tower-A-phase-X on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:01] PROBLEM - ps1-d1-codfw-infeed-load-tower-B-phase-Z on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:15] PROBLEM - ps1-d1-codfw-infeed-load-tower-A-phase-Z on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:31] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) a:03Marostegui Will check it! Thanks for the initial triage!! 
[15:51:33] PROBLEM - ps1-d1-codfw-infeed-load-tower-B-phase-Y on ps1-d1-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:39] RECOVERY - IPMI Sensor Status on mc2032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:52:23] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) Many icinga power alerts on IRC right now, so might be related [15:52:29] (03CR) 10Papaul: [C: 03+2] Add new PDU model to ps1-d1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/809177 (https://phabricator.wikimedia.org/T310146) (owner: 10Papaul) [15:53:19] marostegui: PDU maintenance in codfw as we speak [15:53:30] yep [15:53:34] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Volans) Given that mc2032 also got rebooted around the same time and they are in the same rack it could be related to the work in T309956. It could also indicate that either they have the power cables not plugged... 
[15:53:53] RECOVERY - IPMI Sensor Status on db2139 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:54:10] 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) Yeah, definitely related to that maintenance [15:54:10] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:16] (03CR) 10Ssingh: [C: 03+2] Revert "bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations)" [puppet] - 10https://gerrit.wikimedia.org/r/809123 (owner: 10Ssingh) [15:54:49] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:20] please ignore durum1001 for the next few minutes [15:55:41] I will silence it if it gets out of hand, but the IRC notifications are helpful. there is no paging [15:56:47] (03CR) 10Elukey: [C: 03+1] "The CI diff looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:56:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:57:03] RECOVERY - ps1-d1-codfw-infeed-load-tower-A-phase-Y on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-A-phase-Y 310 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:41] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:58:46] ^ expected [15:59:03] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:59:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:59:27] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [15:59:51] !log PDU maintenance in RAck D1 codfw complete [15:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:56] D1: Initial commit - https://phabricator.wikimedia.org/D1 [16:00:05] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:57] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:01:25] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:31] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:04:03] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 3 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10LSobanski) @Varnent After chatting about this some more, how about we do the following: - Redirect policy.wikimedia.org as requ... [16:04:24] !log dancy@deploy1002 Installing scap version "4.10.0" for 561 hosts [16:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:44] !log dancy@deploy1002 Installation of scap version "4.10.0" completed for 561 hosts [16:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:21] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-staging: Add inference services for testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809146 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [16:07:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:47] (03PS3) 10Majavah: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 [16:07:49] (03PS1) 10Majavah: P:wmcs: toolsdb: support mariadb 10.4 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/809213 (https://phabricator.wikimedia.org/T301949) [16:08:01] !log enable puppet on P{R:Class = bird} (complete rollback of Ieab3abb6) [16:08:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:06] (03PS2) 10Majavah: P:wmcs: toolsdb: support mariadb 10.4 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/809213 (https://phabricator.wikimedia.org/T301949) [16:09:27] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [16:09:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36080/console" [puppet] - 10https://gerrit.wikimedia.org/r/809213 (https://phabricator.wikimedia.org/T301949) (owner: 10Majavah) [16:09:59] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [16:10:18] (03Abandoned) 10Ssingh: bird: update bird.conf for bird2 changes [puppet] - 10https://gerrit.wikimedia.org/r/809205 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [16:10:52] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:15] (03CR) 10David Caro: [C: 03+2] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809213 (https://phabricator.wikimedia.org/T301949) (owner: 10Majavah) [16:11:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:49] RECOVERY - ps1-d1-codfw-infeed-load-tower-A-phase-X on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-A-phase-X 161 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:13] (03CR) 10Brennen Bearnes: [C: 03+1] admin: add gitlab-roots group to gitlab_runner role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [16:18:09] RECOVERY - ps1-d1-codfw-infeed-load-tower-A-phase-Z on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-A-phase-Z 190 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:33] RECOVERY - ps1-d1-codfw-infeed-load-tower-B-phase-X on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-B-phase-X 167 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:03] RECOVERY - ps1-d1-codfw-infeed-load-tower-B-phase-Y on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-B-phase-Y 291 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:27] (03PS1) 10Majavah: P:wmcs: toolsdb: extend binlog retention [puppet] - 10https://gerrit.wikimedia.org/r/809219 [16:24:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36081/console" [puppet] - 10https://gerrit.wikimedia.org/r/809219 (owner: 10Majavah) [16:25:19] RECOVERY - ps1-d1-codfw-infeed-load-tower-B-phase-Z on ps1-d1-codfw is OK: SNMP OK - ps1-d1-codfw-infeed-load-tower-B-phase-Z 191 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:26:59] 10SRE, 
10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) [16:32:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:43] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:06] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:56:34] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) [16:57:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1470.eqiad.wmnet with OS buster [16:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1470.eqiad.wmnet with OS buster [16:57:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:57:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1471.eqiad.wmnet with OS buster [16:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1471.eqiad.wmnet with OS buster [16:59:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1473.eqiad.wmnet with OS buster [16:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1473.eqiad.wmnet with OS buster [16:59:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1474.eqiad.wmnet with OS buster [16:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1474.eqiad.wmnet with OS buster [16:59:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1476.eqiad.wmnet with OS buster [16:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1476.eqiad.wmnet with OS 
buster [17:00:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1478.eqiad.wmnet with OS buster [17:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1478.eqiad.wmnet with OS buster [17:01:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1479.eqiad.wmnet with OS buster [17:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1479.eqiad.wmnet with OS buster [17:01:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1480.eqiad.wmnet with OS buster [17:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1480.eqiad.wmnet with OS buster [17:02:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1481.eqiad.wmnet with OS buster [17:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1481.eqiad.wmnet with OS buster [17:02:46] !log cmjohnson@cumin1001 START - 
Cookbook sre.hosts.reimage for host mw1482.eqiad.wmnet with OS buster [17:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1482.eqiad.wmnet with OS buster [17:03:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1483.eqiad.wmnet with OS buster [17:03:05] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1483.eqiad.wmnet with OS buster [17:03:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS buster [17:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1484.eqiad.wmnet with OS buster [17:04:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:04:15] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [17:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:04:57] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Andrew) [17:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1472.eqiad.wmnet with OS buster [17:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1472.eqiad.wmnet with OS buster [17:08:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage [17:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage [17:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1473.eqiad.wmnet with reason: host reimage [17:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1474.eqiad.wmnet with reason: host reimage [17:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:50] 
!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1476.eqiad.wmnet with reason: host reimage [17:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage [17:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1478.eqiad.wmnet with reason: host reimage [17:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1479.eqiad.wmnet with reason: host reimage [17:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1480.eqiad.wmnet with reason: host reimage [17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1481.eqiad.wmnet with reason: host reimage [17:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [17:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1474.eqiad.wmnet with reason: host reimage [17:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage [17:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:40] 
!log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage [17:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1473.eqiad.wmnet with reason: host reimage [17:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1472.eqiad.wmnet with reason: host reimage [17:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:58] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) [17:17:00] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) [17:17:02] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) [17:18:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [17:18:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1476.eqiad.wmnet with reason: host reimage [17:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage [17:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:30] !log 
milimetric@deploy1002 Started deploy [airflow-dags/analytics@f3e667d]: (no justification provided) [17:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:39] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@f3e667d]: (no justification provided) (duration: 00m 09s) [17:20:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage [17:20:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1479.eqiad.wmnet with reason: host reimage [17:20:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage [17:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1478.eqiad.wmnet with reason: host reimage [17:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:47] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1480.eqiad.wmnet with reason: host reimage [17:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1481.eqiad.wmnet with reason: host reimage [17:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1472.eqiad.wmnet with reason: host reimage [17:25:24] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) >>! In T304888#8033321, @Andrew wrote: > Yep, everything together in one raid sounds good to me. Great, thanks for the input [17:26:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1484.eqiad.wmnet with OS buster [17:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1484.eqiad.wmnet with OS buster executed with errors: - mw1484 (**F... [17:28:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:46] (03PS1) 10Jbond: beaker: add initial beaker files (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/809224 [17:28:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:08] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [17:33:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1482.eqiad.wmnet with OS buster [17:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - 
https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1482.eqiad.wmnet with OS buster completed: - mw1482 (**WARN**) -... [17:34:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1485.eqiad.wmnet with OS buster [17:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1485.eqiad.wmnet with OS buster [17:34:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1486.eqiad.wmnet with OS buster [17:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1486.eqiad.wmnet with OS buster [17:37:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1481.eqiad.wmnet with OS buster [17:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1481.eqiad.wmnet with OS buster completed: - mw1481 (**WARN**) -... 
[17:39:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1487.eqiad.wmnet with OS buster [17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1487.eqiad.wmnet with OS buster [17:40:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1483.eqiad.wmnet with OS buster [17:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1483.eqiad.wmnet with OS buster completed: - mw1483 (**WARN**) -... 
[17:40:36] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.18 refs T308071 [17:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:42] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [17:40:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1471.eqiad.wmnet with OS buster [17:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1471.eqiad.wmnet with OS buster completed: - mw1471 (**WARN**) -... [17:43:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1474.eqiad.wmnet with OS buster [17:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1474.eqiad.wmnet with OS buster completed: - mw1474 (**PASS**) -... 
[17:43:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1488.eqiad.wmnet with OS buster [17:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1488.eqiad.wmnet with OS buster [17:43:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1489.eqiad.wmnet with OS buster [17:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1489.eqiad.wmnet with OS buster [17:44:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1470.eqiad.wmnet with OS buster [17:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1470.eqiad.wmnet with OS buster completed: - mw1470 (**PASS**) -... 
[17:44:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1490.eqiad.wmnet with OS buster [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1490.eqiad.wmnet with OS buster [17:44:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1491.eqiad.wmnet with OS buster [17:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1491.eqiad.wmnet with OS buster [17:45:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage [17:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage [17:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage [17:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1487.eqiad.wmnet with reason: host reimage [17:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1486.eqiad.wmnet with 
reason: host reimage [17:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:20] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Andrew) For these boxes, please make one big hardware raid10 out of all the drives, and then use partma... [17:52:29] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.18 refs T308071 (duration: 11m 52s) [17:52:33] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Andrew) [17:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:34] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [17:53:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1487.eqiad.wmnet with reason: host reimage [17:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:42] (03PS1) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) [17:53:50] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:53:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1473.eqiad.wmnet with OS buster [17:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
cmjohnson@cumin1001 for host mw1473.eqiad.wmnet with OS buster completed: - mw1473 (**PASS**) -... [17:54:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1488.eqiad.wmnet with reason: host reimage [17:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36082/console" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [17:54:49] !log dduvall@deploy1002 Pruned MediaWiki: 1.39.0-wmf.16, 1.39.0-wmf.15 (duration: 02m 18s) [17:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1489.eqiad.wmnet with reason: host reimage [17:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1492.eqiad.wmnet with OS buster [17:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1492.eqiad.wmnet with OS buster [17:55:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1490.eqiad.wmnet with reason: host reimage [17:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:12] (03CR) 10Ssingh: [V: 03+1] "Same as Ieab3abb635, but with the following changes:" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [17:56:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:56:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage [17:56:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1488.eqiad.wmnet with reason: host reimage [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1476.eqiad.wmnet with OS buster [17:58:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1476.eqiad.wmnet with OS buster completed: - mw1476 (**PASS**) -... 
[17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1493.eqiad.wmnet with OS buster [17:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1493.eqiad.wmnet with OS buster [18:00:04] dduvall and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220628T1800). [18:00:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1489.eqiad.wmnet with reason: host reimage [18:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1479.eqiad.wmnet with OS buster [18:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1479.eqiad.wmnet with OS buster completed: - mw1479 (**WARN**) -... 
[18:01:39] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (10ssingh) Notes from today's deployment: - We were missing an additional bird config, notably, bird2 rejects by default so we need to explicitly set `export all` f... [18:01:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1480.eqiad.wmnet with OS buster [18:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1480.eqiad.wmnet with OS buster completed: - mw1480 (**WARN**) -... [18:01:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1490.eqiad.wmnet with reason: host reimage [18:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1491.eqiad.wmnet with reason: host reimage [18:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [18:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1478.eqiad.wmnet with OS buster [18:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
cmjohnson@cumin1001 for host mw1478.eqiad.wmnet with OS buster completed: - mw1478 (**WARN**) -... [18:08:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1494.eqiad.wmnet with OS buster [18:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:08:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1494.eqiad.wmnet with OS buster [18:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1495.eqiad.wmnet with OS buster [18:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1495.eqiad.wmnet with OS buster [18:09:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:10:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1472.eqiad.wmnet with OS buster [18:11:00] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1472.eqiad.wmnet with OS buster completed: - mw1472 (**PASS**) -... [18:11:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1486.eqiad.wmnet with OS buster [18:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1486.eqiad.wmnet with OS buster completed: - mw1486 (**PASS**) -... 
[18:13:07] !log volans@cumin2002 START - Cookbook sre.dns.netbox [18:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1493.eqiad.wmnet with reason: host reimage [18:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1497.eqiad.wmnet with OS buster [18:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1497.eqiad.wmnet with OS buster [18:17:02] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1498.eqiad.wmnet with OS buster [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster [18:18:32] (03PS1) 10Dduvall: testwikis wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809232 (https://phabricator.wikimedia.org/T308071) [18:18:34] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809232 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:18:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
mw1489.eqiad.wmnet with OS buster [18:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1489.eqiad.wmnet with OS buster completed: - mw1489 (**PASS**) -... [18:19:16] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809232 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:19:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1494.eqiad.wmnet with reason: host reimage [18:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1495.eqiad.wmnet with reason: host reimage [18:19:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1498.eqiad.wmnet with OS buster [18:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster [18:19:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1498.eqiad.wmnet with OS buster [18:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:54] 10SRE, 10ops-eqiad, 
10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster executed with errors: - mw1498 (**F... [18:20:16] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.18 refs T308071 [18:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:21] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [18:21:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1498.eqiad.wmnet with OS buster [18:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster [18:21:35] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1498.eqiad.wmnet with OS buster [18:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster executed with errors: - mw1498 (**F... 
[18:22:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1494.eqiad.wmnet with reason: host reimage [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1495.eqiad.wmnet with reason: host reimage [18:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) mw1498 did not install and failed immediately after inputting the mgmt password. [18:24:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:24:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:04] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) [18:25:27] (03CR) 10Ssingh: trafficserver: 9.x upgrade: remove wmf-tls log format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [18:25:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:14] (03PS9) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 
(https://phabricator.wikimedia.org/T308501) [18:26:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1485.eqiad.wmnet with OS buster [18:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1485.eqiad.wmnet with OS buster completed: - mw1485 (**WARN**) -... [18:27:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1487.eqiad.wmnet with OS buster [18:27:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1490.eqiad.wmnet with OS buster [18:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1487.eqiad.wmnet with OS buster completed: - mw1487 (**WARN**) -... [18:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:20] (03CR) 10Dduvall: "Friendly ping. When should I expect review of this?" [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [18:27:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1490.eqiad.wmnet with OS buster completed: - mw1490 (**WARN**) -... 
[18:28:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1497.eqiad.wmnet with reason: host reimage [18:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1498.eqiad.wmnet with reason: host reimage [18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1491.eqiad.wmnet with OS buster [18:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1491.eqiad.wmnet with OS buster completed: - mw1491 (**PASS**) -... [18:30:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1488.eqiad.wmnet with OS buster [18:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1488.eqiad.wmnet with OS buster completed: - mw1488 (**PASS**) -... 
[18:32:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1497.eqiad.wmnet with reason: host reimage [18:32:52] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.039 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:34:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1498.eqiad.wmnet with reason: host reimage [18:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1493.eqiad.wmnet with OS buster [18:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1493.eqiad.wmnet with OS buster completed: - mw1493 (**PASS**) -... [18:36:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudgw2003-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:29] (03PS1) 10Ryan Kemper: Revert "elastic: add fake elasticsearch.keystore" [labs/private] - 10https://gerrit.wikimedia.org/r/809246 [18:38:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:41:02] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: add fake elasticsearch.keystore" [labs/private] - 10https://gerrit.wikimedia.org/r/809246 (owner: 10Ryan Kemper) [18:41:44] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:41:45] (03PS1) 10Ottomata: analytics_test_cluster presto - remove deprecated hive.parquet.fail-on-corrupted-statistics [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) [18:42:25] (03CR) 10CI reject: [V: 04-1] analytics_test_cluster presto - remove deprecated hive.parquet.fail-on-corrupted-statistics [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [18:44:53] (03PS2) 10Ottomata: analytics_test_cluster presto - remove deprecated hive.parquet.fail-on-corrupted-statistics [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) [18:45:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1494.eqiad.wmnet with OS buster [18:45:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1494.eqiad.wmnet with OS buster completed: - mw1494 (**PASS**) -... [18:45:35] (03CR) 10CI reject: [V: 04-1] analytics_test_cluster presto - remove deprecated hive.parquet.fail-on-corrupted-statistics [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [18:45:47] (03PS3) 10Ottomata: analytics_test_cluster presto - remove deprecated property [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) [18:46:55] (03CR) 10Ottomata: [C: 03+2] analytics_test_cluster presto - remove deprecated property [puppet] - 10https://gerrit.wikimedia.org/r/809236 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [18:47:50] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.18 refs T308071 (duration: 27m 34s) [18:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:56] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [18:49:02] (03CR) 10Andrew Bogott: [C: 03+2] "“Do one thing every day that scares you.”" [puppet] - 10https://gerrit.wikimedia.org/r/795357 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [18:49:07] (03PS2) 10Andrew Bogott: openstack::neutron: enable TLS encryption for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/795357 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [18:49:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1498.eqiad.wmnet with OS buster [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:25] 10SRE, 10ops-eqiad, 10DC-Ops, 
10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1498.eqiad.wmnet with OS buster completed: - mw1498 (**PASS**) -... [18:51:01] (03PS1) 10Dduvall: group0 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809237 (https://phabricator.wikimedia.org/T308071) [18:51:03] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809237 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:51:23] (03CR) 10BPirkle: api-gateway: allow discovery services to set custom rate limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [18:52:24] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809237 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:56:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:56:18] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.18 refs T308071 [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:25] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [18:56:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1492.eqiad.wmnet with OS buster [18:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:34] (03PS1) 10Papaul: Add new Cloud node to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809239 (https://phabricator.wikimedia.org/T306854) [18:56:36] (03PS1) 10Papaul: Add new cloud nodes to 
site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809240 (https://phabricator.wikimedia.org/T306854) [18:56:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1492.eqiad.wmnet with OS buster completed: - mw1492 (**PASS**) -... [18:58:21] (03PS1) 10Ottomata: presto - enable iceberg catalog in analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/809241 (https://phabricator.wikimedia.org/T311525) [18:59:36] (03PS2) 10Ottomata: presto - enable iceberg catalog in analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/809241 (https://phabricator.wikimedia.org/T311525) [19:00:41] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36083/console" [puppet] - 10https://gerrit.wikimedia.org/r/809241 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [19:01:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudgw2003-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:32] (03CR) 10Ottomata: [V: 03+1 C: 03+2] presto - enable iceberg catalog in analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/809241 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [19:02:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:02:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:04:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:04:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:07] (03CR) 10Ryan Kemper: [V: 03+1] elastic: configure keystore values for restore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:21] (03CR) 10Papaul: [C: 03+2] Add new Cloud node to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809239 (https://phabricator.wikimedia.org/T306854) (owner: 10Papaul) [19:05:24] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36084/console" [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:06:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1495.eqiad.wmnet with OS buster [19:06:06] (03CR) 10Papaul: [C: 03+2] Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809240 (https://phabricator.wikimedia.org/T306854) (owner: 10Papaul) [19:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
cmjohnson@cumin1001 for host mw1495.eqiad.wmnet with OS buster completed: - mw1495 (**PASS**) -... [19:06:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1497.eqiad.wmnet with OS buster [19:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1497.eqiad.wmnet with OS buster completed: - mw1497 (**PASS**) -... [19:08:18] !log T309648 Disabling puppet across all cirrus hosts in order to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/807623: `ryankemper@cumin1001:~$ sudo -E cumin 'R:elasticsearch::instance' 'disable-puppet "T309648"'` [19:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:24] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [19:09:57] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:10:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:10:26] (03CR) 10Bking: [C: 03+1] elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:35] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova: enable TLS encryption for 
rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/795358 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [19:11:46] (03PS2) 10Andrew Bogott: openstack::nova: enable TLS encryption for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/795358 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [19:12:07] (03PS7) 10BryanDavis: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [19:12:36] papaul: I went ahead and puppet-merged your two changes `b2830db6f5` and `9db7078e1e` [19:14:02] (03CR) 10BryanDavis: striker: Add profile to provision docker container (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:14:39] !log T309648 Enabling puppet on just `elastic2053` and running puppet agent. Expecting to see result of https://gerrit.wikimedia.org/r/807623 being that the new s3 user/pass creds are added to the elasticsearch keystore [19:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:45] (03CR) 10CI reject: [V: 04-1] striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:14:46] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [19:15:20] (03PS2) 10Andrew Bogott: openstack::trove: enable rabbitmq tls for api [puppet] - 10https://gerrit.wikimedia.org/r/795361 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [19:19:11] (03PS1) 10Ryan Kemper: elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) [19:19:54] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] "Haha, both Brian and I forgot to git-review -R our respective changes. 
See new patch for fix: https://gerrit.wikimedia.org/r/c/operations/" [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:20:20] (03CR) 10CI reject: [V: 04-1] elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:20:53] (03PS2) 10Ryan Kemper: elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) [19:22:00] (03CR) 10CI reject: [V: 04-1] elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:22:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.wikimedia.org with OS bullseye [19:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:44] (03PS3) 10Ryan Kemper: elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) [19:22:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cu... [19:23:58] (03PS1) 10Jdlrobson: Prevent skinStyles from applying to the Vector 2022 skin. 
[extensions/VisualEditor] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809245 (https://phabricator.wikimedia.org/T310197) [19:25:25] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36088/console" [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:25:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:55] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:07] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:27] (03CR) 10Bking: [C: 03+1] elastic: configure keystore values for restore [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:29:05] (03CR) 10Bking: [C: 03+1] elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:29:26] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: fix s3 user/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/809243 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:29:53] (03CR) 10Ayounsi: [C: 03+1] "lgtm! 
to be tested" [cookbooks] - 10https://gerrit.wikimedia.org/r/809136 (owner: 10Volans) [19:32:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:32:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:59] (03PS1) 10Ryan Kemper: elastic: elasticsearch-keystore takes from stdin [puppet] - 10https://gerrit.wikimedia.org/r/809267 (https://phabricator.wikimedia.org/T309648) [19:35:21] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:40] (03CR) 10Ryan Kemper: "ERROR THAT LED TO THIS CHANGE" [puppet] - 10https://gerrit.wikimedia.org/r/809267 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:36:45] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36089/console" [puppet] - 10https://gerrit.wikimedia.org/r/809267 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [19:36:46] !log installing nodejs 12 security updates (as shipped in Debian bullseye) [19:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.wikimedia.org with reason: host reimage [19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:38:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.wikimedia.org with reason: host reimage [19:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:14] !log restarting turnilo to pick up new nodejs [19:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:39] !log restarting etherpad to pick up new nodejs [19:48:41] (03PS2) 10Ryan Kemper: elastic: elasticsearch-keystore takes from stdin [puppet] - 10https://gerrit.wikimedia.org/r/809267 (https://phabricator.wikimedia.org/T309648) [19:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:45] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36090/console" [puppet] - 10https://gerrit.wikimedia.org/r/809267 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [20:03:04] hi - i will deploy [20:04:21] !log gitlab-runner* -disabling puppet - deploying firewall change on 2004 first [20:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:45] Jdlrobson: I think you're here -- proceeding with 1st patch [20:15:32] !log volans@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf2002.codfw.wmnet [20:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:52] Jdlrobson: i see you here now -- not sure what happened/is happening [20:16:34] o/ [20:16:56] !log gitlab-runner2004 - fixing /etc/resolv.conf and with that the puppet run, leftover mistake from tests [20:16:56] So yeh as I said on DM https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/808071 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/809245 can be merged and synced at the same time as the 
Vector one [20:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] (PS1) Volans: sre.hosts.decommission: fix call to ganeti sync [cookbooks] - https://gerrit.wikimedia.org/r/809275 [20:18:09] weird - ok thanks Jdlrobson: can you add https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/809245 to the deployment calendar? [20:18:42] it should be there.. is it not? [20:19:01] oh i used the wrong change id [20:19:05] ok done [20:19:09] !log volans@cumin2002 START - Cookbook sre.dns.netbox [20:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] (PS1) Ryan Kemper: elastic: don't mutate keystore group [puppet] - https://gerrit.wikimedia.org/r/809276 (https://phabricator.wikimedia.org/T309648) [20:21:16] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36098/console" [puppet] - https://gerrit.wikimedia.org/r/809276 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:21:37] (CR) Clare Ming: [C: +2] Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/808071 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:21:39] (Merged) jenkins-bot: Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [skins/Vector] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/808068 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:21:52] (PS1) Jdlrobson: Prevent skinStyles from applying to the Vector 2022 skin. 
[extensions/VisualEditor] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/809249 (https://phabricator.wikimedia.org/T310197) [20:21:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30588 and previous config saved to /var/cache/conftool/dbconfig/20220628-202156-ladsgroup.json [20:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] (CR) Clare Ming: [C: +2] Prevent skinStyles from applying to the Vector 2022 skin. [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/809245 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:23:18] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:18] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf2002.codfw.wmnet [20:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:26] SRE, Performance-Team, Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2002 for hosts: `webperf2002.codfw.wmnet` - webperf2002.codfw.wmnet (**PASS**) - Downtimed host on Ici... 
[20:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] (PS4) Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) [20:24:47] (PS2) Ryan Kemper: elastic: don't mutate keystore group [puppet] - https://gerrit.wikimedia.org/r/809276 (https://phabricator.wikimedia.org/T309648) [20:24:50] (CR) CI reject: [V: -1] Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [20:25:04] (CR) Ayounsi: [C: +1] sre.hosts.decommission: fix call to ganeti sync [cookbooks] - https://gerrit.wikimedia.org/r/809275 (owner: Volans) [20:25:26] (PS5) Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) [20:25:29] (CR) Ryan Kemper: "Before this patch, restarting elasticsearch failed due to the wrong group permissions on the keystore file. We're hoping this fixes it." 
[puppet] - https://gerrit.wikimedia.org/r/809276 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:25:37] (CR) Ryan Kemper: [V: +2 C: +2] elastic: don't mutate keystore group [puppet] - https://gerrit.wikimedia.org/r/809276 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:25:44] (CR) Volans: [C: +2] "tested on cumin2002, we're now passing the new cluster group short name already" [cookbooks] - https://gerrit.wikimedia.org/r/809275 (owner: Volans) [20:26:13] (CR) Jdlrobson: [C: -1] "ergg something went wrong in that edit" [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [20:26:26] (PS2) Ahmon Dancy: scap: Make scap3 provider packages depend on /usr/bin/scap [puppet] - https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) [20:27:30] (CR) Volans: [C: +2] sre.hosts.decommission: fix call to ganeti sync [cookbooks] - https://gerrit.wikimedia.org/r/809275 (owner: Volans) [20:27:52] (PS6) Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) [20:28:58] (Merged) jenkins-bot: sre.hosts.decommission: fix call to ganeti sync [cookbooks] - https://gerrit.wikimedia.org/r/809275 (owner: Volans) [20:29:08] (CR) CI reject: [V: -1] scap: Make scap3 provider packages depend on /usr/bin/scap [puppet] - https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: Ahmon Dancy) [20:29:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:42] (CR) Jdlrobson: Enable title above tabs on group 1 and group 0 wikis (1/2) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 
(https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [20:29:59] (PS8) BryanDavis: striker: Add profile to provision docker container [puppet] - https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [20:30:08] (PS1) Andrew Bogott: Add fake striker secrets for codfw1dev to cloudweb.yaml [labs/private] - https://gerrit.wikimedia.org/r/809277 [20:30:10] (CR) Dzahn: [V: +1 C: +2] "thank you very much!. yes, this fixed it https://phabricator.wikimedia.org/T311241#8034714" [puppet] - https://gerrit.wikimedia.org/r/809085 (https://phabricator.wikimedia.org/T311241) (owner: Majavah) [20:31:32] !log volans@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cuminunpriv1001.eqiad.wmnet [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:57] !log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] [20:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:55] (CR) Andrew Bogott: [V: +2 C: +2] Add fake striker secrets for codfw1dev to cloudweb.yaml [labs/private] - https://gerrit.wikimedia.org/r/809277 (owner: Andrew Bogott) [20:35:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:35:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30589 and previous config saved to /var/cache/conftool/dbconfig/20220628-203701-ladsgroup.json [20:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:29] !log volans@cumin2002 END (PASS) - 
Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cuminunpriv1001.eqiad.wmnet [20:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:35] (Merged) jenkins-bot: Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/808071 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:39:56] (CR) Andrew Bogott: [C: +2] striker: Add profile to provision docker container [puppet] - https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis) [20:40:11] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:41:29] (CR) Clare Ming: [C: +2] Prevent skinStyles from applying to the Vector 2022 skin. [extensions/VisualEditor] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/809249 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:42:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:13] (PS1) Ryan Kemper: elasic: grep path differs btw OS vers [puppet] - https://gerrit.wikimedia.org/r/809282 (https://phabricator.wikimedia.org/T309648) [20:44:27] (Merged) jenkins-bot: Prevent skinStyles from applying to the Vector 2022 skin. 
[extensions/VisualEditor] (wmf/1.39.0-wmf.17) - https://gerrit.wikimedia.org/r/809245 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:44:36] !log volans@cumin2002 START - Cookbook sre.ganeti.makevm for new host sretest2001.codfw.wmnet [20:44:37] !log volans@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host sretest2001.codfw.wmnet [20:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:55] !log volans@cumin2002 START - Cookbook sre.ganeti.makevm for new host sretest2001.codfw.wmnet [20:45:55] !log volans@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host sretest2001.codfw.wmnet [20:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:04] (CR) Ryan Kemper: [C: +2] elasic: grep path differs btw OS vers [puppet] - https://gerrit.wikimedia.org/r/809282 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:46:55] Jdlrobson: wmf.17 patches are up on mwdebug1002 if you can test -- still waiting for your wmf.18 patch to merge [20:47:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:24] !log volans@cumin2002 START - Cookbook sre.ganeti.makevm for new host sretest2001.codfw.wmnet [20:48:25] !log volans@cumin2002 START - Cookbook 
sre.dns.netbox [20:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:43] SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (Jclark-ctr) a:Jclark-ctr→AndrewBonamici [20:52:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298560)', diff saved to https://phabricator.wikimedia.org/P30590 and previous config saved to /var/cache/conftool/dbconfig/20220628-205206-ladsgroup.json [20:52:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [20:52:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [20:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:13] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298560)', diff saved to https://phabricator.wikimedia.org/P30591 and previous config saved to /var/cache/conftool/dbconfig/20220628-205220-ladsgroup.json [20:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:25] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:52:25] !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache sretest2001.codfw.wmnet on all recursors [20:52:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:28] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2001.codfw.wmnet on all recursors [20:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:52] !log mforns@deploy1002 deploy aborted: Regular analytics weekly train [analytics/refinery@2f5987d] (duration: 21m 55s) [20:53:54] (PS1) Ryan Kemper: elastic: grep path differs btw OS vers [puppet] - https://gerrit.wikimedia.org/r/809284 (https://phabricator.wikimedia.org/T309648) [20:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:06] (PS1) Andrew Bogott: cloudweb: include ::profile::docker::ferm so docker can mess with iptables [puppet] - https://gerrit.wikimedia.org/r/809285 (https://phabricator.wikimedia.org/T306469) [20:54:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:54:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:55] SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Jclark-ctr) @Marostegui Dimm has arrived [20:55:03] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36101/console" [puppet] - https://gerrit.wikimedia.org/r/809284 (https://phabricator.wikimedia.org/T309648) (owner: 
Ryan Kemper) [20:55:26] (CR) Ryan Kemper: [V: +1 C: +2] elastic: grep path differs btw OS vers [puppet] - https://gerrit.wikimedia.org/r/809284 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:55:47] !log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] [20:55:48] (CR) Andrew Bogott: [C: +2] cloudweb: include ::profile::docker::ferm so docker can mess with iptables [puppet] - https://gerrit.wikimedia.org/r/809285 (https://phabricator.wikimedia.org/T306469) (owner: Andrew Bogott) [20:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:00] (CR) BryanDavis: [C: +1] cloudweb: include ::profile::docker::ferm so docker can mess with iptables [puppet] - https://gerrit.wikimedia.org/r/809285 (https://phabricator.wikimedia.org/T306469) (owner: Andrew Bogott) [20:56:13] !log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] (duration: 00m 26s) [20:56:14] (PS1) Volans: sre.ganeti.makevm: fix VLAN name generation [cookbooks] - https://gerrit.wikimedia.org/r/809286 [20:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:42] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/skins/Vector/includes/templates/skin.mustache: Backport: [[gerrit:808068|Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` (T310197)]] (duration: 03m 33s) [20:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:47] T310197: Move editing toolbar below page toolbar - https://phabricator.wikimedia.org/T310197 [20:56:55] (Merged) jenkins-bot: Prevent skinStyles from applying to the Vector 2022 skin. 
[extensions/VisualEditor] (wmf/1.39.0-wmf.18) - https://gerrit.wikimedia.org/r/809249 (https://phabricator.wikimedia.org/T310197) (owner: Jdlrobson) [20:57:24] !log volans@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host sretest2001.codfw.wmnet [20:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:34] (PS1) Ryan Kemper: elastic: delegate echo loc to PATH [puppet] - https://gerrit.wikimedia.org/r/809287 (https://phabricator.wikimedia.org/T309648) [20:57:59] (CR) Ryan Kemper: [V: +2 C: +2] elastic: delegate echo loc to PATH [puppet] - https://gerrit.wikimedia.org/r/809287 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [20:59:07] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:59:20] !log volans@cumin2002 START - Cookbook sre.hosts.decommission for hosts sretest2001.codfw.wmnet [20:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:45] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/VisualEditor/modules: Backport: [[gerrit:808071|Rename `data-ve-target-container` attribute to `data-mw-ve-target-container` (T310197)]] (duration: 03m 33s) [21:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:35] !log deployed spicerack 3.0.0 to cumin1001 [21:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:04] !log volans@cumin2002 START - Cookbook sre.dns.netbox [21:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:20] (CR) 
Clare Ming: [C: +2] Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [21:03:59] (CR) Volans: [C: +2] sre.ganeti.makevm: fix VLAN name generation [cookbooks] - https://gerrit.wikimedia.org/r/809286 (owner: Volans) [21:04:22] (Merged) jenkins-bot: Enable title above tabs on group 1 and group 0 wikis (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/808056 (https://phabricator.wikimedia.org/T310054) (owner: Jdlrobson) [21:04:35] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/VisualEditor: Backport: [[gerrit:809245|Prevent skinStyles from applying to the Vector 2022 skin. (T310197)]] (duration: 03m 27s) [21:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:40] T310197: Move editing toolbar below page toolbar - https://phabricator.wikimedia.org/T310197 [21:05:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Jclark-ctr) cloudcontrol1006 B2 U11 Cableid 20220204 Port 28 cloudcontrol1007 D2 U1 Cableid 20220203 Por... 
[21:05:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:03] (Merged) jenkins-bot: sre.ganeti.makevm: fix VLAN name generation [cookbooks] - https://gerrit.wikimedia.org/r/809286 (owner: Volans) [21:07:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30592 and previous config saved to /var/cache/conftool/dbconfig/20220628-210725-ladsgroup.json [21:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:33] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:07:34] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest2001.codfw.wmnet [21:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye [21:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:47] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloud... 
[21:12:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:50] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/VisualEditor: Backport: [[gerrit:809249|Prevent skinStyles from applying to the Vector 2022 skin. (T310197)]] (duration: 03m 33s) [21:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:55] T310197: Move editing toolbar below page toolbar - https://phabricator.wikimedia.org/T310197 [21:13:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:36] Jdlrobson: per DM, 3 patches on wmf.17 are live, 1 patch on wmf.18 is live, and syncing your config patch now - will let you know when that one is finished syncing [21:16:41] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/VisualEditor: Backport: [[gerrit:809249|Prevent skinStyles from applying to the Vector 2022 skin. 
(T310197)]] (duration: 03m 38s) [21:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:19:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:28] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Jclark-ctr) [21:20:44] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:808056|Enable title above tabs on group 1 and group 0 wikis (1/2) (T310054)]] (duration: 03m 34s) [21:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:49] T310054: Deploy new toolbar order - https://phabricator.wikimedia.org/T310054 [21:20:56] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [21:21:04] Jdlrobson: alrighty - all your patches should be live now [21:22:14] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Jclark-ctr) [21:22:25] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [21:22:31] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30594 and previous config saved to /var/cache/conftool/dbconfig/20220628-212230-ladsgroup.json [21:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:25] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Jclark-ctr) [21:25:04] !log end of UTC late backport window [21:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:35] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-psi-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:30] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Jclark-ctr) @Ottomata @wiki_willy We are at full capacity in 10g racks in a,b,c,d Taking into account kafka-jumbo100[6-9] 3 are in 1 row already. my proposal is Server... [21:30:32] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddb2002-dev.codfw.wmnet with OS bullseye [21:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:38] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host clouddb20... 
[21:31:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye [21:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:06] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloud... [21:32:09] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:23] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:26] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Jclark-ctr) [21:37:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298560)', diff saved to https://phabricator.wikimedia.org/P30595 and previous config saved to /var/cache/conftool/dbconfig/20220628-213735-ladsgroup.json [21:37:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [21:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:42] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [21:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [21:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P30596 and previous config saved to /var/cache/conftool/dbconfig/20220628-213806-ladsgroup.json [21:38:09] (PS1) Ahmon Dancy: Setup .gitconfig for mwpresync system user [puppet] - https://gerrit.wikimedia.org/r/809297 (https://phabricator.wikimedia.org/T303857) [21:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:11] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:39] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Ottomata) Hi, I think that is fine. Spread over at least 3 rows is pretty good. 1007 is C and 1008 and 1009 are D. [21:57:17] (PS1) Dzahn: alertmanager: change phab project for automated tasks for serviceops [puppet] - https://gerrit.wikimedia.org/r/809300 [22:01:17] (PS1) Cwhite: loki: add loki as an optional grafana component [puppet] - https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) [22:01:42] (CR) Dzahn: [C: +2] alertmanager: change phab project for automated tasks for serviceops [puppet] - https://gerrit.wikimedia.org/r/809300 (owner: Dzahn) [22:03:30] (PS2) Ahmon Dancy: Setup .gitconfig for mwpresync system user [puppet] - https://gerrit.wikimedia.org/r/809297 (https://phabricator.wikimedia.org/T303857) [22:04:22] (PS2) Cwhite: loki: add loki as an optional grafana component [puppet] - https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) [22:04:57] (PS1) Dzahn: alertmanager: replace email address for service-ops-collab notifications [puppet] - https://gerrit.wikimedia.org/r/809303 [22:05:51] PROBLEM - k8s API server requests latencies on 
kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:05:54] (PS1) Papaul: Replae labs-hosts1-b-codfw with cloud-hosts1-b [puppet] - https://gerrit.wikimedia.org/r/809304 (https://phabricator.wikimedia.org/T306854) [22:06:09] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36103/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/809297 (https://phabricator.wikimedia.org/T303857) (owner: Ahmon Dancy) [22:07:17] (CR) Dzahn: [C: +2] alertmanager: replace email address for service-ops-collab notifications [puppet] - https://gerrit.wikimedia.org/r/809303 (owner: Dzahn) [22:07:54] (CR) Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1002/36104/" [puppet] - https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [22:08:45] (CR) Papaul: [C: +2] Replae labs-hosts1-b-codfw with cloud-hosts1-b [puppet] - https://gerrit.wikimedia.org/r/809304 (https://phabricator.wikimedia.org/T306854) (owner: Papaul) [22:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:11:26] (CR) Dzahn: [C: +2] "I change the email and phab tag to our new service-collab version. Now merging this to just test it in production." 
[puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [22:13:48] (03CR) 10Cwhite: [C: 03+2] logstash: duplicate scap.announce logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [22:19:47] (03PS1) 10BryanDavis: striker: require ::profile::docker::ferm in ::profile::wmcs::striker::docker [puppet] - 10https://gerrit.wikimedia.org/r/809306 (https://phabricator.wikimedia.org/T306469) [22:20:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb2002-dev.codfw.wmnet with OS bullseye [22:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2... [22:22:25] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/36105/" [puppet] - 10https://gerrit.wikimedia.org/r/809306 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [22:23:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) @ayounsi On clouddb2002 i was getting the error mesage below ` Failed... 
[22:27:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye [22:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cu... [22:28:32] (03CR) 10Legoktm: [C: 03+1] "Ready to do now :(" [puppet] - 10https://gerrit.wikimedia.org/r/806489 (owner: 10Legoktm) [22:29:51] (03CR) 10RLazarus: [C: 03+2] admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/806489 (owner: 10Legoktm) [22:31:47] (03PS2) 10RLazarus: admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/806489 (owner: 10Legoktm) [22:32:13] (03PS1) 10Jdlrobson: Do not grey out page title while loading on Vector 2022 [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809308 (https://phabricator.wikimedia.org/T310839) [22:35:57] is it ok to backport a patch to wmf.17 now? 
[22:36:35] (03CR) 10Andrew Bogott: [C: 03+2] striker: require ::profile::docker::ferm in ::profile::wmcs::striker::docker [puppet] - 10https://gerrit.wikimedia.org/r/809306 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [22:39:05] cjming: sure, if (or someone) is around to test/watch logs [22:39:35] thcipriani: thanks -- then I will go ahead [22:40:09] thanks thcipriani [22:47:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage [22:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:56] @cjming do you need me to test or are you able to handle this one? [22:52:22] Jdlrobson: i can cover -- it's just the font size right? [22:52:45] of the icons [22:52:59] (03CR) 10Clare Ming: [C: 03+2] Do not grey out page title while loading on Vector 2022 [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809308 (https://phabricator.wikimedia.org/T310839) (owner: 10Jdlrobson) [22:54:46] yeh if the font-size on https://en.wikipedia.org/wiki/Giant_panda?veaction=edit looks correct you are golden to sync [22:55:16] cool - i can take it from here [22:55:23] Thank you! <3 [22:55:29] np! 
[22:58:31] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:18] PROBLEM - Disk space on labweb1002 is CRITICAL: DISK CRITICAL - /run/docker/netns/5419ec8186b9 is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops [23:02:20] PROBLEM - Disk space on labweb1001 is CRITICAL: DISK CRITICAL - /run/docker/netns/240ccca59011 is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1001&var-datasource=eqiad+prometheus/ops [23:13:10] (03Merged) 10jenkins-bot: Do not grey out page title while loading on Vector 2022 [extensions/VisualEditor] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809308 (https://phabricator.wikimedia.org/T310839) (owner: 10Jdlrobson) [23:13:33] andrewbogott: those icinga alerts on labweb are kind of what we used to get whenever we added a docker network to a host. looks like that might be the case here [23:14:39] obviously not a real issue but if you wanted them off it's 'profile::monitoring::nrpe_check_disk_options' in Hiera that mentions --exclude-type=tracefs [23:17:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye [23:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:37] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host clouddb20... 
[23:18:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:19:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:19:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS bullseye [23:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:37] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/VisualEditor/modules/ve-mw/preinit: Backport: [[gerrit:809308|Do not grey out page title while loading on Vector 2022 (T310839)]] (duration: 03m 28s) [23:20:37] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloud... 
[23:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:42] T310839: Do not grey out page title when in edit mode (Vector 2022) - https://phabricator.wikimedia.org/T310839 [23:20:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:28:22] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) [23:39:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [23:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:08] mutante: that's for sure what's happening. I'll look at profile::monitoring::nrpe_check_disk_options [23:41:48] andrewbogott: ACK! they are fairly long but copying from something else should work. should be that "tracefs" part [23:42:08] so like... [23:42:09] -w 10% -c 5% -W 6% -K 3% -l -e -A -i '/(var/lib|run)/(docker|kubelet)/*' --exclude-type=tracefs [23:42:11] ? 
[23:43:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [23:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:32] andrewbogott: yes, that looks good [23:43:44] based on what like the kubernetes servers use [23:45:04] (in our case it's just docker and not k8s but looks like that'll work for both) [23:46:44] (03PS1) 10Andrew Bogott: cloudweb/labweb: don't alert on full docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/809323 [23:46:44] andrewbogott: works [23:46:46] [labweb1001:/etc/nagios/nrpe.d] $ /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A -i '/(var/lib|run)/(docker|kubelet)/*' --exclude-type=tracefs [23:46:49] DISK OK| /dev=0MB;28929;30536;0;32144 /run=690MB;5788;6110;0;6432 /=8033MB;67122;70851;0;74580 /dev/shm=0MB;28945;30553;0;32162 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;28945;30553;0;32162 /srv=24585MB;606333;640018;0;673704 /run/user/2093=0MB;5788;6110;0;6432 /run/user/3518=0MB;5788;6110;0;6432 /run/user/2075=0MB;5788;6110;0;6432 [23:47:04] mutante: https://gerrit.wikimedia.org/r/c/operations/puppet/+/809323 <- [23:48:01] (03CR) 10Dzahn: [C: 03+1] "yea, that works. since it's an NRPE command it can run locally like:" [puppet] - 10https://gerrit.wikimedia.org/r/809323 (owner: 10Andrew Bogott) [23:48:35] (03CR) 10Dzahn: [V: 03+1 C: 03+1] cloudweb/labweb: don't alert on full docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/809323 (owner: 10Andrew Bogott) [23:48:48] (03PS2) 10Cwhite: beta-logs: change opensearch version to 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802863 (https://phabricator.wikimedia.org/T304440) [23:49:08] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb/labweb: don't alert on full docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/809323 (owner: 10Andrew Bogott) [23:49:32] thank you for the quick fix mutante ! 
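The `DISK OK|` line pasted at 23:46:49 carries plugin performance data after the `|`, in the usual Nagios `label=value[UOM];warn;crit;min;max` form. A minimal Python sketch for pulling the per-mount numbers out of such a line, assuming labels contain no spaces and thresholds are plain numbers (range-style thresholds such as `10:20` come back as `None`). The names `PerfDatum` and `parse_perfdata` are hypothetical and not part of any tool mentioned in the log.

```python
import re
from typing import Dict, NamedTuple, Optional


class PerfDatum(NamedTuple):
    """One perfdata entry: value plus optional thresholds and bounds."""
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]
    minimum: Optional[float]
    maximum: Optional[float]


# Leading number followed by an optional unit of measure (e.g. "690MB").
_VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def _to_float(field: str) -> Optional[float]:
    """Return the field as a float, or None if it is empty or not a plain number."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_perfdata(plugin_output: str) -> Dict[str, PerfDatum]:
    """Parse the part after '|' into {label: PerfDatum}.

    Assumption: labels have no embedded spaces (true for the mount
    points in the pasted output); quoted labels are not handled.
    """
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return {}
    result: Dict[str, PerfDatum] = {}
    for token in perf.split():
        label, eq, rest = token.rpartition("=")
        if not eq:
            continue  # not a label=value token
        fields = rest.split(";")
        match = _VALUE_RE.match(fields[0])
        if match is None:
            continue  # value is not numeric, skip it
        fields += [""] * (5 - len(fields))  # pad missing warn/crit/min/max
        result[label] = PerfDatum(
            float(match.group(1)),
            match.group(2),
            *(_to_float(f) for f in fields[1:5]),
        )
    return result


if __name__ == "__main__":
    # Shortened copy of the output pasted in the log.
    sample = "DISK OK| /dev=0MB;28929;30536;0;32144 /run=690MB;5788;6110;0;6432"
    for mount, datum in parse_perfdata(sample).items():
        print(mount, datum.value, datum.unit, datum.maximum)
```

Something like this can help when comparing used space against the warn and crit columns by hand, for example to confirm which mount was behind an alert before and after changing the check options.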
[23:49:56] yw, it will need puppet run on labweb* and after that on alert* and then 5 minutes, but I'm sure it will resolve [23:53:34] (03PS1) 10Dzahn: mediawiki/appservers: redirect policy and related sites to wikimediafoundation.org/advocacy/ [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) [23:55:07] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 3 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) >>! In T310738#8030481, @LSobanski wrote: > @Varnent could we get a clarification of the timeline for this request? The... [23:55:15] (03PS2) 10Dzahn: mediawiki/appservers: redirect policy and related sites to wikimediafoundation.org/advocacy/ [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) [23:55:28] I'm going back to ignoring my screen, will check later to be sure it resolved. [23:55:37] (03PS6) 10Krinkle: Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [23:56:15] (03CR) 10CI reject: [V: 04-1] Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [23:57:01] (03PS2) 10Dzahn: switch policy.wikimedia.org back from Wordpress to WMF DNS [dns] - 10https://gerrit.wikimedia.org/r/808309 (https://phabricator.wikimedia.org/T310738) [23:57:21] (03PS1) 10Krinkle: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) [23:57:38] (03CR) 10Dzahn: "will follow-up with httpbb tests change to be merged first" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 
10Dzahn) [23:58:20] (03CR) 10CI reject: [V: 04-1] [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: 10Krinkle) [23:58:38] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) [23:58:58] (03CR) 10CI reject: [V: 04-1] mediawiki/appservers: redirect policy and related sites to wikimediafoundation.org/advocacy/ [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [23:59:08] (03PS7) 10Krinkle: Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [23:59:10] (03PS2) 10Krinkle: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) [23:59:14] (03CR) 10Krinkle: [C: 03+1] Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [23:59:18] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:29] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8033789, @LSobanski wrote: > @Varnent After chatting about this some more, how about we do the following:...